Open liang-hou opened 4 years ago
Hi @houliangict, yes, you are right that there are differences in how the field computes FID these days. Recall that since the goal of GANs in this case is to model the training distribution, evaluating on the training data is seen as a way to measure how well the GAN models it. There are other variants, like using the test data for evaluation instead (if I recall correctly, particularly for works from Google), and some works propose using both the "train" and "test" splits together for training/evaluation, but I can't say for sure what the recommended way is, since there doesn't seem to be a consensus in the field.
Regardless, I think it is important to be consistent in evaluation when comparing different models. From some initial experience running FID on the test set (e.g. CIFAR-10 train vs. test), I noticed the scores are very comparable.
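For reference, the comparison above boils down to the standard Fréchet distance between Gaussians fitted to Inception features from each split. A minimal sketch of that computation, assuming the pooled features are already extracted (all names here are illustrative, not from this repo):

```python
import numpy as np
from scipy import linalg


def compute_fid(feats_a, feats_b, eps=1e-6):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (N, D), e.g. 2048-dim Inception pool3 features.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    diff = mu_a - mu_b

    # Matrix square root of the covariance product; regularize if it is singular.
    covmean, _ = linalg.sqrtm(cov_a.dot(cov_b), disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_a.shape[0]) * eps
        covmean = linalg.sqrtm((cov_a + offset).dot(cov_b + offset))

    # Discard tiny imaginary components introduced by numerical error.
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff.dot(diff) + np.trace(cov_a) + np.trace(cov_b) - 2 * np.trace(covmean)
```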
That said, I think your suggestions are great and I will certainly look into them! Hope this helps!
Hi, thanks for the excellent work. However, I noticed that the test set is not used for evaluation, because you don't specify the `split` argument in the metrics functions, so they load the train set for computing FID by default. Many GAN papers, including SSGAN, split the dataset into a train set and a test set, and I guess they evaluated their models on the test set.

Another minor concern is that the Inception model is downloaded and stored separately for each experiment, which is a waste of time and storage. It would be better to drop `log_dir` from the `inception_path`; then the Inception model would be cached and reused across all experiments.
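For illustration, here is roughly what the two suggestions could look like. This is a hypothetical sketch: the `split` and `inception_path` names follow the wording above and may not match the library's actual signature; torchvision's own train/test handling is shown only for comparison.

```python
import os

import torchvision

# 1) Evaluate on the held-out split: with torchvision this is just train=False.
test_set = torchvision.datasets.CIFAR10(
    root="./datasets/cifar10", train=False, download=True)

# 2) Cache the Inception model once, outside any per-experiment log_dir,
#    so every run reuses the same download instead of re-fetching it.
SHARED_INCEPTION_DIR = os.path.expanduser("~/.cache/inception_model")  # illustrative path
os.makedirs(SHARED_INCEPTION_DIR, exist_ok=True)

# metrics.fid_score(..., split="test", inception_path=SHARED_INCEPTION_DIR)
#   ^ hypothetical call showing the proposed interface, not the current API.
```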