kwotsin / mimicry

[CVPR 2020 Workshop] A PyTorch GAN library that reproduces research results for popular GANs.
MIT License

Evaluation on the test set #33

Open liang-hou opened 4 years ago

liang-hou commented 4 years ago

Hi, thanks for the excellent work. However, I noticed that the test set cannot be used for evaluation, because the split argument is not specified in the metrics functions, so they load the train set by default when computing FID. Many GAN papers, including SSGAN, split the dataset into a train set and a test set, and I believe they evaluated their models on the test set.
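A minimal sketch of what selecting the test split would look like, using torchvision directly rather than mimicry's actual API (the idea of forwarding a `split`/`train` choice into the metrics functions is the suggestion here, not an existing option):

```python
# Sketch (not mimicry's actual API): load the CIFAR-10 *test* split with
# torchvision so FID statistics can be computed against held-out images.
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([transforms.ToTensor()])

# train=False selects the 10k-image test split; train=True (the implicit
# default in the current pipeline) would compute FID statistics against the
# 50k training images instead.
test_set = torchvision.datasets.CIFAR10(
    root='./datasets/cifar10',
    train=False,
    download=True,
    transform=transform)

# The real images from this split would then be passed to the FID computation,
# e.g. via a hypothetical split argument exposed by the metrics functions.
```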

Another minor concern is that the inception model is downloaded and stored separately for each experiment, which wastes time and storage. It would be better to drop log_dir from inception_path so that the inception model is cached once and reused across all experiments.
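A rough sketch of the caching idea (the paths and names below are illustrative assumptions, not mimicry's actual defaults):

```python
# Keep the Inception checkpoint in one shared directory instead of under each
# experiment's log_dir, so it is downloaded only once.
import os

# Per-experiment layout (what is described above): re-downloaded every run.
# inception_path = os.path.join(log_dir, 'metrics', 'inception_model')

# Shared layout: every experiment points at the same cache directory.
INCEPTION_CACHE = os.path.expanduser('~/.cache/mimicry/inception_model')
os.makedirs(INCEPTION_CACHE, exist_ok=True)

# Any experiment can then reuse the cached checkpoint if it already exists,
# skipping the download step entirely.
```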

kwotsin commented 3 years ago

Hi @houliangict, yes, you are right that there are differences in how the field computes FID these days. Since the goal of a GAN in this setting is to model the training distribution, evaluating against the training data is seen as a way of measuring how well the GAN models that distribution. There are other variants, such as evaluating against the test data instead (if I recall correctly, particularly in works from Google), and some works propose using both the "train" and "test" splits together for training and evaluation, but I can't say for sure which way is recommended since there doesn't seem to be a consensus in the field.

Regardless, I think the important thing is to be consistent in the evaluation protocol when comparing different models. From some initial experience computing FID on the test set (e.g. CIFAR-10 train vs. test), I noticed the resulting scores are very comparable.
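For reference, one way to run this kind of train-vs-test comparison is sketched below; `sample_from_generator` and `compute_fid` are hypothetical placeholders for whichever sampling routine and FID implementation are being used, and the point is simply to score the same generated samples against both splits:

```python
# Sketch of a train-vs-test FID consistency check (helper names are hypothetical).
import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()
train_set = torchvision.datasets.CIFAR10('./datasets/cifar10', train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10('./datasets/cifar10', train=False,
                                        download=True, transform=transform)

# fake_images = sample_from_generator(netG, num_samples=10000)  # hypothetical
# fid_train = compute_fid(fake_images, train_set)               # hypothetical
# fid_test = compute_fid(fake_images, test_set)                 # hypothetical
# print(f'FID vs train split: {fid_train:.2f}, FID vs test split: {fid_test:.2f}')
```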

That said, I think your suggestions are great and I will certainly look into them! Hope this helps!