Open liang-hou opened 4 years ago
Hi @houliangict, yes, you are right that there are differences in how the field computes FID these days. Recall that since the goal of GANs in this case is to model the training distribution, evaluating on the training data is seen as a way to measure how well the GAN models it. There are other variants, like using the test data for evaluation instead (if I recall correctly, particularly for works from Google), and some works propose using both the "train" and "test" splits together for training/evaluation, but I can't say for sure what the recommended way is, since there doesn't seem to be a consensus in the field.
Regardless, I think it is important to be consistent in evaluation when comparing different models. From some initial experience running FID on the test set (e.g. CIFAR-10 train vs. test), I noticed the scores are very comparable.
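For reference, the comparison above boils down to the standard Fréchet distance between Gaussians fitted to Inception features from each split. A minimal sketch of that computation, assuming the pooled features are already extracted (all names here are illustrative, not from this repo):

```python
import numpy as np
from scipy import linalg


def compute_fid(feats_a, feats_b, eps=1e-6):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (N, D), e.g. 2048-dim Inception pool3 features.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    diff = mu_a - mu_b

    # Matrix square root of the covariance product; regularize if it is singular.
    covmean, _ = linalg.sqrtm(cov_a.dot(cov_b), disp=False)
    if not np.isfinite(covmean).all():
        offset = np.eye(cov_a.shape[0]) * eps
        covmean = linalg.sqrtm((cov_a + offset).dot(cov_b + offset))

    # Discard tiny imaginary components introduced by numerical error.
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff.dot(diff) + np.trace(cov_a) + np.trace(cov_b) - 2 * np.trace(covmean)
```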
That said, I think your suggestions are great and I will certainly look into them! Hope this helps!
Hi, thanks for the excellent work. However, I noticed that the test set is not used for evaluation, because you don't specify the `split` argument in the metrics functions, so they load the train set for computing FID by default. Many GAN papers, including SSGAN, split the dataset into a train set and a test set, and I guess they evaluated their models on the test set.

Another minor concern is that the Inception model is downloaded and stored separately for each experiment, which is a waste of time and storage. It would be better to drop `log_dir` from the `inception_path`; then the Inception model would be cached and reused across all experiments.
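For illustration, here is roughly what the two suggestions could look like. This is a hypothetical sketch: the `split` and `inception_path` names follow the wording above and may not match the library's actual signature; torchvision's own train/test handling is shown only for comparison.

```python
import os

import torchvision

# 1) Evaluate on the held-out split: with torchvision this is just train=False.
test_set = torchvision.datasets.CIFAR10(
    root="./datasets/cifar10", train=False, download=True)

# 2) Cache the Inception model once, outside any per-experiment log_dir,
#    so every run reuses the same download instead of re-fetching it.
SHARED_INCEPTION_DIR = os.path.expanduser("~/.cache/inception_model")  # illustrative path
os.makedirs(SHARED_INCEPTION_DIR, exist_ok=True)

# metrics.fid_score(..., split="test", inception_path=SHARED_INCEPTION_DIR)
#   ^ hypothetical call showing the proposed interface, not the current API.
```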