Reproducing CIFAR10 FID Scores

kwotsin commented 5 years ago

Hi @takerum, thanks for providing the code for the papers. I have several questions regarding reproducing the FID score of CIFAR10 in your cgan paper (page 14, table 3):

Is the FID score produced using the full 50k training data statistics in CIFAR10, as found in the chainer-gan-lib repository (https://github.com/pfnet-research/chainer-gan-lib/blob/master/common/cifar-10-fid.npz)? Otherwise, how many real and fake images are being used for computing the score in table 3?
Is the FID score computed using the function in evaluation.py (https://github.com/pfnet-research/sngan_projection/blob/master/evaluation.py#L220)? I ask because it points to the statistics file that is used in the chainer-gan-lib repository.
Alternatively, I have found a relevant issue (https://github.com/pfnet-research/sngan_projection/issues/34) that shares FID statistics, but those seem to be for intra-FID computation and for ImageNet only. Do you have the CIFAR10 statistics used in the experiments that you could share (or are they the same as the statistics from chainer-gan-lib)?

I have computed the FID score for your pre-trained models (conditional/unconditional CIFAR10) in 4 variations, choosing the code from [your_fid_computation_code, official_TTUR_fid_code] and the statistics from [FID_stats_in_chainer_gan_lib, official_TTUR_fid_train_stats], and found that the scores differ by quite a few points, which might be due to the statistics or code being used. It would therefore be very helpful if you could clarify the questions above.

If I have read your papers correctly, I believe SNGAN (unconditional) for CIFAR10 has scores computed using 10k-5k FID (from appendix B.1) -- is there any supporting code in the repository for reproducing these results?

Thank you for the help.

abdulfatir commented 4 years ago

Hey @kwotsin

Were you able to reproduce the FID score reported (21.7) in the SNGAN paper for CIFAR10? I have been trying to do that for the last couple of days but I am unable to get a value close to it. I am getting values between 14 and 18, depending on the number of real and fake samples I use to compute the score. Note that I am not using a re-implementation of SNGAN, but am just generating samples using the weights this repo provides.

By the way, awesome work with the Mimicry repo!

kwotsin commented 4 years ago

Unfortunately, I was not able to reproduce exactly the 21.7 score, and I think that is expected because FID produces highly biased estimates depending on the number of samples. I would expect that at the 10k-5k configuration for the number of real/fake samples, the variance of the scores is higher. Using more samples would give a result with less variance (see https://arxiv.org/abs/1801.01401).
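
To illustrate the sample-size dependence, here is a minimal sketch (not the code from this repo or from TTUR). It assumes low-dimensional synthetic Gaussian "features" in place of real Inception activations, and the helper names are hypothetical. Both sets are drawn from the same distribution, so the true Fréchet distance is 0, yet the estimate is positive and shrinks as the sample count grows.

```python
import numpy as np


def gaussian_stats(features):
    """Mean and covariance of a (num_samples, dim) feature array."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # tr(sqrt(sigma1 @ sigma2)) equals the sum of the square roots of the
    # eigenvalues of sigma1 @ sigma2, which are real and non-negative for
    # positive semi-definite inputs (clip guards against round-off).
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)


def fid_vs_sample_count(dim=16, sizes=(50, 500, 5000), seed=0):
    """Estimate the distance between two sets from the SAME distribution.

    Assumption: synthetic standard-normal features stand in for Inception
    activations, so only the trend with sample count is meaningful.
    """
    rng = np.random.default_rng(seed)
    results = {}
    for n in sizes:
        real = rng.standard_normal((n, dim))
        fake = rng.standard_normal((n, dim))
        results[n] = frechet_distance(*gaussian_stats(real), *gaussian_stats(fake))
    return results


if __name__ == "__main__":
    for n, score in fid_vs_sample_count().items():
        print(f"n={n:5d}  estimated distance={score:.4f}")
```

With real 2048-dimensional Inception features the effect is stronger, which is one reason scores computed with different real/fake sample counts are not directly comparable.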

Values of around 14-18 are possible if you use a larger number of samples, but it is important to use the same number of samples when evaluating all GANs, especially since different papers use different configurations, which can lead to inflated results. In practice, I find that KID gives results that are more consistent across different runs.
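
For reference, a minimal sketch of the KID estimator, assuming the cubic polynomial kernel k(x, y) = (x·y/d + 1)^3 from the paper linked above. The function names and the synthetic features are hypothetical stand-ins, not code from this repo or from Mimicry. Unlike FID, this squared-MMD estimate is unbiased, so it hovers around 0 (and can be slightly negative) when both sets come from the same distribution.

```python
import numpy as np


def polynomial_kernel(x, y):
    """Cubic polynomial kernel k(x, y) = (x . y / d + 1)^3."""
    d = x.shape[1]
    return (x @ y.T / d + 1.0) ** 3


def kid_unbiased(real, fake):
    """Unbiased squared-MMD estimate between two (num_samples, dim) arrays."""
    m, n = real.shape[0], fake.shape[0]
    k_rr = polynomial_kernel(real, real)
    k_ff = polynomial_kernel(fake, fake)
    k_rf = polynomial_kernel(real, fake)
    # Drop the diagonal terms k(x_i, x_i) so the within-set averages are unbiased.
    term_rr = (k_rr.sum() - np.trace(k_rr)) / (m * (m - 1))
    term_ff = (k_ff.sum() - np.trace(k_ff)) / (n * (n - 1))
    return float(term_rr + term_ff - 2.0 * k_rf.mean())


if __name__ == "__main__":
    # Assumption: synthetic features stand in for Inception activations.
    rng = np.random.default_rng(0)
    real = rng.standard_normal((500, 16))
    same = rng.standard_normal((500, 16))
    shifted = rng.standard_normal((500, 16)) + 1.0
    print("same distribution:   ", kid_unbiased(real, same))
    print("shifted distribution:", kid_unbiased(real, shifted))
```

In practice KID is usually reported as the mean over several random subsets of the features, which also gives an error bar.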

abdulfatir commented 4 years ago

Thanks for your reply. I am aware of the variation of the FID score with the number of samples used. I just wanted to check if there's a specific configuration to reproduce a value near 21.7. Thank you for the practical insight about KID.

Cheers!

pfnet-research / sngan_projection

Reproducing CIFAR10 FID Scores #54