taldatech / soft-intro-vae-pytorch

[CVPR 2021 Oral] Official PyTorch implementation of Soft-IntroVAE from the paper "Soft-IntroVAE: Analyzing and Improving Introspective Variational Autoencoders"
Apache License 2.0

Potential Bugs in the FID Calc? #14

Closed: GloryyrolG closed this issue 2 years ago

GloryyrolG commented 2 years ago

Hi Daniel @taldatech ,

I found that in the CIFAR-10 experiment, the generation quality of the checkpoint you provided looks like this:

[image: grid of CIFAR-10 training, reconstructed, and generated samples]

The first two rows are training data; the next two rows are reconstructions; the last four rows are generated samples.

Using the arg --fid, I got a result similar to what the paper reports, 4.37. But the generation quality does not actually look that good, so I manually recomputed FID using the pytorch_fid repo and got 25.86, which I think may be more reasonable. I observed similar phenomena on other datasets such as CelebA. So I suspect there might be a bug in the FID calculation? Correct me if I'm wrong.
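Roughly, my external recomputation looked like the following sketch, assuming a recent version of the pytorch-fid package (the two folder paths are placeholders for generated PNGs and real CIFAR-10 images):

```python
# Minimal sketch of recomputing FID with mseitzer/pytorch-fid.
# Assumes a recent package version that exposes calculate_fid_given_paths;
# the directory paths are placeholders.
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["./generated_samples", "./cifar10_real"],
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,  # standard FID uses the 2048-d final-pool InceptionV3 features
)
print(f"FID: {fid:.2f}")
```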

Thanks,

taldatech commented 2 years ago

We used the same code as https://github.com/mseitzer/pytorch-fid. The only difference there may be is the use of BatchNorm. When we train on CIFAR-10, the batch size is 32, so the BatchNorm statistics are not estimated accurately, and using model.eval() may cause a degradation in performance (https://discuss.pytorch.org/t/performance-highly-degraded-when-eval-is-activated-in-the-test-phase/3323/67).
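For intuition, here is a minimal, self-contained illustration (not code from this repo) of the train()/eval() gap for BatchNorm when the running statistics are unreliable:

```python
# Illustration: BatchNorm normalizes with batch statistics in train() mode,
# but with running statistics in eval() mode. If the running stats were
# poorly estimated, eval() outputs stay shifted.
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(8)
x = torch.randn(32, 8, 4, 4) * 5 + 3  # batch stats far from the (0, 1) init

bn.train()
y_train = bn(x)  # normalized with the current batch's statistics

bn.eval()
y_eval = bn(x)   # normalized with running stats, barely updated after 1 step

# y_train is ~zero-mean; y_eval is still shifted, mirroring the degradation
print(y_train.mean().item(), y_eval.mean().item())
```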

GloryyrolG commented 2 years ago

Hi Daniel @taldatech ,

Thanks for your quick reply. Yes, I used train() mode. What I did was manually generate images, save them to a folder, and then compare them to CIFAR-10 images. By the way, could you share some generated images to help check whether there is any problem?

taldatech commented 2 years ago

Maybe there is a degradation due to the compression you used when saving the images, rather than using the tensor outputs directly. Please take a look at your code (or share it) and the code in this repository and point out where you think there is a mismatch. We used the same base repository for the FID, so the only problem may be in the way you generate images.
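For example, a safe way to dump samples is to write lossless PNGs straight from the tensors, along the lines of this sketch (`model`, `z_dim`, and the output folder are placeholders, not code from this repo):

```python
# Sketch: save generator outputs as PNG (lossless), so that file
# compression cannot explain an FID gap. `model` stands for a trained
# Soft-IntroVAE with a decode() method; `z_dim` is its latent dimension.
import os
import torch
from torchvision.utils import save_image

out_dir = "./generated_samples"  # hypothetical output folder
os.makedirs(out_dir, exist_ok=True)

model.train()  # mode choice per the BatchNorm discussion above
with torch.no_grad():
    n_saved = 0
    for _ in range(5000 // 64):
        z = torch.randn(64, z_dim, device="cuda")
        fake = model.decode(z).clamp(0.0, 1.0)  # map to [0, 1] before saving
        for img in fake:
            save_image(img, os.path.join(out_dir, f"{n_saved:05d}.png"))
            n_saved += 1
```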

GloryyrolG commented 2 years ago

Hi Daniel @taldatech ,

I found that InceptionV3 has eval mode turned off, i.e., it runs in train mode when calculating FID:
https://github.com/taldatech/soft-intro-vae-pytorch/blob/b841bad3bb779244e8efca28f04f142ccd5923f8/soft_intro_vae/metrics/fid_score.py#L176
After uncommenting these statements, I got a result of 26.69, consistent with my previous report. I think it is reasonable to set S-IntroVAE to train mode when generating images, but we should not do the same for InceptionV3, right? Its mode directly affects the measured generation quality: with InceptionV3 in train mode, it is no wonder the model appears to achieve good performance. Thanks.
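In other words, something like this sketch, following the mseitzer/pytorch-fid conventions (variable names are illustrative):

```python
# Sketch of the fix discussed above: the feature extractor's mode is a
# separate choice from the generator's. InceptionV3 and BLOCK_INDEX_BY_DIM
# come from the pytorch-fid package.
import torch
from pytorch_fid.inception import InceptionV3

device = "cuda" if torch.cuda.is_available() else "cpu"
block_idx = InceptionV3.BLOCK_INDEX_BY_DIM[2048]  # final-pool features
inception = InceptionV3([block_idx]).to(device)
inception.eval()  # fixed BN statistics: FID features should be deterministic

# The generator under evaluation may still run in train() mode if its own
# BatchNorm running stats are unreliable; that does not affect the
# feature extractor.
```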

taldatech commented 2 years ago

Yes, thank you for noticing. I don't remember the exact reason we did that, but I saw some discussion online suggesting that for some models eval mode should be turned off. I'm not sure why the performance is so different (but, as such, the hyper-parameters should be tuned differently for this setting).

The regular architecture, which uses BatchNorm, is based on the original IntroVAE, and after many experiments, I'm not sure it is the best architecture. After moving to the Style architecture, we found it easier to work with. As you can see here: https://github.com/taldatech/soft-intro-vae-pytorch/blob/b841bad3bb779244e8efca28f04f142ccd5923f8/style_soft_intro_vae/metrics/fid_score.py#L213 we used model.eval() for the Style version, for which we report most of our image results.

My advice is to either change the architecture or try to re-tune the hyper-parameters differently (I don't recommend the latter). I hope this helps. Soft-IntroVAE is a way to train VAEs; the architecture can vary, so if you are familiar with a good architecture, you can just add the Soft-IntroVAE loss and see if it makes any difference.