Closed GloryyrolG closed 2 years ago
We used the same code as: https://github.com/mseitzer/pytorch-fid
The only difference there may be is the use of BatchNorm. When we train on CIFAR-10, the batch size is 32, and the BatchNorm statistics are not calculated correctly, so using model.eval() may cause a degradation in performance (https://discuss.pytorch.org/t/performance-highly-degraded-when-eval-is-activated-in-the-test-phase/3323/67).
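To illustrate the issue being described, here is a minimal sketch (toy tensor sizes, not the repo's model) of why train() and eval() can disagree so much when the running statistics have not converged:

```python
import torch
import torch.nn as nn

# With noisy running statistics, BatchNorm in eval() (which uses the running
# stats) can behave very differently from train() (which uses per-batch stats).
torch.manual_seed(0)
bn = nn.BatchNorm2d(3)            # fresh layer: running_mean=0, running_var=1
x = torch.randn(32, 3, 8, 8) * 5  # batch statistics far from the defaults

bn.train()
y_train = bn(x)                   # normalized with this batch's own stats

bn.eval()
y_eval = bn(x)                    # normalized with the barely-updated running stats

# After only one update step the two outputs differ substantially.
print((y_train - y_eval).abs().mean().item())
```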
Hi Daniel @taldatech ,
Thanks for your instant reply. Yeah, I used the train() mode. What I did was I manually generated images, saved them to a folder, and then compared them to CIFAR-10 images. Btw, could you share some generated images to help check whether there is any problem?
Maybe there is a degradation due to the compression you used when saving the images, rather than using the tensor outputs directly. Please take a look at your code (or share it) and the code in this repository and point out where you think there is a mismatch. We used the same base repository for the FID, so the only problem may be in the way you generate images.
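The compression concern above can be demonstrated in a few lines (a standalone sketch with a random placeholder image, not the repo's saving code): a lossy format like JPEG shifts pixel values, while PNG round-trips exactly, so FID on saved files should use PNG.

```python
import numpy as np
from PIL import Image

# `img` is a placeholder for a generated sample already converted to uint8 HWC.
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

Image.fromarray(img).save("sample.png")              # lossless
Image.fromarray(img).save("sample.jpg", quality=90)  # lossy

png_back = np.asarray(Image.open("sample.png"))
jpg_back = np.asarray(Image.open("sample.jpg"))
print((png_back == img).all())  # PNG round-trips exactly
print((jpg_back == img).all())  # JPEG does not
```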
Hi Daniel @taldatech ,
I found that InceptionV3 has its eval mode turned off, i.e., it is in train mode, when calculating FID: https://github.com/taldatech/soft-intro-vae-pytorch/blob/b841bad3bb779244e8efca28f04f142ccd5923f8/soft_intro_vae/metrics/fid_score.py#L176 After uncommenting these statements, I got a result of 26.69, consistent with my previous report. I think it is reasonable to set S-IntroVAE to train mode when generating images, but we should not change the mode of InceptionV3, right? It is directly related to the measured generation quality. When InceptionV3 is left in train mode, it is no wonder that the model can achieve a good score. Thanks.
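The principle at stake here can be sketched generically (using a toy stand-in for the feature extractor, not the real InceptionV3 from fid_score.py): the network that extracts FID features must stay frozen in eval() mode regardless of the mode of the generator being evaluated.

```python
import torch
import torch.nn as nn

# Toy stand-in for the FID feature extractor; the real one is InceptionV3.
extractor = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
extractor.eval()                     # freeze BatchNorm/Dropout behaviour
assert not extractor.training

with torch.no_grad():                # feature extraction needs no gradients
    feats = extractor(torch.randn(2, 3, 32, 32))
print(tuple(feats.shape))            # (2, 8, 30, 30)
```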
Yes, thank you for noticing. I don't remember the exact reason we did that, but I saw some discussion online that for some models it should be turned off. I'm not sure why the performance is so different (but as such, the hyper-parameters should be tuned differently for this setting). The regular architecture, which uses BatchNorm, is based on the original IntroVAE, and after many experiments, I'm not sure it is the best architecture. After moving to the Style architecture, we found it easier to work with. As you can see here:
https://github.com/taldatech/soft-intro-vae-pytorch/blob/b841bad3bb779244e8efca28f04f142ccd5923f8/style_soft_intro_vae/metrics/fid_score.py#L213
We used model.eval() for the Style version, for which we report most of our image results.
My advice is to either change the architecture or try to re-tune the hyper-parameters differently (I don't recommend this option). I hope this helps. Soft-IntroVAE is a way to train VAEs, the architecture can vary, so if you are familiar with a good architecture, you can just add the Soft-IntroVAE loss and see if it makes any difference.
Hi Daniel @taldatech ,
I found that in the CIFAR-10 experiment, the generation quality of the checkpoint you provided looks like this:
The first two rows are the training data; the following two rows are the reconstruction; the last four rows are the generation.
Using the --fid arg, I got a result similar to the one the paper reports, 4.37. But the generation quality does not actually seem that good, so I manually recomputed FID using the pytorch_fid repo and got a result of 25.86, which I think may be more reasonable. Similar phenomena are observed on some other datasets such as CelebA. So I suspect there might be some bugs in the FID calc? Correct me if I'm wrong. Thanks,
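For reference when comparing the two numbers: the quantity both FID implementations report is the Fréchet distance between two Gaussians fitted to Inception features. A minimal numpy/scipy sketch of that formula (with tiny toy Gaussians, not real feature statistics):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2*(S1 S2)^(1/2))
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)

# Identical distributions give distance 0; shifting the mean raises it.
mu, sigma = np.zeros(4), np.eye(4)
print(frechet_distance(mu, sigma, mu, sigma))        # ~0.0
print(frechet_distance(mu, sigma, mu + 1.0, sigma))  # ~4.0
```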