hubert0527 / COCO-GAN

COCO-GAN: Generation by Parts via Conditional Coordinating (ICCV 2019 oral)
https://hubert0527.github.io/COCO-GAN/
MIT License

Reproducibility for CelebA #8

Closed: mcbuehler closed this issue 4 years ago

mcbuehler commented 4 years ago

We have not been able to reproduce the results given the code in this repository. Here is what we have tried.

How did you train the provided weights? Did you use the private codebase? What might be a reason why we cannot reproduce the results?

Thank you.

hubert0527 commented 4 years ago

I'm sincerely sorry about the issue.

Ideally, this published version should be equivalent to my private one, and there is no additional training trick in my private codebase. One of the changes I made during the code publishing process was to make the generator and discriminator construction more generic (they were hard-coded in my private codebase), and that change unfortunately introduced the bug you found.

After fixing this (I have pushed the fix to GitHub), I can load the CelebA 64x64 checkpoint, and the results look fine to me. (attached sample: sampled_full_398_1668)

Could you provide your TensorFlow version, along with some samples generated from the pretrained checkpoint and from your trained-from-scratch model?

mcbuehler commented 4 years ago

Thank you for the fix.

I use tensorflow-gpu==1.9.0.

Samples generated by checkpoint 'CelebA_64x64_N2M2S32': (image attached)

Samples generated by trained-from-scratch 'CelebA_64x64_N2M2S32' (epoch 68): (image attached)

hubert0527 commented 4 years ago

It looks kind of okay. What FID scores did you get? For both of them, the FID should obviously be less than 100. If that's not the case, there are probably some problems with the FID calculation. For the latter one, I think you just need to train it longer.

mcbuehler commented 4 years ago

The FID calculated for the above models was far higher than 100, but I used a smaller sample size for the computation due to GPU memory constraints. I can compute it on 50K samples and report the numbers when it is done. To my eye, the images in your comment above look considerably more realistic.

hubert0527 commented 4 years ago

> but I used a smaller sample size for the computation

I don't quite follow. For the FID calculation, we first extract Inception features batch-by-batch, then compute the FID score over all 50K samples at once on the CPU. Ideally, there shouldn't be any difference when you extract Inception features with a smaller batch size.
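To illustrate why the batch size shouldn't matter, here is a minimal sketch of the standard Fréchet distance (not this repo's actual code): batching only affects how the Inception activations are collected, while the statistics are computed once over the full set.

```python
import numpy as np
from scipy import linalg

def fid_from_activations(act_real, act_fake):
    """Standard FID between two sets of Inception activations (N x D arrays)."""
    # Mean and covariance over the full sample set, regardless of extraction batch size.
    mu1, sigma1 = act_real.mean(axis=0), np.cov(act_real, rowvar=False)
    mu2, sigma2 = act_fake.mean(axis=0), np.cov(act_fake, rowvar=False)
    # Frechet distance between the two Gaussians fitted to the activations.
    covmean, _ = linalg.sqrtm(sigma1.dot(sigma2), disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return diff.dot(diff) + np.trace(sigma1 + sigma2 - 2.0 * covmean)
```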

Note that I got the results in my previous comment by cloning the latest code and running it on my local machine (which has TF 1.9.0). I think that's just sampling bias.

mcbuehler commented 4 years ago

> I think that's just sampling bias.

I agree that we should get an FID in at least a similar range. However, we recalculated the FID for the provided pre-trained models on 50K images. For CelebA, we get FID 292 and for LSUN we get FID 365.

hubert0527 commented 4 years ago

Sorry, let me clarify: there were two separate points in our discussion:

  1. I mentioned sampling bias because you said your generated samples looked slightly worse than the ones I provided. That difference is just sampling bias.
  2. The other problem is that you got an FID of 292 on CelebA, which clearly indicates a bug (though not one I can reproduce on my side). With an FID that high, the images should look extremely bad. I suspect your data is somehow problematic: e.g., the dataset is corrupted, the image format or value range returned by your modified image loading function differs from my original implementation, or some error made your pre-calculated FID statistics wrong.

I suppose you switched to imageio because you couldn't import scipy.misc.imread; scipy removed imread from scipy.misc in version 1.2.0. You may try downgrading scipy to 1.1.0 and running again (including re-computing the FID statistics for the real data) to see if the resulting FID becomes normal. If that still does not work, you may upload your pre-calculated FID stats file (for the real data) here and I will check whether the values are identical to mine.
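If you want a quick sanity check on the loader swap, something like this (a hypothetical helper, not part of this repo) verifies that imageio returns the same dtype, shape, and value range that scipy.misc.imread did, since the pre-calculated FID statistics for real data depend on them:

```python
import numpy as np
import imageio

def load_image_checked(path):
    """Load an image and assert it matches the uint8 HxWx3 format scipy.misc.imread returned."""
    img = np.asarray(imageio.imread(path))
    assert img.dtype == np.uint8, "expected uint8 pixels, got %s" % img.dtype
    assert img.ndim == 3 and img.shape[-1] == 3, "expected an HxWx3 RGB image"
    return img
```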

Thanks!

mcbuehler commented 4 years ago

I recalculated the FID, loading images with scipy.misc.imread, and again got a high value (295).

Here is a link to the pre-calculated FID statistics on 50K images from the CelebA dataset. Could you please check if these values are in a similar range as your pre-computed statistics?
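For reference, a minimal comparison might look like this (assuming both files are TTUR-style `.npz` archives with `mu` and `sigma` arrays; the key names and filenames here are assumptions, not this repo's confirmed format):

```python
import numpy as np

# Hypothetical filenames; substitute the actual statistics files being compared.
stats_mine = np.load("fid_stats_celeba_mine.npz")
stats_ref = np.load("fid_stats_celeba_reference.npz")
for key in ("mu", "sigma"):
    diff = np.abs(stats_mine[key] - stats_ref[key]).max()
    print("max abs difference in %s: %.6f" % (key, diff))
```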

hubert0527 commented 4 years ago

My apologies. I have checked your statistics, and they do not match mine. Furthermore, I can reproduce the issue now; there must be a bug in my data preprocessing code. I didn't specifically verify the correctness of the preprocessing, and I accidentally used my cached statistics while verifying the correctness of the model implementation, which made the issue invisible to me.

I will investigate this issue and get back to you ASAP. However, it will take me some time, as I'm a little busy now.

Sorry again for all the inconvenience I caused.

hubert0527 commented 4 years ago

Hi @mcbuehler, sorry that my mistakes wasted your time. The issue should now be resolved. There were two bugs introduced during the code publishing process, which caused (i) the FID statistics to be wrong, and (ii) the pretrained model to be corrupted while loading from the checkpoint. I can now reproduce (in a fresh run from scratch) the FID score of the pretrained CelebA 64x64 model.

Note that you will have to re-compute the FID statistics after pulling from GitHub, and the latest code is not completely backward compatible. To preserve model performance, you may need to re-train any generators you trained yourself, though I expect there wouldn't be much difference in performance.

Please kindly let me know if there is still any issue.

hubert0527 commented 4 years ago

Hi @mcbuehler, can you reproduce the FID now?

mcbuehler commented 4 years ago

Yes, I get an FID of 3.5 for the provided pre-trained model (CelebA_64x64_N2M2S32). Thank you for the bugfix.