giannisdaras / ylg

[CVPR 2020] Official Implementation: "Your Local GAN: Designing Two Dimensional Local Attention Mechanisms for Generative Models".
GNU General Public License v3.0

Inquiry about detailed training settings #2

Closed Gsunshine closed 4 years ago

Gsunshine commented 4 years ago

Hey, congrats on the acceptance of this intriguing paper!

I am working on reproducing YLG and SAGAN on ImageNet 128×128, but I am a bit confused about the detailed training settings. I run the code on a TPUv3-8 with batch size 256, evaluate every 2500 steps with batch size 1024 over 49 steps (50176 samples in total), and set the predicting batch size to 1024. The default TTUR with imbalanced learning rates and a 1:1 update rule are applied.
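For concreteness, here is a sketch of the run configuration described above; the key names are illustrative placeholders and the TTUR learning rates are the usual SAGAN defaults, i.e. assumptions, not values taken from the repo.

```python
# Sketch of the run configuration described above. Key names are
# illustrative placeholders; the TTUR learning rates are the usual
# SAGAN defaults -- assumptions, not values read from the repo.
train_config = {
    "train_batch_size": 256,       # this run (1024 is suggested below)
    "eval_every_steps": 2500,      # evaluate every 2500 training steps
    "eval_batch_size": 1024,
    "num_eval_steps": 49,          # 1024 * 49 = 50176 samples per eval
    "predict_batch_size": 1024,
    "generator_lr": 1e-4,          # TTUR: imbalanced learning rates
    "discriminator_lr": 4e-4,
    "disc_steps_per_gen_step": 1,  # 1:1 generator/discriminator updates
}
```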

Update: [image]

However, the two training runs show approximately the same performance over the first 100K iterations; see the figure below. Is there anything wrong with my training settings?

[image]

I would really appreciate it if you could provide any suggestions!

Update: The deep blue line is SAGAN and the other is YLG. The reported inception score on real images is 159.2, which seems right. I also checked the code to make sure the two runs used different models rather than the same one, and the observed steps/sec of the two runs do differ.

[image]

giannisdaras commented 4 years ago

Hey! Thanks for your interest in replicating this. I believe it is really important for research that paper results can be replicated :)

I noticed that you use batch size 256 for training. If I remember correctly, I used batch size 1024 (see also the instructions in the tf-gan library). Note that the batch size is really important for SAGAN, e.g. the BigGAN paper mentions that increasing the batch size leads to a significant improvement in FID/Inception score.

I believe an initial source of confusion is that on TPUs the batch_size you define is interpreted as the effective (global) batch size, whereas on GPUs (e.g. in PyTorch) batch_size is the batch size per device, which means you have to divide the desired effective batch size by the number of devices. If I remember correctly, in SAGAN they use 4 GPUs with batch size 256, which is equivalent to running the TPU version with batch size = 1024.
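A minimal sketch of that batch-size arithmetic, assuming the per-device convention on GPUs and the global convention on TPUs described above:

```python
# Minimal sketch of the batch-size conventions described above:
# per-device on GPUs vs. effective (global) on TPUs.

def effective_batch_size(per_device_batch: int, num_devices: int) -> int:
    """Global batch size when batch_size is specified per device."""
    return per_device_batch * num_devices

def per_device_batch_size(effective_batch: int, num_devices: int) -> int:
    """Per-device batch size needed to reach a target global batch."""
    return effective_batch // num_devices

# SAGAN setup as recalled above: 4 GPUs, per-GPU batch 256 -> global 1024,
# so the TPU run should be given batch_size = 1024 directly.
assert effective_batch_size(per_device_batch=256, num_devices=4) == 1024
assert per_device_batch_size(effective_batch=1024, num_devices=4) == 256
```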

Also, I assume you are using the train branch for training, right?

Gsunshine commented 4 years ago

Thanks a lot for your suggestion!

Yeah, I am using the train branch to replicate your work. I also noticed that you have already fixed the bug in calculating FID & IS and opened an issue in tf-gan. That's really helpful! :) To be honest, I had great difficulty finding a downloadable ImageNet2012, since the download protocol was changed and COVID-19 kept them from approving my download application. QAQ

Now everything is OK. I will train with batch size 1024. Your patience and suggestions help a lot!

BTW, how long will it take to train YLG for 1M steps with training batch size 1024, evaluation batch size 1024, and 1000 steps per eval on a TPUv3-8? My TFRC program is coming to an end, and I don't know whether it is possible to reproduce both models and also train my own. Maybe I need to email the TFRC team to ask for an extension of the validity period. Research on generative models, especially GANs, is really expensive. :)

giannisdaras commented 4 years ago

I see in the docs of the tf-gan library that training on a single TPUv3-8 takes about 3 full days with the specified batch size. In my experience, YLG training takes a bit longer, probably because of the mask computation. However, due to the locality bias in the YLG model, I expect training to need fewer steps, as mentioned in the paper, so you can expect some good results sooner. I will close this issue for now, but if you have any further questions, feel free to re-open it and I would be glad to help!
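As a rough back-of-the-envelope check, assuming a hypothetical throughput (the steps/sec value below is a placeholder, not a measured number), 1M steps works out to roughly the three days quoted above:

```python
# Back-of-the-envelope wall-clock estimate for a 1M-step run.
# steps_per_sec is a hypothetical placeholder -- substitute the value
# reported in your own training logs.
total_steps = 1_000_000
steps_per_sec = 4.0  # assumed throughput on a TPUv3-8 at batch size 1024
train_days = total_steps / steps_per_sec / 86_400
print(f"~{train_days:.1f} days of pure training time")  # ~2.9 days here
```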