Added rank check for saving network snapshots in training_loop

skymanaditya1 commented 2 years ago

Added a (rank == 0) check around saving the network snapshot. Without it, every process saves the network snapshot, causing the steps inside to be performed multiple times.

universome commented 2 years ago

Hi Aditya, thank you for your PR!

To be honest, I am not sure about this change, since the branch has the DDP consistency check inside, which requires all processes to enter the branch (otherwise they won't be checked for DDP consistency). As for saving the checkpoint, there is already a rank == 0 check on this line inside. We inherited this part of the code from NVIDIA's repo from here. Which steps are you referring to when saying "steps inside to be performed multiple times"?
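The pattern described above (every process enters the snapshot branch for the collective check, and only rank 0 writes the file) can be sketched roughly as follows. This is a toy model, not the repository's code: threads stand in for the per-GPU processes, a `threading.Barrier` stands in for the DDP consistency check, and the names `snapshot_step`, `simulate_snapshot`, and `saved_by` are hypothetical.

```python
import threading

def snapshot_step(rank, barrier, saved_by):
    # Every rank enters the snapshot branch: the consistency check is a
    # collective operation, modeled here by a barrier (an assumption).
    barrier.wait(timeout=5)
    # Only rank 0 writes the snapshot; the write is modeled by a list append.
    if rank == 0:
        saved_by.append(rank)

def simulate_snapshot(world_size=4):
    # One thread per simulated process; world_size=4 mirrors the 4-GPU setup.
    barrier = threading.Barrier(world_size)
    saved_by = []
    threads = [
        threading.Thread(target=snapshot_step, args=(rank, barrier, saved_by))
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return saved_by

if __name__ == "__main__":
    print(simulate_snapshot())
```

In this model, print statements placed before the inner `if rank == 0` would run once per process, which would match seeing them several times on a multi-GPU run even though the snapshot itself is written only once.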

skymanaditya1 commented 2 years ago

I see, that makes sense Ivan!

I had actually added a bunch of print statements inside, and they were printing multiple times on a multi-GPU setup. I see now that the rank check is already provided inside. I guess I should have checked the code more thoroughly, but I am more familiar with code written in the vqvae2, INR style. :)

I also see now that adding that extra (rank == 0) check causes issues: the GPU memory utilization on all 4 GPUs reaches 100% and the training doesn't proceed further. It could also have been because of my shorter snap times (to debug), or the video saving code inside the same file, which caused a SIGKILL (not sure if in isolation or in tandem). I removed both pieces of code (the rank == 0 check and the video saving), reset the snap value to the default (50), and the model seems to run fine now.
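One plausible reading of the stall, consistent with the point about the DDP consistency check, is that rank 0 enters a collective operation that the other ranks never join. The toy model below illustrates only that mechanism and does not establish that it was the actual cause here. Threads stand in for processes, a `threading.Barrier` with a timeout stands in for the collective, and `guarded_snapshot` and `simulate_outer_guard` are hypothetical names.

```python
import threading

def guarded_snapshot(rank, barrier, outcome):
    # Extra outer guard: only rank 0 enters the snapshot branch.
    if rank == 0:
        try:
            # The collective check expects every rank to arrive.
            barrier.wait(timeout=0.5)
            outcome.append("completed")
        except threading.BrokenBarrierError:
            # Nobody else arrived. A real collective typically has no such
            # short timeout, so the process would keep waiting instead.
            outcome.append("timed out")

def simulate_outer_guard(world_size=4):
    barrier = threading.Barrier(world_size)
    outcome = []
    threads = [
        threading.Thread(target=guarded_snapshot, args=(rank, barrier, outcome))
        for rank in range(world_size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome

if __name__ == "__main__":
    print(simulate_outer_guard())
```

The timeout exists only so the sketch terminates; it is not a claim about how the training code behaves.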

Thank you for being very patient and congrats on the amazing work! Looking forward to your future research.

universome / stylegan-v

Added rank check for saving network snapshots in training_loop #15