Training Getting Stuck on First Iteration

mfredriksz commented 3 years ago

Hello,

I am running into an issue with train_seg_gan.py (on both single and multiple GPUs): training reaches the first iteration and then appears to hang. GPU utilization stays constant and nothing further is logged.

This is the output I am getting:

==================Start calculating validation scores==================
d_img val scores: -1.5312, d_seg val scores: 0.1412
==================Start calculating FID==================
Gathering activations...
Calculating Inception Score...
Calculating means and covariances...
Covariances calculated, getting FID...
iteration 00000000: FID: 332.5834, IS_mean: 2.0889, IS_std: 0.0289

Once it reaches this point, nothing further happens. I ended up canceling the run after an hour of being stuck here. I am using the CelebAMask dataset for training.

PyTorch 1.4.0, CUDA Version: 11.0, Python 3.6.13

I appreciate any help you're able to provide me with!

mfredriksz commented 3 years ago

Closing the issue because I realized that training was in fact running; there just wasn't any logging. If someone else runs into this, look in your output directory: the sample images there should be updated throughout training.
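For anyone who would rather confirm this without watching the folder by hand, here is a minimal sketch of a check that reports how long ago anything in the output directory was last written. It is not part of the repository; the function names are made up for illustration, and the directory you pass in is whatever output path you gave the training script.

```python
import os
import time


def latest_modification(directory):
    """Return (path, mtime) of the most recently modified file under
    `directory` (searched recursively), or None if it contains no files."""
    newest = None
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # The file may have been removed or replaced mid-scan; skip it.
                continue
            if newest is None or mtime > newest[1]:
                newest = (path, mtime)
    return newest


def seconds_since_last_write(directory, now=None):
    """Return the number of seconds since the newest file under `directory`
    was modified, or None if no files were found."""
    newest = latest_modification(directory)
    if newest is None:
        return None
    if now is None:
        now = time.time()
    return now - newest[1]
```

Calling `seconds_since_last_write("path/to/your/output_dir")` every few minutes should return a small number if sample images are still being refreshed. A value that only keeps growing would suggest the run really has stalled, assuming the script writes samples at regular intervals.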

DanielTakeshi commented 2 years ago

@mfredriksz Just curious, which GPU are you using to train this? How much GPU memory does training need, and does the code use multiple GPUs?

nv-tlabs / semanticGAN_code