naver-ai / StyleMapGAN

Official pytorch implementation of StyleMapGAN (CVPR 2021)
https://www.youtube.com/watch?v=qCapNyRA_Ng

GPU memory shortage problem when loading weights from checkpoints #11

Closed junikkoma closed 3 years ago

junikkoma commented 3 years ago

Hello, I would first like to thank you for sharing your work.

I am having a problem loading weights from checkpoints (i.e., continuing halted training).

I am training StyleMapGAN on a custom dataset (~200K training images at 1024*1024 resolution), currently using 3 TITAN RTX GPUs. I am using latent_spatial_size=16 given the input image resolution and GPU memory. With this configuration, a batch size of 2 is allocated per GPU, using ~21 GiB of memory.

There is no problem when training from scratch. I have not tried the pretrained weights trained on FFHQ or CelebA because my data is quite different from human faces. Moreover, since I can successfully generate images with generate.py, I believe the weights were saved properly.

However, a memory allocation problem occurs every time I load my custom weights to continue training. I assumed extra memory might be required when loading weights, so I tried a smaller batch size (batch 2 per GPU -> batch 1 per GPU), but the same memory shortage occurs.

To summarize, I cannot load weights to continue training, whereas training from scratch and loading weights to generate images both work well. Therefore, I would like to ask the following questions.

  1. Have any of the authors experienced similar problems?
  2. Are there any possible solutions to this problem?

I would be grateful if you take a look into my question. Thank you!
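
A quick way to check whether loading the checkpoint itself is what allocates the extra GPU memory is to compare per-device usage before and after `torch.load`. A minimal sketch (the checkpoint path is a placeholder, not the repository's actual path):

```python
import torch

def report_gpu_memory(tag):
    # Print per-GPU allocated/reserved memory to see on which device
    # the deserialized checkpoint tensors end up.
    for i in range(torch.cuda.device_count()):
        alloc = torch.cuda.memory_allocated(i) / 2**30
        reserved = torch.cuda.memory_reserved(i) / 2**30
        print(f"[{tag}] cuda:{i} allocated={alloc:.2f} GiB reserved={reserved:.2f} GiB")

report_gpu_memory("before load")
# With the default map_location, the saved tensors are re-created on the
# GPU they were serialized from, on top of the model already resident there.
ckpt = torch.load("checkpoint.pt")  # placeholder path
report_gpu_memory("after load")
```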

blandocs commented 3 years ago

Thank you for the great discovery.

I fixed the problem; you can check the commit.
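
The commit itself is not quoted in this thread, but a common cause of this symptom is that `torch.load` restores tensors onto the GPU they were saved from by default, so resuming on top of an already-built model can temporarily spike per-GPU memory. A minimal sketch of the usual workaround, loading the checkpoint to CPU first (the function name, path, and state-dict keys below are placeholders, not the repository's actual code):

```python
import torch

def resume_from_checkpoint(path, model, optimizer):
    # Deserialize onto the CPU instead of the GPU the checkpoint was saved
    # from; load_state_dict then copies the values into parameters that are
    # already allocated on the GPU, so peak GPU usage stays close to what
    # training from scratch needs.
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])          # placeholder key
    optimizer.load_state_dict(ckpt["optimizer"])  # placeholder key
    start_iter = ckpt.get("iter", 0)
    del ckpt                                      # drop the CPU copies promptly
    return start_iter
```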

junikkoma commented 3 years ago

Thank you for your prompt reply.