nv-tlabs / semanticGAN_code

Official repo for SemanticGAN https://nv-tlabs.github.io/semanticGAN/
MIT License

Training stuck on g_regularize/path length regularization step, works when disabled #26

Open snibbor opened 2 years ago

snibbor commented 2 years ago

Hello,

Thank you for this fantastic work and for making the repository available. I am modifying the code to work with 1024x1024 patches from whole slide images (WSIs), so that it generates patches together with their corresponding annotation masks. I have implemented a dataloader for this dataset and tweaked the code to train with it, but training gets stuck on the g_regularize step. The code computes the path loss, but then hangs on weighted_path_loss.backward(). I am running on 4 A100 GPUs with 32 CPU cores in the cloud, launched with torch.distributed.launch and batch=8 (which I assume means 8 per GPU).
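For context, here is a sketch of the path-length penalty as implemented in rosinality's stylegan2-pytorch, which this repo's training loop follows closely (names are approximate and may differ slightly in train_seg_gan.py); the hang happens in the second backward pass at the end:

```python
import math
import torch
from torch import autograd

# StyleGAN2 path-length penalty (rosinality-style). The key detail is
# create_graph=True: the gradient itself stays in the graph, so the later
# weighted_path_loss.backward() is a backward pass through a backward pass.
def g_path_regularize(fake_img, latents, mean_path_length, decay=0.01):
    noise = torch.randn_like(fake_img) / math.sqrt(fake_img.shape[2] * fake_img.shape[3])
    grad, = autograd.grad(
        outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True
    )
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    path_penalty = (path_lengths - path_mean).pow(2).mean()
    return path_penalty, path_mean.detach(), path_lengths

# In the training loop (approximate names):
# fake_img, latents = generator(noise, return_latents=True)
# path_loss, mean_path_length, _ = g_path_regularize(fake_img, latents, mean_path_length)
# weighted_path_loss = args.path_regularize * args.g_reg_every * path_loss
# weighted_path_loss.backward()   # <-- this is where my run freezes at 1024x1024
```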

I forked the repo and you can see the changes I made here: https://github.com/crobbins327/semanticGAN_WSI.git (mainly a new WSI dataset in dataset.py and some tweaks to train_seg_gan.py to load it). When I disable path length regularization by setting the g_regularize if statement to False, training runs fine.
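Concretely, the workaround looks roughly like this (the scheduling condition below is how the rosinality-style loop does it; the exact line in train_seg_gan.py may differ):

```python
# Normal schedule: run the path-length step every g_reg_every iterations
g_regularize = i % args.g_reg_every == 0

# Temporary workaround: never run the path-length step
g_regularize = False
```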

Do you have any advice on how to troubleshoot this, or ideas about why the code freezes on the g_regularize step at 1024x1024 patch resolution?

snibbor commented 2 years ago

I read that this might be caused by suboptimal operations for A100/CUDA 11 in the double backpropagation used for this step: https://github.com/rosinality/stylegan2-pytorch/issues/175. I am going to implement mixed precision training to reduce GPU memory usage so that I can try training on 4 V100s instead and see whether that fixes the issue.
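For anyone following along, the plan is roughly the standard torch.cuda.amp pattern. The generator/discriminator below are tiny stand-ins so the snippet runs on its own; the real loop in train_seg_gan.py would wrap its existing losses the same way:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"

# Stand-ins for the StyleGAN2 generator/discriminator and optimizer.
generator = nn.Linear(512, 256).to(device)
discriminator = nn.Linear(256, 1).to(device)
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-3)

scaler = GradScaler()

for step in range(10):
    noise = torch.randn(8, 512, device=device)

    g_optim.zero_grad()
    with autocast():                                  # forward pass in fp16 where safe
        fake = generator(noise)
        fake_pred = discriminator(fake)
        # non-saturating generator loss, as in the StyleGAN2-style training code
        g_loss = torch.nn.functional.softplus(-fake_pred).mean()

    scaler.scale(g_loss).backward()                   # scale loss to avoid fp16 gradient underflow
    scaler.step(g_optim)                              # unscales grads, skips step on inf/nan
    scaler.update()                                   # adjust the loss scale for the next iteration
```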