nv-tlabs / semanticGAN_code

Official repo for SemanticGAN https://nv-tlabs.github.io/semanticGAN/
MIT License

Training stuck on g_regularize/path length regularization step, works when disabled #26

Open snibbor opened 2 years ago

snibbor commented 2 years ago

Hello,

Thank you for this fantastic work and for making the repository available. I am modifying the code to work with 1024x1024 patches from whole slide images (WSIs), so that it generates patches together with their corresponding annotation masks. I have implemented a dataloader for this dataset and tweaked the code to train with it, but training gets stuck on the g_regularize step. The code computes the path loss, but then hangs on weighted_path_loss.backward(). I am running on 4 A100 GPUs with 32 CPU cores in the cloud, launched with torch.distributed.launch and batch=8 (which I assume means 8 per GPU).
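For context, here is a sketch of the path-length penalty as implemented in rosinality's stylegan2-pytorch, which this repo's training loop follows closely (names are approximate and may differ slightly in train_seg_gan.py); the hang happens in the second backward pass at the end:

```python
import math
import torch
from torch import autograd

# StyleGAN2 path-length penalty (rosinality-style). The key detail is
# create_graph=True: the gradient itself stays in the graph, so the later
# weighted_path_loss.backward() is a backward pass through a backward pass.
def g_path_regularize(fake_img, latents, mean_path_length, decay=0.01):
    noise = torch.randn_like(fake_img) / math.sqrt(fake_img.shape[2] * fake_img.shape[3])
    grad, = autograd.grad(
        outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True
    )
    path_lengths = torch.sqrt(grad.pow(2).sum(2).mean(1))
    path_mean = mean_path_length + decay * (path_lengths.mean() - mean_path_length)
    path_penalty = (path_lengths - path_mean).pow(2).mean()
    return path_penalty, path_mean.detach(), path_lengths

# In the training loop (approximate names):
# fake_img, latents = generator(noise, return_latents=True)
# path_loss, mean_path_length, _ = g_path_regularize(fake_img, latents, mean_path_length)
# weighted_path_loss = args.path_regularize * args.g_reg_every * path_loss
# weighted_path_loss.backward()   # <-- this is where my run freezes at 1024x1024
```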

I forked the repo and you can see the changes I made here: https://github.com/crobbins327/semanticGAN_WSI.git (mainly a new WSI dataset in dataset.py and some tweaks to train_seg_gan.py to load it). When I disable path length regularization by setting the g_regularize if statement to False, training runs fine.
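Concretely, the workaround looks roughly like this (the scheduling condition below is how the rosinality-style loop does it; the exact line in train_seg_gan.py may differ):

```python
# Normal schedule: run the path-length step every g_reg_every iterations
g_regularize = i % args.g_reg_every == 0

# Temporary workaround: never run the path-length step
g_regularize = False
```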

Do you have any advice on how to troubleshoot this, or ideas about why the code freezes on the g_regularize step at 1024x1024 patch resolution?

snibbor commented 2 years ago

I read that this might be caused by suboptimal operations for A100/CUDA 11 in the double backpropagation used for this step: https://github.com/rosinality/stylegan2-pytorch/issues/175. I am going to implement mixed precision training to reduce GPU memory usage so that I can try training on 4 V100s instead and see whether that fixes the issue.
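For anyone following along, the plan is roughly the standard torch.cuda.amp pattern. The generator/discriminator below are tiny stand-ins so the snippet runs on its own; the real loop in train_seg_gan.py would wrap its existing losses the same way:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

device = "cuda"

# Stand-ins for the StyleGAN2 generator/discriminator and optimizer.
generator = nn.Linear(512, 256).to(device)
discriminator = nn.Linear(256, 1).to(device)
g_optim = torch.optim.Adam(generator.parameters(), lr=2e-3)

scaler = GradScaler()

for step in range(10):
    noise = torch.randn(8, 512, device=device)

    g_optim.zero_grad()
    with autocast():                                  # forward pass in fp16 where safe
        fake = generator(noise)
        fake_pred = discriminator(fake)
        # non-saturating generator loss, as in the StyleGAN2-style training code
        g_loss = torch.nn.functional.softplus(-fake_pred).mean()

    scaler.scale(g_loss).backward()                   # scale loss to avoid fp16 gradient underflow
    scaler.step(g_optim)                              # unscales grads, skips step on inf/nan
    scaler.update()                                   # adjust the loss scale for the next iteration
```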