Open snibbor opened 2 years ago
I read that this might be due to suboptimal kernels for A100/CUDA 11 in the double backpropagation used to calculate this step: https://github.com/rosinality/stylegan2-pytorch/issues/175. I am going to implement mixed-precision training to reduce GPU memory usage so that I can try training on 4 V100s and see whether that fixes the issue.
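For anyone following along, the mixed-precision plan could be sketched with `torch.cuda.amp` roughly as below. This is a minimal, generic training-step sketch, not code from the semanticGAN repo: `model`, `optimizer`, `loss_fn`, `x`, and `y` are placeholders, and the AMP flag falls back to fp32 when no GPU is present.

```python
import torch

use_amp = torch.cuda.is_available()  # fall back to plain fp32 on CPU
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

def train_step(model, optimizer, loss_fn, x, y):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs selected ops in fp16 under autocast
    with torch.cuda.amp.autocast(enabled=use_amp):
        loss = loss_fn(model(x), y)
    # Scale the loss so fp16 gradients do not underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales grads; skips the step on inf/nan
    scaler.update()
    return loss.item()
```

One caveat worth flagging: regularization terms that take gradients of gradients (like the path length penalty here) need extra care under AMP, since the second backward also has to go through the scaler.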
Hello,
Thank you for this fantastic work and for this repository. I am modifying the code to work with 1024x1024 patches from whole-slide images, generating image patches and corresponding annotation masks. I have successfully implemented a dataloader for this dataset and tweaked the code to work with it, but training gets stuck on the `g_regularize` step. The code calculates the path loss, but hangs on `weighted_path_loss.backward()`. I am running on 4 A100 GPUs with 32 cores in the cloud, launched with `torch.distributed.launch` and `--batch 8` (I assume this is 8 per GPU). I forked the repo and you can see my changes here: https://github.com/crobbins327/semanticGAN_WSI.git. The changes are mainly a new WSI dataset in `dataset.py` and some tweaks to `train_seg_gan.py` to load the dataset.

When I disable path length regularization by forcing the `g_regularize` if statement to `False`, the code runs fine. Do you have any advice on how to troubleshoot why the training freezes during the `g_regularize` step for this dataset at 1024x1024 patch resolution?