Closed: moonfoam closed this issue 4 years ago
It looks good to me. I didn't encounter such problems. The only different setting is that I use 8 V100s with 1 image per GPU.
@lxtGH Thanks for your reply. I think there is not much difference with syncBN as long as the batch size stays the same. Here is my log file: I just reduced the learning rate from 0.005 to 0.002, and it seemed to run well. However, it suddenly crashed and I don't know why. Do you know what happened? log_20201002_142033.txt
It looks very strange. It looks like the body loss becomes NaN. Maybe you could try lowering the weight of the body loss?
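In case it helps, here is a minimal sketch of what down-weighting the body term could look like; the loss names and the `body_loss_weight` factor are placeholders for illustration, not the actual variables or config keys of this repo:

```python
import torch

# Illustrative placeholders; in the real training loop these are the
# segmentation, body, and edge losses computed by the model.
seg_loss = torch.tensor(1.2, requires_grad=True)
body_loss = torch.tensor(0.8, requires_grad=True)
edge_loss = torch.tensor(0.3, requires_grad=True)

# Lower this factor if the body term is the one that blows up to NaN.
body_loss_weight = 0.5

total_loss = seg_loss + body_loss_weight * body_loss + edge_loss
total_loss.backward()
```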
@lxtGH Ok, I will check the output and adjust the parameters I used.
After setting the LR to 0.001, it ran stably for 20 epochs, and the problem seems solved. I guess it may be due to the DDP training.
I met the same error, and the solution in https://github.com/NVIDIA/semantic-segmentation/issues/29#issuecomment-560472406 worked for me. The NaN problem may be caused by SGD.
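For anyone hitting this later, a common generic way to guard against this kind of instability (not necessarily the exact fix from the linked comment) is to clip gradients and skip the update when the loss goes non-finite. A minimal sketch, with a placeholder model and optimizer just to keep it self-contained:

```python
import torch

# Placeholder model and optimizer; in practice these come from the training script.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def training_step(inputs, targets):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)

    # Skip the step entirely if the loss already went NaN/Inf,
    # so a single bad batch does not poison the weights.
    if not torch.isfinite(loss):
        return None

    loss.backward()
    # Clip gradients to tame occasional spikes that can push SGD into NaN.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=10.0)
    optimizer.step()
    return loss.item()
```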
I have trained the coarse network following the script './scripts/train/train_cityscapes_ResNet50_deeplab.sh' for 97 epochs and got a good base model.
However, when I try to refine it with the script './scripts/train/train_ciytscapes_ResNet50_deeplab_decouple.sh', it crashes with NaN because there are no valid labels.
I use 2x RTX 3090 GPUs with CUDA 11.1 and PyTorch 1.17; maybe it is caused by the different GPUs, so first I want to check whether the parameters in this script are what you used (180 epochs in total)?
The scripts I used are as follows:
train_cityscapes_ResNet50_deeplab.sh
train_ciytscapes_ResNet50_deeplab_decouple.sh