jnkl314 / DeepLabV3FineTuning

Semantic Segmentation : Multiclass fine tuning of DeepLabV3 with PyTorch

train.py halts at random #2

Closed aedirn closed 3 years ago

aedirn commented 4 years ago

I am using this repo's training code in a straightforward way so far, with no modifications to the model and no changes to the training steps. I'm calling training via the command line just as described in the README.

Sometimes, however, the loop just halts indefinitely! The only thing that makes it resume is sending a keypress to the command line, like pressing 'enter.' CPU usage for the process drops to near zero. I tested this by leaving it frozen for over an hour, and it did not unfreeze on its own.

Through simple debug print statements I found that it halts at train.py line 67: outputs = model(inputs)['out']. Since that is the main forward pass of training, I'm not surprised it's the line where it freezes. Does anyone know why this could be happening? There doesn't seem to be a pattern to when it freezes; usually it freezes 0 times per epoch, but sometimes as many as 4 times per epoch.
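For context, that line is the standard forward pass of a torchvision segmentation model, which returns a dict whose 'out' entry holds the main head's logits. Below is a minimal sketch of that step in isolation; the model builder, num_classes value, and dummy batch are illustrative assumptions, not copied from this repo's train.py:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# Illustrative setup; train.py in this repo builds its own model and dataloader.
model = deeplabv3_resnet101(num_classes=6)  # num_classes is a placeholder
model.train()

inputs = torch.randn(4, 3, 256, 256)  # dummy batch of shape (N, C, H, W)

# torchvision segmentation models return an OrderedDict;
# 'out' is the main head's logits with shape (N, num_classes, H, W).
outputs = model(inputs)['out']
print(outputs.shape)  # torch.Size([4, 6, 256, 256])
```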

Any help is appreciated! This problem stops me from pressing 'Go' and letting the model train unattended; it forces me to keep checking back in on it.

aedirn commented 3 years ago

Since then, I've found that it's a PyTorch issue when training on CPU, not really an issue stemming from DeepLabV3FineTuning.
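For anyone hitting the same thing, here is a minimal sketch of how device placement is typically made explicit in PyTorch, so you can confirm whether training is actually running on CPU or move it to a GPU if one is available. The model, dataset, and variable names are illustrative assumptions, not taken from this repo's train.py:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.segmentation import deeplabv3_resnet101

# Pick the GPU if one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# Illustrative model and dummy data; the real script uses its own dataset.
model = deeplabv3_resnet101(num_classes=6).to(device)
model.train()
dataset = TensorDataset(torch.randn(8, 3, 256, 256),
                        torch.randint(0, 6, (8, 256, 256)))
loader = DataLoader(dataset, batch_size=4)

for inputs, labels in loader:
    # Move each batch to the same device as the model before the forward pass.
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)['out']
    print(outputs.shape, outputs.device)
```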