jnkl314 / DeepLabV3FineTuning

Semantic Segmentation : Multiclass fine tuning of DeepLabV3 with PyTorch

train.py halts at random #2

Closed aedirn closed 3 years ago

aedirn commented 4 years ago

I am using this repo's training code in a straightforward way so far, with no modifications to the model and no changes to the training steps. I'm calling training via the command line just as described in the README.

Sometimes, however, the loop just halts indefinitely! The only thing that makes it resume is sending a keypress to the command line, like pressing 'enter.' CPU usage for the process drops to near zero. I tested this by leaving it frozen for over an hour, and it did not unfreeze on its own.

Through simple debug print statements I found that it halts at train.py line 67: outputs = model(inputs)['out']. Since that is the main forward pass of training, I'm not surprised it's the line where it freezes. Does anyone know why this could be happening? There doesn't seem to be a pattern to when it freezes; usually it freezes 0 times per epoch, but sometimes as many as 4 times per epoch.
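For context, that line is the standard forward pass of a torchvision segmentation model, which returns a dict whose 'out' entry holds the main head's logits. Below is a minimal sketch of that step in isolation; the model builder, num_classes value, and dummy batch are illustrative assumptions, not copied from this repo's train.py:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

# Illustrative setup; train.py in this repo builds its own model and dataloader.
model = deeplabv3_resnet101(num_classes=6)  # num_classes is a placeholder
model.train()

inputs = torch.randn(4, 3, 256, 256)  # dummy batch of shape (N, C, H, W)

# torchvision segmentation models return an OrderedDict;
# 'out' is the main head's logits with shape (N, num_classes, H, W).
outputs = model(inputs)['out']
print(outputs.shape)  # torch.Size([4, 6, 256, 256])
```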

Any help is appreciated! This problem stops me from pressing 'Go' and letting the model train unattended; it forces me to keep checking back in on it.

aedirn commented 3 years ago

Since then, I've found that it's a PyTorch issue when training on CPU, not really an issue stemming from DeepLabV3FineTuning.
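For anyone hitting the same thing, here is a minimal sketch of how device placement is typically made explicit in PyTorch, so you can confirm whether training is actually running on CPU or move it to a GPU if one is available. The model, dataset, and variable names are illustrative assumptions, not taken from this repo's train.py:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.segmentation import deeplabv3_resnet101

# Pick the GPU if one is available, otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {device}")

# Illustrative model and dummy data; the real script uses its own dataset.
model = deeplabv3_resnet101(num_classes=6).to(device)
model.train()
dataset = TensorDataset(torch.randn(8, 3, 256, 256),
                        torch.randint(0, 6, (8, 256, 256)))
loader = DataLoader(dataset, batch_size=4)

for inputs, labels in loader:
    # Move each batch to the same device as the model before the forward pass.
    inputs, labels = inputs.to(device), labels.to(device)
    outputs = model(inputs)['out']
    print(outputs.shape, outputs.device)
```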