peastman opened 9 months ago

My training runs always crash after exactly 50 epochs. Looking at the log, there are many repetitions of this error:

and then it finally exits with this error:

Any idea what could be causing it?
Some users started seeing similar behavior to this, so I added this workaround to the README:

Some CUDA systems might hang during multi-GPU parallel training. Try export NCCL_P2P_DISABLE=1, which disables direct peer-to-peer GPU communication.

Could it be the root of your issue too? I am assuming this is a multi-GPU training.
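For reference, the same workaround can also be applied from inside Python rather than the shell. This is only a sketch (everything below the environment variable is a placeholder); the one detail that matters is setting the variable before NCCL gets initialized:

```python
import os

# Disable direct peer-to-peer GPU copies. NCCL reads this when it sets up
# its communicators, so set it before importing torch / torch.distributed.
os.environ["NCCL_P2P_DISABLE"] = "1"

import torch

if __name__ == "__main__":
    # ... set up and launch the usual multi-GPU training from here
    pass
```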
I do not remember the error being as consistent as you say (always at 50 epochs), so it might be unrelated. OTOH, the error message points to pinned memory, which makes me think of this: https://github.com/torchmd/torchmd-net/blob/166b7db8661696f01c4adeb0cb02313c236061a2/torchmdnet/data.py#L132-L139

It would be great if you could try persistent_workers=False and pin_memory=False (separately) and report back.
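In case it helps, toggling those two flags off looks roughly like this; the dataset, batch size, and worker count below are placeholders for illustration, not the actual values from data.py:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset, just for illustration.
dataset = TensorDataset(torch.randn(128, 3))

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=4,
    persistent_workers=False,  # experiment 1: do not keep workers alive between epochs
    pin_memory=False,          # experiment 2: do not copy batches into pinned host memory
)

# Iterate once so the workers spin up and shut down cleanly.
for (batch,) in loader:
    pass
```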
Thanks! Yes, this is with multiple GPUs. I just started a run with persistent_workers=False. I'll let you know what happens.
Crossing my fingers, but I think persistent_workers=False fixed it. My latest training run is up to 70 epochs without crashing.