Closed: bfineran closed this issue 3 years ago
This bug was caused by the main worker process waiting at a torch.distributed.barrier call for other workers to reach a line of code that those workers were never running in the first place. As a result, the main worker locked up and produced no output. #29 provides a fix for this issue.
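For context, here is a minimal sketch of this failure mode and its fix. The worker functions, process-group setup, and rank checks below are illustrative assumptions, not the actual scripts/pytorch_vision.py code: the point is that a barrier reached only by rank 0 blocks forever because the other ranks never call it, while a barrier reached by every rank completes.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp


def buggy_worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Buggy pattern: only rank 0 ever reaches this barrier, so it waits
    # forever for the other ranks and the main worker appears to hang
    # with no further console output.
    if rank == 0:
        print("epoch complete, writing logs...")
        dist.barrier()  # never returns: ranks > 0 never call it

    dist.destroy_process_group()


def fixed_worker(rank: int, world_size: int):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Fixed pattern: every rank reaches the same barrier; only rank 0 logs.
    if rank == 0:
        print("epoch complete, writing logs...")
    dist.barrier()  # all ranks participate, so nothing blocks indefinitely

    dist.destroy_process_group()


if __name__ == "__main__":
    # Spawning fixed_worker completes; swapping in buggy_worker hangs on rank 0.
    mp.spawn(fixed_worker, args=(2,), nprocs=2)
```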
Describe the bug
After training begins using torch.distributed's DistributedDataParallel with scripts/pytorch_vision.py, console output stops while parallel training continues in the background. TensorBoard logging also does not write any updates. This is likely due to the logic that decides which node logs updates (see the sketch below). Training is still running, since GPU memory usage is listed under nvidia-smi.
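As an illustration of that hypothesis only (the rank check below is a common DistributedDataParallel pattern and an assumption, not the actual logging logic in scripts/pytorch_vision.py): console and TensorBoard output is typically gated on the process rank, so if no process ever evaluates its rank as 0, nothing is printed or written even though training proceeds.

```python
import os
from torch.utils.tensorboard import SummaryWriter

# Common DDP pattern: only the rank-0 process prints and writes TensorBoard
# events. If the rank read here is never 0 in any process (e.g. the wrong
# environment variable is consulted), no process produces any output at all.
rank = int(os.environ.get("RANK", -1))
writer = SummaryWriter("runs/ddp_example") if rank == 0 else None


def log_scalar(tag: str, value: float, step: int):
    if rank == 0:
        print(f"step {step}: {tag}={value:.4f}")
        writer.add_scalar(tag, value, step)
```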
Expected behavior
Epoch and modifier progress should be logged to the terminal as well as to TensorBoard.
Environment
Include all relevant environment information:
Version or commit hash [e.g. f7245c8]: 6f77f0a

To Reproduce
Exact steps to reproduce the behavior:
Errors
No errors, just no logging.
Additional context