neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

Epoch progress bars not displayed during torch DDP pruning using scripts/pytorch_vision.py #28

Closed bfineran closed 3 years ago

bfineran commented 3 years ago

Describe the bug After training begins using torch.distributed's DistributedDataParallel with scripts/pytorch_vision.py, console output stops while parallel training continues in the background. TensorBoard logging also stops writing updates. This is likely due to the logic that decides which process logs updates. Training is still running, as GPU memory usage is listed under nvidia-smi.
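For context, DDP training scripts typically gate console and TensorBoard output on rank 0 so that only one process logs. Below is a minimal sketch of that gating pattern; the helper name is hypothetical and not taken from sparseml.

```python
import torch.distributed as dist

def is_main_process() -> bool:
    # Hypothetical helper: in a single-process run the process group is
    # never initialized, so treat that case as the main process.
    if not dist.is_available() or not dist.is_initialized():
        return True
    return dist.get_rank() == 0

# Sketch of how a training loop might gate its logging (names illustrative):
# if is_main_process():
#     print(f"epoch {epoch}: loss {loss:.4f}")
#     tb_writer.add_scalar("train/loss", loss, global_step)
```

If this gating logic checks the wrong rank, or the main process blocks elsewhere, all visible output disappears even though training continues on every GPU.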

Expected behavior Epoch and modifier progress should be logged to the terminal as well as to TensorBoard.

Environment Include all relevant environment information:

  1. OS [e.g. Ubuntu 18.04]: ubuntu 16.04
  2. Python version [e.g. 3.7]: 3.6
  3. SparseML version or commit hash [e.g. 0.1.0, f7245c8]: 6f77f0a
  4. ML framework version(s) [e.g. torch 1.7.1]: torch 1.7.1
  5. Other Python package versions [e.g. SparseZoo, DeepSparse, numpy, ONNX]: n/a
  6. Other relevant environment information [e.g. hardware, CUDA version]: 4x NVIDIA GPUs

To Reproduce Exact steps to reproduce the behavior:

python -m torch.distributed.launch \
  --nproc_per_node 4 \
  scripts/pytorch_vision.py train \
    --use-mixed-precision \
    --recipe-path pruning_resnet50_imagenet.yaml \
    --arch-key resnet50 \
   ...
    --save-dir sparisfy_test \
    --logs-dir sparisfy_test \
    --model-tag resnet50-imagenet \
    --pretrained True \
    --save-best-after 34

Errors No errors are raised; logging simply stops.

Additional context

bfineran commented 3 years ago

This bug was caused by the main worker process waiting at a torch.distributed.barrier() call that the other worker processes never reached, because they were not running that line of code in the first place. The main worker therefore blocked indefinitely and produced no output. #29 provides a fix for this issue.
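The deadlock described above can be reproduced and contrasted with the correct pattern in a small sketch. This is not the sparseml code, just an illustration: dist.barrier() is a collective call, so every rank in the process group must execute it, or the ranks that do call it wait forever.

```python
import os
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # BUGGY pattern (would hang): only rank 0 reaches the barrier,
    # so it waits forever for peers that never call it.
    # if rank == 0:
    #     dist.barrier()

    # FIXED pattern: rank-0-only side effects are fine, but the
    # barrier itself is executed unconditionally by every rank.
    if rank == 0:
        print("rank 0: logging/checkpoint work")
    dist.barrier()

    dist.destroy_process_group()

if __name__ == "__main__":
    # Two CPU processes are enough to demonstrate the pattern.
    mp.spawn(worker, args=(2,), nprocs=2, join=True)
```

With join=True, mp.spawn only returns once both ranks pass the barrier and exit cleanly; uncommenting the buggy block instead hangs the run, matching the silent stall reported in this issue.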