Closed by 1ytic 5 years ago
Huge thanks for the apex & DistributedDataParallel integration (plus the nicer tqdm bar)! We also verified that the log-determinant changes throughout iterations when using this setup (rather than DataParallel), so the current incompatibility appears to be specific to a reference-counting issue in DataParallel.
The Apex utilities (https://github.com/NVIDIA/apex) handle some issues with specific nodes in the FloWaveNet architecture.
List of changes made in train.py:
For example, to run on 4 GPUs, use the following command:

```
python -m torch.distributed.launch --nproc_per_node=4 train_apex.py --num_workers 2 --epochs 1000
```
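For context, here is a minimal sketch of the kind of plumbing `torch.distributed.launch` expects from the worker script: the launcher spawns one process per GPU and passes each a `--local_rank` argument, which the script uses to pin itself to a device. The function names (`parse_distributed_args`, `device_for_rank`) are illustrative assumptions, not taken from the actual `train_apex.py`.

```python
# Sketch of the argument handling torch.distributed.launch requires.
# This is an illustrative assumption of the setup, not the repo's code.
import argparse


def parse_distributed_args(argv=None):
    # torch.distributed.launch passes --local_rank to every spawned worker,
    # so the training script must accept it alongside its own flags.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.add_argument("--num_workers", type=int, default=2)
    parser.add_argument("--epochs", type=int, default=1000)
    return parser.parse_args(argv)


def device_for_rank(local_rank):
    # Each worker binds to exactly one GPU, selected by its local rank.
    # In the real script this would be followed by (roughly):
    #   torch.cuda.set_device(local_rank)
    #   torch.distributed.init_process_group(backend="nccl")
    #   model = torch.nn.parallel.DistributedDataParallel(model, ...)
    return f"cuda:{local_rank}"


if __name__ == "__main__":
    args = parse_distributed_args(["--local_rank", "1"])
    print(device_for_rank(args.local_rank))  # cuda:1
```

With `--nproc_per_node=4`, the launcher runs four copies of the script with `--local_rank` set to 0 through 3, so each process trains on its own GPU while gradients are synchronized by DistributedDataParallel.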
Resolves: #13
See also: #16