Open L0SG opened 5 years ago
Oh my God. I trained the multi-GPU version for one week on all four of my GPUs. In the params/flowavenet/ dir, only one checkpoint was generated.
Thanks for pointing this out.
Oops, sorry about the delayed issue post in this repo. I filed the report with the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.
Update: the issue still persists in the latest 1.0.1 release.
Note: the DistributedDataParallel implementation from @1ytic circumvents the multi-GPU issue, so please use train_apex.py of the master branch until the DataParallel issue (from train.py) is resolved.
Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.
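For anyone who wants to guard against running the DataParallel path on an affected install, here is a minimal sketch of a version check. It is not part of this repo: the helper names parse_version and dataparallel_affected are hypothetical, and it assumes every release from 1.0.0 up to (but not including) 1.2.0 is affected. This thread only confirms 1.0.0 and 1.0.1 as broken and 1.2.0 as fixed, so treat 1.1.x as an assumption.

```python
import re


def parse_version(version):
    """Return (major, minor, patch) from a string such as '1.0.1' or '1.2.0+cu92'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def dataparallel_affected(version):
    """Guess whether this PyTorch version has the multi-GPU gradient issue.

    Assumption: the affected range is 1.0.0 <= version < 1.2.0.
    Only 1.0.0 and 1.0.1 (broken) and 1.2.0 (fixed) are confirmed in this thread.
    """
    return (1, 0, 0) <= parse_version(version) < (1, 2, 0)
```

In train.py this could be called as dataparallel_affected(torch.__version__) before wrapping the model, printing a warning that points to train_apex.py when it returns True.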
Currently, we cannot run the multi-GPU training on PyTorch v1.0.0 due to a strange null gradient issue.