PyTorch v1.0.0 multi-GPU compatibility issue

ksw0306 / FloWaveNet

A PyTorch implementation of "FloWaveNet: A Generative Flow for Raw Audio"

MIT License


PyTorch v1.0.0 multi-GPU compatibility issue #13

Open L0SG opened 5 years ago

L0SG commented 5 years ago

Currently, multi-GPU training does not work on PyTorch v1.0.0 due to a strange null gradient issue.
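One way to spot this kind of problem early is to check, right after the backward pass, whether any trainable parameter ended up without a gradient. The helper below is only a sketch: it assumes "null gradient" means `.grad` is `None`, the name `find_missing_grads` is hypothetical, and the stand-in objects and layer names are made up so the snippet runs without PyTorch installed.

```python
from types import SimpleNamespace


def find_missing_grads(named_params):
    """Return the names of trainable parameters whose .grad is None.

    `named_params` is any iterable of (name, parameter) pairs, where each
    parameter exposes `requires_grad` and `grad` attributes.
    """
    missing = []
    for name, param in named_params:
        # Frozen parameters legitimately have no gradient, so skip them.
        if not getattr(param, "requires_grad", False):
            continue
        if getattr(param, "grad", None) is None:
            missing.append(name)
    return missing


# Stand-in objects (hypothetical layer names) so the example needs no GPU.
demo_params = [
    ("front_conv.weight", SimpleNamespace(requires_grad=True, grad=None)),
    ("front_conv.bias", SimpleNamespace(requires_grad=True, grad=0.1)),
    ("frozen.weight", SimpleNamespace(requires_grad=False, grad=None)),
]

missing = find_missing_grads(demo_params)
print("parameters without gradients:", missing)
```

In a real training loop the same helper could be called with `model.named_parameters()` after `loss.backward()`; a non-empty result would suggest that the optimizer step is not updating those parameters.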

candlewill commented 5 years ago

Oh my God. I have been training the multi-GPU version for a week on all four of my GPUs, and only one checkpoint was generated in the params/flowavenet/ dir.

Thanks for pointing this out.

L0SG commented 5 years ago

Oops, sorry for the delay in posting this issue here. I filed a report in the PyTorch repo about two weeks ago, so please stick to v0.4.1 until the issue is resolved.

L0SG commented 5 years ago

Update: the issue still persists in the latest 1.0.1 release.

L0SG commented 5 years ago

Note: the DistributedDataParallel implementation from @1ytic circumvents the multi-GPU issue, so please use train_apex.py on the master branch until the DataParallel issue (in train.py) is resolved.

L0SG commented 5 years ago

Update: the issue was fixed in the 1.2.0 release. We'll keep this issue open for a while for future reference.
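For anyone pinned to an older install, the version history in this thread can be turned into a small guard. This is a rough sketch, not part of the repo: the function names are hypothetical, and treating every release from 1.0.0 up to (but not including) 1.2.0 as affected is an assumption, since the thread only confirms 1.0.0 and 1.0.1 as broken and 0.4.1 and 1.2.0 as working.

```python
import re


def parse_version(version):
    """Extract (major, minor, patch) from a string such as '1.0.1.post2'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return tuple(int(part) for part in match.groups())


def dataparallel_affected(version):
    """Return True if the version falls in the range assumed to be affected.

    Assumption: everything in [1.0.0, 1.2.0) is affected. The thread only
    confirms 1.0.0 and 1.0.1 explicitly.
    """
    return (1, 0, 0) <= parse_version(version) < (1, 2, 0)


for candidate in ["0.4.1", "1.0.0", "1.0.1.post2", "1.2.0"]:
    status = "affected" if dataparallel_affected(candidate) else "ok"
    print(candidate, "->", status)
```

A training script could pass `torch.__version__` to `dataparallel_affected` at startup and print a warning that points to train_apex.py when it returns True.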