ktagen-sudo opened this issue 3 years ago
Hi, can you try running the pytorch word-level language modeling example https://github.com/pytorch/examples to verify that non-distributed training at least works?
I haven't tried distributed training with these wheels, so I'm sadly unsure of whether they'd work or not :(
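As a first step, a minimal sanity check along these lines (a sketch, assuming the torch==1.9.0+cu111 wheel is installed on a node with one K40) should tell you whether the build itself has working sm_35 kernels before you bring NCCL into the picture:

```python
import torch

# Report the build and the device PyTorch sees (K40 is compute capability 3.5).
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

# A plain CUDA kernel launch: this raises a "no kernel image is available"
# style error if the wheel was built without sm_35 support.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())
```

If that runs cleanly and the word-level language modeling example trains, the problem is likely confined to the NCCL path rather than the CUDA build as a whole.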
Have you found a way to run torch.distributed on the Tesla K40?
First of all, thank you very much for building all of these backward-compatible PyTorch binaries for the NVIDIA Tesla K40. I am currently working on distributed training with the NCCL backend (GPUs). The cluster where I run my experiments is equipped with NVIDIA Tesla K40s, and I am using your binary version "torch==1.9.0+cu111".
When running an experiment, I hit the following error: "enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'",
which in turn leads to this fatal error: "RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1631630742027/work/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled cuda error, NCCL version 2.7.8 0: ncclUnhandledCudaError: Call to CUDA function failed."
It occurs right after the init phase finishes, so I still see the "Init COMPLETE" log just before the error.
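For reference, a minimal single-process stand-in for the failing setup looks like this (a sketch with a hypothetical local address and port; the real job uses a multi-node launcher). The failure surfaces on the first collective that launches a CUDA kernel, not in init_process_group itself, which matches seeing "Init COMPLETE" before the crash:

```python
import os
import torch
import torch.distributed as dist

# Hypothetical single-node rendezvous settings for a one-rank test.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("Init COMPLETE")

# First NCCL collective: this is where the
# "Cuda failure 'invalid device function'" shows up on the K40.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(t.item())

dist.destroy_process_group()
```

Running it with NCCL_DEBUG=INFO set makes NCCL print which CUDA call it is failing on.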
I found the following resources:
(1) https://discuss.pytorch.org/t/ddp-with-nccl-fails-in-16-x-a100/127416/3
(2) https://stackoverflow.com/questions/66807131/how-to-solve-the-famous-unhandled-cuda-error-nccl-version-2-7-8-error
(3) https://uonfu.com/q/NVIDIA/nccl/444/754106178
(4) https://discuss.pytorch.org/t/nccl-error-in-pytorch-torch-lib-c10d-processgroupnccl-cpp/125423
I also tried your binary version "torch==1.3.1+cu101" and ran into the same problem.
Do you think there is any way I can work around this issue, or are my GPUs simply too old for PyTorch distributed training with the NCCL backend?
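One possible fallback, sketched below and untested on the K40, would be Gloo-backed DDP: the model and forward/backward stay on the GPU, but gradient reduction goes through Gloo instead of NCCL's CUDA kernels, trading communication speed for compatibility (the address/port values are placeholders for a one-rank test):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-node rendezvous settings for a one-rank test.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# Gloo backend instead of NCCL.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[0])

out = ddp_model(torch.randn(4, 10, device="cuda"))
out.sum().backward()  # gradient all-reduce is performed through Gloo
dist.destroy_process_group()
```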
Again, thank you very much for building all these binaries.