ktagen-sudo opened this issue 3 years ago
Hi, can you try running the pytorch word-level language modeling example https://github.com/pytorch/examples to verify that non-distributed training at least works?
I haven't tried distributed training with these wheels, so I'm sadly unsure of whether they'd work or not :(
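As a first step, a minimal sanity check along these lines (a sketch, assuming the torch==1.9.0+cu111 wheel is installed on a node with one K40) should tell you whether the build itself has working sm_35 kernels before you bring NCCL into the picture:

```python
import torch

# Report the build and the device PyTorch sees (K40 is compute capability 3.5).
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))

# A plain CUDA kernel launch: this raises a "no kernel image is available"
# style error if the wheel was built without sm_35 support.
x = torch.randn(1024, 1024, device="cuda")
y = x @ x
torch.cuda.synchronize()
print(y.sum().item())
```

If that runs cleanly and the word-level language modeling example trains, the problem is likely confined to the NCCL path rather than the CUDA build as a whole.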
Have you found a way to run torch.distributed on the Tesla K40?
First of all, thank you very much for building all of these backward-compatible PyTorch binaries for the NVIDIA Tesla K40. I am currently working on distributed training with the NCCL backend (GPUs). The cluster where I run my experiments is equipped with NVIDIA Tesla K40s, and I am using your binary version "torch==1.9.0+cu111".
When running an experiment, I hit the following error: "enqueue.cc:215 NCCL WARN Cuda failure 'invalid device function'",
which in turn leads to this fatal error: "RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1631630742027/work/torch/lib/c10d/ProcessGroupNCCL.cpp:38, unhandled cuda error, NCCL version 2.7.8 0: ncclUnhandledCudaError: Call to CUDA function failed."
It occurs right after the init phase finishes, so I still see the "Init COMPLETE" log just before the error.
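For reference, a minimal single-process stand-in for the failing setup looks like this (a sketch with a hypothetical local address and port; the real job uses a multi-node launcher). The failure surfaces on the first collective that launches a CUDA kernel, not in init_process_group itself, which matches seeing "Init COMPLETE" before the crash:

```python
import os
import torch
import torch.distributed as dist

# Hypothetical single-node rendezvous settings for a one-rank test.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
print("Init COMPLETE")

# First NCCL collective: this is where the
# "Cuda failure 'invalid device function'" shows up on the K40.
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(t.item())

dist.destroy_process_group()
```

Running it with NCCL_DEBUG=INFO set makes NCCL print which CUDA call it is failing on.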
I found the following resources:
(1) https://discuss.pytorch.org/t/ddp-with-nccl-fails-in-16-x-a100/127416/3
(2) https://stackoverflow.com/questions/66807131/how-to-solve-the-famous-unhandled-cuda-error-nccl-version-2-7-8-error
(3) https://uonfu.com/q/NVIDIA/nccl/444/754106178
(4) https://discuss.pytorch.org/t/nccl-error-in-pytorch-torch-lib-c10d-processgroupnccl-cpp/125423
I also tried your binary version "torch==1.3.1+cu101" and ran into the same problem.
Do you think there is any way I can work around this issue, or are my GPUs simply too old for PyTorch distributed training with the NCCL backend?
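One possible fallback, sketched below and untested on the K40, would be Gloo-backed DDP: the model and forward/backward stay on the GPU, but gradient reduction goes through Gloo instead of NCCL's CUDA kernels, trading communication speed for compatibility (the address/port values are placeholders for a one-rank test):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Hypothetical single-node rendezvous settings for a one-rank test.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")

# Gloo backend instead of NCCL.
dist.init_process_group(backend="gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10).cuda()
ddp_model = DDP(model, device_ids=[0])

out = ddp_model(torch.randn(4, 10, device="cuda"))
out.sum().backward()  # gradient all-reduce is performed through Gloo
dist.destroy_process_group()
```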
Again, thank you very much for building all these binaries.