isekai-portal / Link-Context-Learning


torch.distributed.DistBackendError: NCCL error #8

Closed by osttkm 7 months ago

osttkm commented 7 months ago

Hello,

I've encountered an NCCL error during training. It seems to stem from a mismatch between the installed torch version, the NVIDIA NCCL version, and the CUDA toolkit version they require.

Installing torch with the command from PyTorch's official website also brings in its own nvidia-nccl-cu package, which makes the compatibility issue hard to resolve.

Could you please specify the versions of torch and nvidia-nccl-cu used in this repository? Additionally, if there were any installations performed beyond the command recommended by PyTorch (pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118), could you share those details as well?
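For anyone comparing environments, the versions asked about above can be collected with a short script. This is only a sketch: the wheel names `nvidia-nccl-cu11` and `nvidia-nccl-cu12` are assumptions about what pip may have installed alongside torch, and `collect_versions` is a hypothetical helper, not part of this repository.

```python
import importlib.metadata as metadata


def collect_versions():
    """Return torch, CUDA and NCCL versions as a dict; unknown values stay None."""
    info = {"torch": None, "cuda": None, "nccl_in_torch": None, "nccl_wheels": {}}

    # Look for NCCL wheels installed by pip (package names are an assumption).
    for pkg in ("nvidia-nccl-cu11", "nvidia-nccl-cu12"):
        try:
            info["nccl_wheels"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            pass

    try:
        import torch
    except ImportError:
        # torch is not installed; report only what was found so far.
        return info

    info["torch"] = str(torch.__version__)
    info["cuda"] = torch.version.cuda  # None on CPU-only builds
    try:
        from torch.cuda import nccl

        info["nccl_in_torch"] = nccl.version()
    except Exception:
        # Builds without NCCL support may not expose a version.
        pass
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```

Posting this output together with the full NCCL error message makes it easier to tell whether torch, the NCCL wheel, and the CUDA toolkit actually disagree.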

I'm looking for guidance on how to align the versions properly or any alternative installation steps you might recommend to avoid this issue. Your assistance in this matter would be greatly appreciated.

Thank you!

osttkm commented 7 months ago

Thank you for your support. I've managed to resolve the issue myself. In my environment, the problem occurs with torch versions higher than 2.2.0, and downgrading torch fixed it.
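A guard along these lines could flag the affected range before training starts. It is a minimal sketch that assumes the 2.2.0 threshold reported above, which was only confirmed in one environment; `parse_release` and `is_newer_than` are hypothetical helper names.

```python
import re


def parse_release(version_string):
    """Extract (major, minor, patch) from strings such as '2.3.1+cu118'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_newer_than(version_string, threshold=(2, 2, 0)):
    """Return True if the version is strictly higher than the threshold."""
    return parse_release(version_string) > threshold


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed")
    else:
        if is_newer_than(str(torch.__version__)):
            print(f"torch {torch.__version__} is higher than 2.2.0; "
                  "the NCCL error reported in this issue may occur")
        else:
            print(f"torch {torch.__version__} is not higher than 2.2.0")
```

The local build suffix (for example `+cu118`) is ignored on purpose, so only the release numbers are compared.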