Open itsmeow opened 3 years ago
Couldn't find the issue officially documented anywhere, but I think NCCL simply doesn't support WSL right now.
I have exactly the same problem under WSL
Same here. And the same code works on my other machine where Ubuntu is the host OS.
I too have this same exact issue. I am able to run the nccl-tests and they pass with my RTX 3070
Setup
After doing all the updates and things to get CUDA on WSL2 (this guide: https://docs.nvidia.com/cuda/wsl-user-guide/index.html), I managed to get the program to run.
Per the guide's instructions, I did the following after upgrading to WSL2 and installing the CUDA driver for Windows:
I then installed CUDA Toolkit 10.0.0:
I also had to add symlinks to gcc-7 and g++-7 so that NVCC could compile apex, so those are in place as well.
Issue
However, whenever I try sampling anything, the program throws an error. I figured this might be because the apex install uses an older version of pytorch, so I tried it without apex, but the exact same error happens. Here's the log and a bunch of versions.
Sample run + Log (with NCCL_DEBUG=INFO)
Versions + Environment info
WSL2 Kernel version
NCCL Environment Variables
Conda Packages Installed
CUDA system packages
NVIDIA SMI output
(The GPU name is truncated, but I have a GTX 1050 Ti. I know it probably won't run the program very quickly, or at all, but I'd like to try.)
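For anyone who wants to gather the same kind of environment details for comparison, here is a minimal sketch using only the Python standard library. The helper names (nccl_env, looks_like_wsl, collect_report) are made up for illustration, and the WSL check is a heuristic that assumes the kernel release string contains "microsoft", which may not hold on every setup.

```python
import os
import platform
import shutil
import subprocess


def nccl_env(environ=None):
    """Return only the NCCL_* variables from an environment mapping."""
    environ = os.environ if environ is None else environ
    return {k: v for k, v in environ.items() if k.startswith("NCCL_")}


def looks_like_wsl(release=None):
    """Heuristic (assumption): WSL kernel release strings contain 'microsoft'."""
    release = platform.release() if release is None else release
    return "microsoft" in release.lower()


def collect_report():
    """Collect kernel release, the WSL guess, NCCL variables, and nvidia-smi output."""
    report = {
        "kernel": platform.release(),
        "wsl": looks_like_wsl(),
        "nccl_env": nccl_env(),
        "nvidia_smi": None,  # stays None when nvidia-smi is not on PATH
    }
    smi = shutil.which("nvidia-smi")
    if smi:
        result = subprocess.run([smi], capture_output=True, text=True)
        report["nvidia_smi"] = result.stdout
    return report


if __name__ == "__main__":
    for key, value in collect_report().items():
        print(f"{key}: {value}")
```

Running the script prints one entry per line, which can be pasted next to the sections above.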
Other things I've tried
I've tested it with NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=lo, similar errors occur. I'm not going to put the output of ifconfig, but eth0 and lo are the only existing interfaces.
Conclusion
Now, I certainly tried just about everything I could find within my technical knowledge in order to get this to run before creating this issue, so please, if anyone has any suggestions, do share! Has anyone ever got this to run on WSL2 Ubuntu? I'm sure it's possible, but I must be missing something. I don't know enough about graphics programming and machine learning to investigate myself, unfortunately.
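For anyone trying to reproduce this, here is a rough sketch of how the NCCL environment variables mentioned above (NCCL_DEBUG, NCCL_IB_DISABLE, NCCL_SOCKET_IFNAME) could be applied to a command one combination at a time. The variant names, the grouping of variables, and the helpers build_env and run_variant are assumptions for illustration only. The command to run is whatever you pass on the command line; nothing here is specific to any particular program.

```python
import os
import subprocess
import sys

# Hypothetical groupings of the variables mentioned in this issue.
VARIANTS = {
    "baseline": {"NCCL_DEBUG": "INFO"},
    "no_ib": {"NCCL_DEBUG": "INFO", "NCCL_IB_DISABLE": "1"},
    "loopback": {
        "NCCL_DEBUG": "INFO",
        "NCCL_IB_DISABLE": "1",
        "NCCL_SOCKET_IFNAME": "lo",
    },
}


def build_env(overrides, base=None):
    """Copy the base environment and apply the given overrides on top."""
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


def run_variant(name, command):
    """Run command with one variant's variables; return (exit code, combined output)."""
    result = subprocess.run(
        command,
        env=build_env(VARIANTS[name]),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout + result.stderr


if __name__ == "__main__":
    command = sys.argv[1:]
    if not command:
        sys.exit("usage: python nccl_variants.py <command> [args...]")
    for name in VARIANTS:
        code, output = run_variant(name, command)
        print(f"=== {name} (exit code {code}) ===")
        print(output)
```

Comparing the exit codes and logs across variants makes it easier to tell whether any of the settings changes the failure.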