openai / jukebox

Code for the paper "Jukebox: A Generative Model for Music"
Unhandled NCCL error when running on WSL2 with CUDA (tested w/o apex) #187

Open itsmeow opened 3 years ago

itsmeow commented 3 years ago


After working through all the updates needed to get CUDA on WSL2 (this guide:), I managed to get the program to run.

Per the guide's instructions, I did the following after upgrading to WSL2 and installing the CUDA driver for Windows:

sudo apt-key adv --fetch-keys
sudo sh -c 'echo "deb /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update

I then installed CUDA Toolkit 10.0.0:

sudo apt-get install cuda-toolkit-10-0

I also had to symlink gcc-7 and g++-7 into CUDA's bin directory in order to get apex's NVCC build to compile:

sudo apt install gcc-7 g++-7
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda/bin/gcc 
sudo ln -s /usr/bin/g++-7 /usr/local/cuda/bin/g++


However, whenever I try sampling anything, the program throws an error. I figured this might be because the apex install uses an older version of PyTorch, so I tried it without apex, but the exact same error happens. Here's the log and a bunch of version info.

Sample run + Log (with NCCL_DEBUG=INFO)

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ python jukebox/ --model=5b_lyrics --name=bad_day_5b_prompted --levels=3 --mode=primed --audio_file=bad_day.wav --prompt_length_in_seconds=17 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Using cuda True
{'name': 'bad_day_5b_prompted', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from azure
WSPC-2:11126:11126 [0] NCCL INFO Bootstrap : Using [0]eth0:<0>
WSPC-2:11126:11126 [0] NCCL INFO NET/Plugin : No plugin found (

WSPC-2:11126:11126 [0] misc/ NCCL WARN Failed to open[.1]
WSPC-2:11126:11126 [0] NCCL INFO NET/Socket : Using [0]eth0:<0>
NCCL version 2.4.8+cuda10.0

WSPC-2:11126:11161 [0] misc/ NCCL WARN Could not find real path of /sys/class/pci_bus/0000:06/../../0000:06:00.0
WSPC-2:11126:11161 [0] NCCL INFO -> 2
WSPC-2:11126:11161 [0] NCCL INFO -> 2
WSPC-2:11126:11161 [0] NCCL INFO -> 2
WSPC-2:11126:11161 [0] NCCL INFO misc/ -> 2 [Async thread]
Traceback (most recent call last):
  File "jukebox/", line 279, in <module>
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/", line 366, in _Fire
    component, remaining_args)
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/", line 276, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/", line 181, in save_samples
    vqvae, priors = make_model(model, device, hps)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/", line 191, in make_model
    vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length=hps.get('sample_length', 0), sample_length_in_seconds=hps.get('sample_length_in_seconds', 0))), device)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/", line 95, in make_vqvae
    restore_model(hps, vqvae, hps.restore_vqvae)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/", line 55, in restore_model
    checkpoint = load_checkpoint(checkpoint_path)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/", line 36, in load_checkpoint
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/utils/", line 35, in barrier
    return _barrier()
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/utils/", line 68, in _barrier
    return dist.barrier()
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/torch/distributed/", line 1489, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1579040055865/work/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
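The traceback ends in `dist.barrier()`, so the failure can probably be reproduced without Jukebox at all. Below is a minimal sketch, assuming a single process and PyTorch's default `env://` rendezvous. The port 29500 is an arbitrary choice, and `rendezvous_env` and `run_barrier` are hypothetical helper names, not part of Jukebox.

```python
import os


def rendezvous_env(addr="127.0.0.1", port=29500):
    """Variables the env:// init method reads for a one-process group."""
    return {
        "MASTER_ADDR": addr,
        "MASTER_PORT": str(port),  # assumed to be a free port
        "RANK": "0",
        "WORLD_SIZE": "1",
    }


def run_barrier(backend):
    """Create a one-process group on `backend` and call barrier() once."""
    # Imported lazily so rendezvous_env() can be used without PyTorch.
    import torch.distributed as dist

    os.environ.update(rendezvous_env())
    dist.init_process_group(backend=backend)
    try:
        dist.barrier()  # the call that raises in the traceback above
    finally:
        dist.destroy_process_group()
```

Running `run_barrier("gloo")` and `run_barrier("nccl")` in separate Python sessions should help narrow things down: if the gloo call returns and the nccl call raises the same `RuntimeError`, the problem is likely in the NCCL backend rather than in the Jukebox code.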

Versions + Environment info

WSL2 Kernel version

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ cat /proc/version
Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020
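The `microsoft-standard` suffix in that kernel string is a common way to tell from a script that it is running under WSL2. The sketch below uses a plain substring test, which is a heuristic and an assumption rather than an official API; `is_wsl` and `read_proc_version` are hypothetical helper names.

```python
def is_wsl(version_text):
    """Return True if a /proc/version string looks like a WSL kernel."""
    return "microsoft" in version_text.lower()


def read_proc_version(path="/proc/version"):
    """Read the kernel version string; return '' if the file is missing."""
    try:
        with open(path) as fh:
            return fh.read().strip()
    except OSError:
        return ""


# Kernel string copied from the output above.
sample = ("Linux version 4.19.128-microsoft-standard (oe-user@oe-host) "
          "(gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020")
print(is_wsl(sample))
print(is_wsl(read_proc_version()))  # result depends on the machine
```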

NCCL Environment Variables

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ printenv | grep NCCL

Conda Packages Installed

CUDA system packages

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ nvidia-smi
Fri Dec  4 02:39:57 2020
| NVIDIA-SMI 455.45.01    Driver Version: 465.12       CUDA Version: 11.2     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  GeForce GTX 105...  Off  | 00000000:06:00.0  On |                  N/A |
| 30%   33C    P0    N/A /  75W |    877MiB /  4096MiB |    ERR!      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

(The GPU name is truncated, but I have a GTX 1050 Ti. I know it probably won't run the program very quickly, or at all, but I'd like to try.)
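The Bus-Id in the table (`00000000:06:00.0`) is the same device the NCCL warning in the log refers to (`0000:06:00.0`). Below is a small sketch for checking whether that sysfs path resolves on a given machine. The path format is copied from the warning text; how NCCL builds it internally is an assumption, and `nccl_sysfs_path` is a hypothetical helper name.

```python
import os


def nccl_sysfs_path(bus_id):
    """Build the sysfs path seen in the NCCL warning from a PCI bus id.

    Accepts nvidia-smi style ids ('00000000:06:00.0') as well as the
    short form ('0000:06:00.0').
    """
    domain, bus, dev_fn = bus_id.lower().split(":")
    domain = domain[-4:]  # sysfs uses a 4-digit PCI domain
    return f"/sys/class/pci_bus/{domain}:{bus}/../../{domain}:{bus}:{dev_fn}"


path = nccl_sysfs_path("00000000:06:00.0")
# os.path.exists follows symlinks, so this is False when the path
# cannot be resolved, which is what the warning complains about.
print(path, "->", "found" if os.path.exists(path) else "missing")
```

If this reports `missing` under WSL2 but `found` for the equivalent device on a native Linux install, that would be consistent with the warning, though it does not by itself prove that this lookup is what makes the barrier fail.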

Other things I've tried

I've tested it with NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=lo, and similar errors occur. I'm not going to paste the output of ifconfig, but eth0 and lo are the only existing interfaces.
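For anyone repeating these attempts, here is a rough sketch of running one command under each NCCL setting and collecting the output. `CMD` is a placeholder (an assumption) that only echoes the NCCL variables it sees; swap in the real sampling command to compare failures. `run_with_env` is a hypothetical helper name.

```python
import os
import subprocess
import sys

# Settings mentioned in this report; each dict is merged into the
# current environment for one run.
SETTINGS = [
    {"NCCL_DEBUG": "INFO"},
    {"NCCL_DEBUG": "INFO", "NCCL_IB_DISABLE": "1"},
    {"NCCL_DEBUG": "INFO", "NCCL_SOCKET_IFNAME": "lo"},
]

# Placeholder command: prints the NCCL_* variable names it receives.
CMD = [sys.executable, "-c",
       "import os; print(sorted(k for k in os.environ if k.startswith('NCCL_')))"]


def run_with_env(cmd, extra_env):
    """Run cmd with extra_env merged in; return (exit code, combined output)."""
    env = dict(os.environ, **extra_env)
    proc = subprocess.run(cmd, env=env, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, universal_newlines=True)
    return proc.returncode, proc.stdout


for extra in SETTINGS:
    code, out = run_with_env(CMD, extra)
    print(extra, "-> exit", code, out.strip())
```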


I tried just about everything I could find within my technical knowledge to get this running before creating this issue, so if anyone has any suggestions, please share! Has anyone ever gotten this to run on WSL2 Ubuntu? I'm sure it's possible, but I must be missing something. Unfortunately, I don't know enough about graphics programming and machine learning to investigate this myself.

amannm commented 3 years ago

Couldn't find the issue officially documented anywhere, but I think NCCL simply doesn't support WSL right now.

jinzishuai commented 3 years ago

I have exactly the same problem under WSL

kbjiang commented 3 years ago

Same here. And the same code works on my other machine where Ubuntu is the host OS.

wpflueger commented 2 years ago

I too have this exact same issue. I am able to run the nccl-tests, and they pass with my RTX 3070.