failure to initialize NCCL

metaphorz commented 3 years ago

I think that NCCL is part of PyTorch? I am running Python 3.9 so I had to install torch using

-c=conda-forge

as specified in the instructions for installing torch. It seemed to install correctly.

........

jukebox paul$ python3 jukebox/sample.py --model=5b_lyrics --name=sample_5b --levels=3 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Caught error during NCCL init (attempt 0 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 1 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 2 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 3 of 5): Distributed package doesn't have NCCL built in
Caught error during NCCL init (attempt 4 of 5): Distributed package doesn't have NCCL built in
Traceback (most recent call last):
  File "jukebox/sample.py", line 279, in <module>
    fire.Fire(run)

...bunch more stuff....

raise RuntimeError("Failed to initialize NCCL")

RuntimeError: Failed to initialize NCCL
(jukebox) host:jukebox paul$

nbinobied commented 3 years ago

I was able to get past this issue by installing Miniconda with Python 3.7, then changing line 43 in jukebox/utils/dist_utils.py from backend = "nccl" to backend = "gloo".
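For anyone who would rather not hard-code the change, here is a minimal sketch of picking the backend at runtime. It assumes the installed PyTorch exposes torch.distributed.is_nccl_available(); the helper names pick_backend and detect_backend are hypothetical and are not part of Jukebox.

```python
def pick_backend(nccl_available: bool, cuda_available: bool) -> str:
    """Return "nccl" only when it can actually be used, otherwise "gloo"."""
    return "nccl" if (nccl_available and cuda_available) else "gloo"


def detect_backend() -> str:
    """Probe the installed PyTorch build (hypothetical helper, not Jukebox code)."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        # No PyTorch at all: report the CPU-friendly default.
        return "gloo"
    # is_nccl_available() is only consulted when the distributed package exists.
    nccl_ok = dist.is_available() and dist.is_nccl_available()
    return pick_backend(nccl_ok, torch.cuda.is_available())


if __name__ == "__main__":
    # The printed value is what one might assign to `backend` in dist_utils.py.
    print(detect_backend())
```

If this prints gloo, the NCCL error above is expected with the current install, and the one-line edit described in this comment is the matching workaround.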

redsphinx commented 1 year ago

Ran into this problem on my Ubuntu 22.04 machine with a 3090 running CUDA 11.7. I solved it with the installation below: I changed the cudatoolkit version to match my system and explicitly installed the oldest PyTorch build with GPU support that matched my CUDA version as closely as possible. I also changed the backend to gloo, as @nbinobied suggested. Only changing the backend didn't work for me.

conda create --name jukebox python=3.7.5
conda activate jukebox
conda install mpi4py=3.0.3 # if this fails, try: pip install mpi4py==3.0.3
conda install pytorch=1.4 torchvision=0.5 cudatoolkit=11.0.221 -c pytorch 
git clone https://github.com/openai/jukebox.git
cd jukebox
pip install -r requirements.txt
pip install -e .
pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116
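After an install like the one above, it can help to confirm what the resulting PyTorch build actually supports before rerunning sample.py. The following is a small sketch, not part of Jukebox; the function name torch_build_report is made up for illustration, and it only uses torch.__version__, torch.version.cuda, torch.cuda.is_available() and torch.distributed.is_nccl_available().

```python
def torch_build_report() -> dict:
    """Summarize the installed PyTorch build (hypothetical helper)."""
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        return {"torch_installed": False}
    dist_ok = dist.is_available()
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # CUDA version the wheel was built against; None on CPU-only builds.
        "built_for_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
        # Only query NCCL when the distributed package is present.
        "nccl_available": dist_ok and dist.is_nccl_available(),
    }


if __name__ == "__main__":
    for key, value in torch_build_report().items():
        print(f"{key}: {value}")
```

If nccl_available comes back False, the "Distributed package doesn't have NCCL built in" message is likely to reappear unless the backend is switched to gloo.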

itsnotlupus commented 1 year ago

I had this issue on a WSL 2 setup where I followed the README instructions religiously. Using @redsphinx's last line made it work (i.e. it grabbed a version of NCCL that agreed with WSL and didn't clash with other dependencies), with the backend left unchanged:

pip install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu116

My 3090 Ti has now generated 6 wonderfully weird music samples. All is well.

openai / jukebox

failure to initialize NCCL #216