openai / jukebox

Code for the paper "Jukebox: A Generative Model for Music"
https://openai.com/blog/jukebox/

Unhandled NCCL error when running on WSL2 with CUDA (tested w/o apex) #187

Open itsmeow opened 3 years ago

itsmeow commented 3 years ago

Setup

After following NVIDIA's guide for getting CUDA working on WSL2 (https://docs.nvidia.com/cuda/wsl-user-guide/index.html), I managed to get the program to run.

Per the guide's instructions, I did the following after upgrading to WSL2 and installing the CUDA driver for Windows:

sudo apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
sudo sh -c 'echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 /" > /etc/apt/sources.list.d/cuda.list'
sudo apt-get update

I then installed CUDA Toolkit 10.0.0:

sudo apt-get install cuda-toolkit-10-0

I also had to symlink gcc-7 and g++-7 into the CUDA bin directory so that NVCC could compile apex:

sudo apt install gcc-7 g++-7
sudo ln -s /usr/bin/gcc-7 /usr/local/cuda/bin/gcc 
sudo ln -s /usr/bin/g++-7 /usr/local/cuda/bin/g++

Issue

However, whenever I try sampling anything, the program throws an error. I figured this might be because the apex install uses an older version of PyTorch, so I tried it without apex, but exactly the same error happens. Here's the log and a list of versions.

Sample run + Log (with NCCL_DEBUG=INFO)

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ python jukebox/sample.py --model=5b_lyrics --name=bad_day_5b_prompted --levels=3 --mode=primed --audio_file=bad_day.wav --prompt_length_in_seconds=17 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=6 --hop_fraction=0.5,0.5,0.125
Using cuda True
{'name': 'bad_day_5b_prompted', 'levels': 3, 'sample_length_in_seconds': 20, 'total_sample_length_in_seconds': 180, 'sr': 44100, 'n_samples': 6, 'hop_fraction': (0.5, 0.5, 0.125)}
Setting sample length to 881920 (i.e. 19.998185941043083 seconds) to be multiple of 128
Downloading from azure
WSPC-2:11126:11126 [0] NCCL INFO Bootstrap : Using [0]eth0:172.31.195.62<0>
WSPC-2:11126:11126 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).

WSPC-2:11126:11126 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
WSPC-2:11126:11126 [0] NCCL INFO NET/Socket : Using [0]eth0:172.31.195.62<0>
NCCL version 2.4.8+cuda10.0

WSPC-2:11126:11161 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:06/../../0000:06:00.0
WSPC-2:11126:11161 [0] NCCL INFO init.cc:876 -> 2
WSPC-2:11126:11161 [0] NCCL INFO init.cc:909 -> 2
WSPC-2:11126:11161 [0] NCCL INFO init.cc:947 -> 2
WSPC-2:11126:11161 [0] NCCL INFO misc/group.cc:69 -> 2 [Async thread]
Traceback (most recent call last):
  File "jukebox/sample.py", line 279, in <module>
    fire.Fire(run)
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 276, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/sample.py", line 181, in save_samples
    vqvae, priors = make_model(model, device, hps)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/make_models.py", line 191, in make_model
    vqvae = make_vqvae(setup_hparams(vqvae, dict(sample_length=hps.get('sample_length', 0), sample_length_in_seconds=hps.get('sample_length_in_seconds', 0))), device)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/make_models.py", line 95, in make_vqvae
    restore_model(hps, vqvae, hps.restore_vqvae)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/make_models.py", line 55, in restore_model
    checkpoint = load_checkpoint(checkpoint_path)
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/make_models.py", line 36, in load_checkpoint
    dist.barrier()
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/utils/dist_adapter.py", line 35, in barrier
    return _barrier()
  File "/home/itsmeow/OpenAI-Jukebox/jukebox/jukebox/utils/dist_adapter.py", line 68, in _barrier
    return dist.barrier()
  File "/home/itsmeow/miniconda3/envs/jukebox/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1489, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1579040055865/work/torch/lib/c10d/ProcessGroupNCCL.cpp:410, unhandled system error, NCCL version 2.4.8
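
To make the log above easier to scan, here is a minimal sketch that pulls out only the NCCL WARN lines. The line format it expects (host:pid:tid [rank] file:line NCCL WARN message) is inferred from this log alone, and the names `WARN_RE` and `nccl_warnings` are made up for illustration. In this run it surfaces two warnings: the libibverbs one, which is commonly seen on machines without InfiniBand, and the topo.cc one about the sysfs path. The later "-> 2" lines and the RuntimeError appear to follow from the topo.cc warning, though that is a reading of the log rather than something confirmed.

```python
import re

# Assumed line format, inferred from the log in this issue:
#   host:pid:tid [rank] file:line NCCL WARN message
WARN_RE = re.compile(
    r"^(?P<host>\S+):(?P<pid>\d+):(?P<tid>\d+) \[(?P<rank>\d+)\] "
    r"(?P<src>\S+:\d+) NCCL WARN (?P<msg>.*)$"
)


def nccl_warnings(log_text):
    """Return (source_location, message) pairs for each NCCL WARN line."""
    found = []
    for line in log_text.splitlines():
        match = WARN_RE.match(line.strip())
        if match:
            found.append((match.group("src"), match.group("msg")))
    return found


# A few lines copied from the log above.
SAMPLE = """\
WSPC-2:11126:11126 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so).
WSPC-2:11126:11126 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
WSPC-2:11126:11161 [0] misc/topo.cc:22 NCCL WARN Could not find real path of /sys/class/pci_bus/0000:06/../../0000:06:00.0
WSPC-2:11126:11161 [0] NCCL INFO init.cc:876 -> 2
"""

if __name__ == "__main__":
    for src, msg in nccl_warnings(SAMPLE):
        print(src, "|", msg)
```

INFO lines are skipped on purpose, since the "-> 2" entries only repeat a result code and do not say what went wrong.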

Versions + Environment info

WSL2 Kernel version

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ cat /proc/version
Linux version 4.19.128-microsoft-standard (oe-user@oe-host) (gcc version 8.2.0 (GCC)) #1 SMP Tue Jun 23 12:58:10 UTC 2020
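
For anyone who wants a script to detect this environment before attempting an NCCL barrier, here is a small sketch based on the kernel string above. It relies on the heuristic that WSL kernels include "microsoft" in /proc/version, as this one does; the function names are hypothetical and the check is not an official API.

```python
def looks_like_wsl(proc_version_text):
    """Heuristic: the WSL kernel string shown above contains 'microsoft'."""
    return "microsoft" in proc_version_text.lower()


def running_under_wsl(path="/proc/version"):
    """Read /proc/version and apply the heuristic; False if it cannot be read."""
    try:
        with open(path) as handle:
            return looks_like_wsl(handle.read())
    except OSError:
        return False


if __name__ == "__main__":
    print("WSL detected:", running_under_wsl())
```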

NCCL Environment Variables

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ printenv | grep NCCL
NCCL_DEBUG=INFO

Conda Packages Installed

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ conda list
# packages in environment at /home/itsmeow/miniconda3/envs/jukebox:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
apex                      0.1                      pypi_0    pypi
audioread                 2.1.9                    pypi_0    pypi
av                        7.0.1            py37h82f89c2_2    conda-forge
bzip2                     1.0.8                h516909a_3    conda-forge
ca-certificates           2020.11.8            ha878542_0    conda-forge
certifi                   2020.11.8        py37h89c1867_0    conda-forge
cffi                      1.14.4                   pypi_0    pypi
cudatoolkit               10.0.130                      0
decorator                 4.4.2                    pypi_0    pypi
ffmpeg                    4.2.3                h167e202_0    conda-forge
fire                      0.1.3                    pypi_0    pypi
freetype                  2.10.4               h7ca028e_0    conda-forge
gmp                       6.2.1                h58526e2_0    conda-forge
gnutls                    3.6.13               h85f3911_1    conda-forge
intel-openmp              2020.2                      254
joblib                    0.17.0                   pypi_0    pypi
jpeg                      9d                   h36c2ea0_0    conda-forge
jukebox                   1.0                       dev_0    <develop>
lame                      3.100             h14c3975_1001    conda-forge
lcms2                     2.11                 hcbb858e_1    conda-forge
libblas                   3.8.0               17_openblas    conda-forge
libcblas                  3.8.0               17_openblas    conda-forge
libedit                   3.1.20191231         h14c3975_1
libffi                    3.2.1             hf484d3e_1007
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libiconv                  1.16                 h516909a_0    conda-forge
liblapack                 3.8.0               17_openblas    conda-forge
libopenblas               0.3.10               h5a2b251_0
libpng                    1.6.37               h21135ba_2    conda-forge
librosa                   0.7.2                    pypi_0    pypi
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h4f3a223_6    conda-forge
libwebp-base              1.1.0                h36c2ea0_3    conda-forge
llvmlite                  0.31.0                   pypi_0    pypi
lz4-c                     1.9.2                he1b5a44_3    conda-forge
mkl                       2020.2                      256
mpi                       1.0                       mpich
mpi4py                    3.0.3            py37hf046da1_1
mpich                     3.3.2                hc856adb_0
ncurses                   6.2                  he6710b0_1
nettle                    3.6                  he412f7d_0    conda-forge
ninja                     1.10.2           py37hff7bd54_0
numba                     0.48.0                   pypi_0    pypi
numpy                     1.19.4           py37h7e9df27_1    conda-forge
olefile                   0.46               pyh9f0ad1d_1    conda-forge
openh264                  2.1.1                h8b12597_0    conda-forge
openssl                   1.1.1h               h516909a_0    conda-forge
pillow                    8.0.1            py37h63a5d19_0    conda-forge
pip                       20.3             py37h06a4308_0
protobuf                  3.14.0                   pypi_0    pypi
pycparser                 2.20                       py_2
python                    3.7.5                h0371630_0
python_abi                3.7                     1_cp37m    conda-forge
pytorch                   1.4.0           py3.7_cuda10.0.130_cudnn7.6.3_0    pytorch
readline                  7.0                  h7b6447c_5
resampy                   0.2.2                    pypi_0    pypi
scikit-learn              0.23.2                   pypi_0    pypi
scipy                     1.5.4                    pypi_0    pypi
setuptools                50.3.2           py37h06a4308_2
six                       1.15.0           py37h06a4308_0
soundfile                 0.10.3.post1             pypi_0    pypi
sqlite                    3.33.0               h62c20be_0
tensorboardx              1.8                      pypi_0    pypi
threadpoolctl             2.1.0                    pypi_0    pypi
tk                        8.6.10               hbc83047_0
torchvision               0.5.0                py37_cu100    pytorch
tqdm                      4.45.0                   pypi_0    pypi
unidecode                 1.1.1                    pypi_0    pypi
wheel                     0.36.0             pyhd3eb1b0_0
x264                      1!152.20180806       h14c3975_0    conda-forge
xz                        5.2.5                h7b6447c_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.5                h6597ccf_2    conda-forge

CUDA system packages

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ sudo apt list --installed | grep cuda

cuda-command-line-tools-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-compiler-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cublas-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cublas-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cudart-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cudart-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cufft-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cufft-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cuobjdump-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cupti-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-curand-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-curand-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cusolver-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cusolver-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cusparse-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-cusparse-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-documentation-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-driver-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-gdb-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-gpu-library-advisor-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-libraries-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-license-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-memcheck-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-misc-headers-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-npp-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-npp-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nsight-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nsight-compute-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvcc-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvdisasm-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvgraph-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvgraph-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvjpeg-10-0/unknown,now 10.0.130.1-1 amd64 [installed,automatic]
cuda-nvjpeg-dev-10-0/unknown,now 10.0.130.1-1 amd64 [installed,automatic]
cuda-nvml-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvprof-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvprune-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvrtc-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvrtc-dev-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvtx-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-nvvp-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-samples-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-toolkit-10-0/unknown,now 10.0.130-1 amd64 [installed]
cuda-tools-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]
cuda-visual-tools-10-0/unknown,now 10.0.130-1 amd64 [installed,automatic]

NVIDIA SMI output

(jukebox) itsmeow@WSPC-2:~/OpenAI-Jukebox/jukebox$ nvidia-smi
Fri Dec  4 02:39:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 465.12       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:06:00.0  On |                  N/A |
| 30%   33C    P0    N/A /  75W |    877MiB /  4096MiB |    ERR!      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

(The GPU name is truncated, but I have a GTX 1050 Ti. I know it probably won't run the program very quickly, or at all, but I'd like to try.)

Other things I've tried

I've tested it with NCCL_IB_DISABLE=1 and NCCL_SOCKET_IFNAME=lo; similar errors occur. I'm not going to post the output of ifconfig, but eth0 and lo are the only interfaces that exist.
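
Another check that might help narrow this down is whether the sysfs path from the topo.cc warning exists at all inside WSL2. The sketch below rebuilds that path from the Bus-Id that nvidia-smi prints and tests it. The path layout is copied from the single warning line in the log, not from NCCL's source, so treat it as an assumption; `nccl_sysfs_path` and `resolves` are illustrative names.

```python
import os


def nccl_sysfs_path(bus_id):
    """Build the sysfs path named in the topo.cc warning.

    bus_id is the nvidia-smi style ID, e.g. "00000000:06:00.0".
    Layout assumed from the warning line in this issue.
    """
    domain, bus, dev_fn = bus_id.lower().split(":")
    domain = domain[-4:]  # the warning shows a 4-digit PCI domain
    return f"/sys/class/pci_bus/{domain}:{bus}/../../{domain}:{bus}:{dev_fn}"


def resolves(path):
    """True if the path can be resolved on this system."""
    return os.path.exists(path)


if __name__ == "__main__":
    path = nccl_sysfs_path("00000000:06:00.0")
    print(path, "->", "found" if resolves(path) else "missing")
```

If the path is reported missing, that would be consistent with the warning, and it would suggest the failure happens before any network settings such as NCCL_SOCKET_IFNAME come into play.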

Conclusion

Now, I certainly tried just about everything I could find within my technical knowledge in order to get this to run before creating this issue, so please, if anyone has any suggestions, do share! Has anyone ever got this to run on WSL2 Ubuntu? I'm sure it's possible, but I must be missing something. I don't know enough about graphics programming and machine learning to investigate myself, unfortunately.

amannm commented 3 years ago

Couldn't find the issue officially documented anywhere, but I think NCCL simply doesn't support WSL right now.

jinzishuai commented 3 years ago

I have exactly the same problem under WSL.

kbjiang commented 3 years ago

Same here. And the same code works on my other machine where Ubuntu is the host OS.

wpflueger commented 2 years ago

I have exactly the same issue too. I am able to run the nccl-tests, and they pass with my RTX 3070.