p-lambda / jukemir

Perform transfer learning for MIR using Jukebox!
MIT License
174 stars, 23 forks

RuntimeError: CuDNN Error: CUDNN_STATUS_MAPPING_ERROR #12

rbyhrb closed this issue 1 year ago

rbyhrb commented 1 year ago

When I first ran the Docker image, I got the "found no NVIDIA driver" error reported in another issue. After installing nvidia-container, that problem seemed to be solved.

Then I tried the following command again. Since the machine has 2 cards, I assigned only card 0:

    sudo docker run -it --rm --gpus='"device=0"' -v xxx:/input -v xxx:/output --entrypoint bash jukemir/representations_jukebox

And then, inside the container:

    python main.py --batch_size 8

After a few minutes (of initializing, I guess), I got the following error:

    Traceback (most recent call last):
      File "main.py", line 177, in <module>
        representation = get_acts_from_file(input_path, hps, vqvae, top_prior, meanpool=True)
      File "main.py", line 86, in get_acts_from_file
        z = get_z(audio, vqvae)
      File "main.py", line 27, in get_z
        zs = vqvae.encode(torch.cuda.FloatTensor(audio[np.newaxis, :, np.newaxis]))
      File "/code/jukebox/jukebox/vqvae/vqvae.py", line 141, in encode
        zs_i = self._encode(x_i, start_level=start_level, end_level=end_level)
      File "/code/jukebox/jukebox/vqvae/vqvae.py", line 132, in _encode
        x_out = encoder(x_in)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/code/jukebox/jukebox/vqvae/encdec.py", line 80, in forward
        x = level_block(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/code/jukebox/jukebox/vqvae/encdec.py", line 26, in forward
        return self.model(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 202, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

I googled it and added torch.backends.cudnn.enabled = False to main.py, but a new problem occurred:

    Traceback (most recent call last):
      File "main.py", line 179, in <module>
        representation = get_acts_from_file(input_path, hps, vqvae, top_prior, meanpool=True)
      File "main.py", line 88, in get_acts_from_file
        z = get_z(audio, vqvae)
      File "main.py", line 29, in get_z
        zs = vqvae.encode(torch.cuda.FloatTensor(audio[np.newaxis, :, np.newaxis]))
      File "/code/jukebox/jukebox/vqvae/vqvae.py", line 141, in encode
        zs_i = self._encode(x_i, start_level=start_level, end_level=end_level)
      File "/code/jukebox/jukebox/vqvae/vqvae.py", line 132, in _encode
        x_out = encoder(x_in)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/code/jukebox/jukebox/vqvae/encdec.py", line 80, in forward
        x = level_block(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/code/jukebox/jukebox/vqvae/encdec.py", line 26, in forward
        return self.model(x)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 100, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 202, in forward
        self.padding, self.dilation, self.groups)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)

Did I miss anything?
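
Both tracebacks above fail inside a plain convolution, once with cuDNN enabled and once with it disabled. One way to narrow this down might be to run a single small convolution on the same GPU outside of Jukebox. The following is only a minimal sketch and not part of jukemir: the helper name `smoke_test_conv` and the tensor shapes are made up for illustration, and it assumes PyTorch is importable in the environment being tested. CUDA kernels launch asynchronously, so `CUDA_LAUNCH_BLOCKING=1` is set to make errors surface at the call that caused them.

```python
import os

# Must be set before CUDA is initialized; makes kernel launches synchronous
# so an error is reported at the call that actually triggered it.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def smoke_test_conv(use_cudnn=True):
    """Run one small Conv1d forward pass on the GPU and describe the outcome.

    Hypothetical helper for illustration; not part of jukemir or Jukebox.
    """
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"

    # Same global switch that was added to main.py in the post above.
    torch.backends.cudnn.enabled = use_cudnn
    try:
        conv = torch.nn.Conv1d(1, 4, kernel_size=3, padding=1).cuda()
        x = torch.randn(1, 1, 1024, device="cuda")
        y = conv(x)
        torch.cuda.synchronize()  # wait for the kernel to finish
        return "ok, output shape {}".format(tuple(y.shape))
    except RuntimeError as err:
        return "failed: {}".format(err)


if __name__ == "__main__":
    # Note: after a CUDA error the process state may be unusable, so it is
    # safer to run each setting in a fresh process if the first one fails.
    for flag in (True, False):
        print("cudnn enabled =", flag, "->", smoke_test_conv(use_cudnn=flag))
```

If this tiny convolution also fails inside the container, the problem probably lies in the environment (driver, CUDA build, or GPU support) and not in the Jukebox code or the input audio.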

rbyhrb commented 1 year ago

I solved the problem by copying the source code from the Docker image to a local directory and running it in my own conda env. The problem seems to have originated from environment issues.
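
To see which part of the environment differs, one option would be to print the same version information inside the container and inside the conda env and compare the two. This is a hedged sketch under the assumption that PyTorch is installed in both places; `describe_env` is a hypothetical helper name, not something provided by jukemir.

```python
import platform
import sys


def describe_env():
    """Collect version details that commonly matter for CUDA/cuDNN errors.

    Hypothetical helper for illustration; not part of jukemir.
    """
    info = {"python": platform.python_version(), "executable": sys.executable}
    try:
        import torch
    except ImportError:
        info["torch"] = None
        return info

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version torch was built with
    info["cudnn"] = torch.backends.cudnn.version()
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["device_count"] = torch.cuda.device_count()
        info["device_name"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


if __name__ == "__main__":
    for key, value in sorted(describe_env().items()):
        print("{}: {}".format(key, value))
```

With `--gpus='"device=0"'` as in the command above, `device_count` inside the container would be expected to be 1. A mismatch between `cuda_build` and what the GPU's compute capability supports is one possible cause of errors like these, though the thread does not confirm which difference was responsible.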