jaywalnut310 / glow-tts

A Generative Flow for Text-to-Speech via Monotonic Alignment Search
MIT License

CUDA error. What version of CUDA are you running? #7

Closed: echelon closed this issue 4 years ago

echelon commented 4 years ago

I installed all of the requirements and apex at the provided SHA. When I attempt to train, it crashes with the following error:

CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:484
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:485
CUDA runtime error: an illegal instruction was encountered (73) in magma_sgetri_gpu at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/src/sgetri_gpu.cpp:164
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:946
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:484
CUDA runtime error: an illegal instruction was encountered (73) in magmablas_strsm at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/magmablas/strsm.cu:485
CUDA runtime error: an illegal instruction was encountered (73) in magma_sgetri_gpu at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/src/sgetri_gpu.cpp:164
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:944
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:945
CUDA runtime error: an illegal instruction was encountered (73) in magma_queue_destroy_internal at /opt/conda/conda-bld/magma-cuda100_1564975479425/work/interface_cuda/interface.cpp:946
THCudaCheck FAIL file=/pytorch/aten/src/THC/THCCachingHostAllocator.cpp line=296 error=73 : an illegal instruction was encountered
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
terminate called after throwing an instance of 'std::runtime_error'
  what():  NCCL error in: /pytorch/torch/lib/c10d/../c10d/NCCLUtils.hpp:29, unhandled cuda error
Traceback (most recent call last):
  File "train.py", line 191, in <module>
    main()
  File "train.py", line 34, in main
    mp.spawn(train_and_eval, nprocs=n_gpus, args=(n_gpus, hps,))
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/bt/dev/2nd/glow-tts/train.py", line 93, in train_and_eval
    train(rank, epoch, hps, generator, optimizer_g, train_loader, None, None)
  File "/home/bt/dev/2nd/glow-tts/train.py", line 117, in train
    scaled_loss.backward()
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/bt/dev/2nd/glow-tts/python/lib/python3.7/site-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: an illegal instruction was encountered

I suspect my CUDA version is out of date, but I wanted to confirm this before upgrading.

What version are you all using?

Thanks!

jaywalnut310 commented 4 years ago

My CUDA version is V10.0.130. I hope this helps.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
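For anyone comparing toolkit versions the same way, the release string can be extracted from the nvcc output with a few lines of Python. This is only a sketch: the helper names parse_nvcc_release and local_nvcc_release are hypothetical, and the regex assumes the "release X.Y, VX.Y.Z" format shown in the output above.

```python
import re
import shutil
import subprocess

# Hypothetical helpers; the pattern assumes the "release X.Y, VX.Y.Z"
# line format seen in the nvcc output quoted in this thread.
_RELEASE_RE = re.compile(r"release\s+(\d+\.\d+),\s+V(\d+\.\d+\.\d+)")


def parse_nvcc_release(text):
    """Return (release, full_version) parsed from nvcc --version output, or None."""
    match = _RELEASE_RE.search(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


def local_nvcc_release():
    """Run nvcc --version if nvcc is on PATH; return None when it is not installed."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(
        ["nvcc", "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


sample = "Cuda compilation tools, release 10.0, V10.0.130"
print(parse_nvcc_release(sample))  # ('10.0', '10.0.130')
```

Note that nvcc reports the locally installed toolkit, which is not necessarily the CUDA runtime a prebuilt PyTorch binary uses.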

echelon commented 4 years ago

Thanks! As it turns out, I'm running the same version of CUDA.

I'm not sure what the problem was, but the backtrace pointed to torch. I was on torch 1.2.0 as advised, but bumping it to 1.3.0 got rid of the errors.
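Since the fix here was a torch upgrade rather than a CUDA upgrade, it can help to report both the torch version and the CUDA version torch was built with when debugging a similar crash. A minimal sketch, assuming nothing beyond the public torch attributes used below; the function name torch_build_info is hypothetical, and it returns placeholders if torch is not installed.

```python
import importlib


def torch_build_info():
    """Collect the installed torch version and the CUDA version it was built with."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        # torch is not installed in this environment; report placeholders.
        return {"torch": None, "built_with_cuda": None, "cuda_available": False}
    return {
        "torch": torch.__version__,                   # e.g. the 1.2.0 / 1.3.0 discussed above
        "built_with_cuda": torch.version.cuda,        # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),  # True if a usable GPU is visible
    }


print(torch_build_info())
```

Including this output alongside the nvcc version in a bug report makes it easier to tell a toolkit mismatch from a problem in a specific torch release.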

This model is awesome. Thanks so much for all your hard work!