junyanz / pytorch-CycleGAN-and-pix2pix

Image-to-Image Translation in PyTorch

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy) #1008

Open yeonjaej opened 4 years ago

yeonjaej commented 4 years ago

Hi CycleGAN developers,

I'm encountering the error below while training the maps example. Training runs for a while and then crashes. Any suggestions I can try? I appreciate your help in advance!

python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan

(epoch: 1, iters: 400, time: 0.772, data: 0.003) D_A: 0.268 G_A: 0.446 cycle_A: 1.283 idt_A: 0.335 D_B: 0.224 G_B: 0.270 cycle_B: 0.656 idt_B: 0.577
(epoch: 1, iters: 500, time: 0.303, data: 0.002) D_A: 0.233 G_A: 0.391 cycle_A: 2.048 idt_A: 0.318 D_B: 0.238 G_B: 0.284 cycle_B: 0.659 idt_B: 1.037

Traceback (most recent call last):
  File "train.py", line 52, in <module>
    model.optimize_parameters()   # calculate loss functions, get gradients, update network weights
  File "/data/yjwa/torchwork/pytorch-CycleGAN-and-pix2pix/models/cycle_gan_model.py", line 187, in optimize_parameters
    self.backward_G()   # calculate gradients for G_A and G_B
  File "/data/yjwa/torchwork/pytorch-CycleGAN-and-pix2pix/models/cycle_gan_model.py", line 178, in backward_G
    self.loss_G.backward()
  File "/data/yjwa/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/data/yjwa/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy) (gemv at /opt/conda/conda-bld/pytorch_1587428398394/work/aten/src/ATen/cuda/CUDABlas.cpp:318)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x4e (0x7fad9ef93b5e in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: + 0xdb9fa7 (0x7fad9ff79fa7 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::native::(anonymous namespace)::slow_conv_transpose2d_acc_grad_parameters_cuda_template(at::Tensor const&, at::Tensor const&, at::Tensor&, at::Tensor&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, int) + 0xea6 (0x7fada186ce86 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: at::native::slow_conv_transpose2d_backward_cuda(at::Tensor const&, at::Tensor const&, at::Tensor const&, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, c10::ArrayRef, at::Tensor const&, at::Tensor const&, std::array<bool, 3ul>) + 0x323 (0x7fada1871c93 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xe1f64d (0x7fad9ffdf64d in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0xe28007 (0x7fad9ffe8007 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0x29e286e (0x7fadc8b5786e in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: + 0xe23c87 (0x7fadc6f98c87 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: torch::autograd::generated::SlowConvTranspose2DBackward::apply(std::vector<at::Tensor, std::allocator >&&) + 0x516 (0x7fadc87a0c46 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: + 0x2ae8215 (0x7fadc8c5d215 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::evaluate_function(std::shared_ptr&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7fadc8c5a513 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_main(std::shared_ptr const&, bool) + 0x3d2 (0x7fadc8c5b2f2 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::Engine::thread_init(int) + 0x39 (0x7fadc8c53969 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #13: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7fadcbf9a558 in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #14: + 0xc819d (0x7fadcea1119d in /data/yjwa/anaconda3/lib/python3.7/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #15: + 0x7e65 (0x7faded2f3e65 in /lib64/libpthread.so.0)
frame #16: clone + 0x6d (0x7faded01c88d in /lib64/libc.so.6)
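(Side note: CUDA errors are reported asynchronously, so the Python stack trace can point at an op that ran after the one that actually failed. One common way to get a more precise trace is to force synchronous kernel launches; the command below just mirrors the training command above with that environment variable set.)

```
CUDA_LAUNCH_BLOCKING=1 python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan
```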

junyanz commented 4 years ago

Not sure. It might be related to mixed-precision training. But we don't do it by default. This post might be related.
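For reference, mixed precision would normally have to be turned on explicitly. A minimal sketch of what that would look like with torch.cuda.amp (PyTorch 1.6+) is below; `dataloader`, `optimizer`, and `compute_losses` are placeholders, and nothing like this is in train.py by default.

```python
# Minimal sketch of explicit mixed-precision training (not enabled by this repo
# by default). `dataloader`, `optimizer`, and `compute_losses` are placeholders
# used only for illustration.
import torch

scaler = torch.cuda.amp.GradScaler()

for real_A, real_B in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():              # eligible ops run in FP16 inside this block
        loss = compute_losses(real_A, real_B)
    scaler.scale(loss).backward()                # backward pass on the scaled loss
    scaler.step(optimizer)
    scaler.update()
```

Absent an autocast region like this (or an explicit .half() call somewhere), the forward and backward passes should stay in FP32.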

yeonjaej commented 4 years ago

Thank you. I wonder how a mixed-precision calculation could happen, since I did not change anything in the build. Can cuBLAS change precision internally? This issue https://github.com/pytorch/pytorch/issues/37157#issue-605720486 seems related too.

mcarilli commented 4 years ago

Your stack trace is exactly the same as the one I'm investigating in https://github.com/pytorch/pytorch/issues/37157. However, my minimal repro only fails if slow_conv_transpose2d_backward_cuda runs in FP16. If you have a repro that fails with FP32 inputs, that would be interesting to know.
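In case it helps with that check, here is a rough sketch of one way to confirm what dtype actually reaches the transposed convolutions. The attribute names (netG_A, real_A) follow models/cycle_gan_model.py; adjust them if your version differs.

```python
# Sketch: print the dtypes of the generator input and of every ConvTranspose2d
# weight right before the backward pass that crashes. Assumes the CycleGAN
# model attributes `netG_A` and `real_A`; adjust if named differently.
import torch.nn as nn

def report_dtypes(model):
    print('generator input dtype:', model.real_A.dtype)
    for name, module in model.netG_A.named_modules():
        if isinstance(module, nn.ConvTranspose2d):
            print(f'{name}.weight dtype:', module.weight.dtype)

# e.g. call report_dtypes(model) in train.py just before model.optimize_parameters()
```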

yeonjaej commented 4 years ago

Hi. I checked that the inputs are float32. In any case, the issue went away when I set the --gpu_ids 1 option (the default is 0).
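That is, the run that worked was the same command as above, pinned to the second GPU:

```
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 1
```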

Just adding the printout from nvidia-smi here.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   28C    P0    58W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   34C    P0    59W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  Off  | 00000000:83:00.0 Off |                  N/A |
|  0%   30C    P0    54W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  Off  | 00000000:84:00.0 Off |                  N/A |
|  0%   27C    P0    52W / 250W |      0MiB / 11178MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

junyanz commented 4 years ago

Interesting. I am not sure why that happened. --gpu_ids is 0-indexed, and if you set --gpu_ids -1, it will use the CPU.
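For reference, assuming the standard --gpu_ids option in this repo (a comma-separated list of device indices), usage looks like:

```
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 0     # first GPU (default)
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 1     # second GPU
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids 0,1   # two GPUs
python train.py --dataroot ./datasets/maps --name maps_cyclegan --model cycle_gan --gpu_ids -1    # CPU only
```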