RuntimeError: CUDA error: device-side assert triggered. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. #191
Hello,

I ran into this issue today while trying to train the Inpainting and Joint models on Google Colab with a GPU.

I was able to train the Edge model "successfully" (I can't verify yet whether the training produces correct values) thanks to #188, which included a working fix for the issue I had encountered. However, with that fix in place, training the Inpainting (Model=2), Inpainting-Joint (Model=3), and Joint (Model=4) models fails with the error below. I also tried without the fix (with the vanilla models.py), but that just brings back the error from #188.
Training epoch: 1
144/168 [================>...] - ETA: 4s - epoch: 1 - iter: 18 - psnr: 31.0681 - mae: 0.0096
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
.... [the same assertion repeated for the remaining blocks and threads] ....
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    main(mode=1)
  File "/content/edge-connect/main.py", line 56, in main
    model.train()
  File "/content/edge-connect/src/edge_connect.py", line 145, in train
    outputs, gen_loss, dis_loss, logs = self.inpaint_model.process(images, outputs.detach(), masks)
  File "/content/edge-connect/src/models.py", line 239, in process
    ("l_d2", dis_loss.item()),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
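
As a side note, the CUDA_LAUNCH_BLOCKING=1 suggestion in the error message is worth following here: kernel launches are asynchronous, so the assert is reported at a later API call (which is why the trace points at dis_loss.item()), whereas blocking mode surfaces the call that actually failed. A minimal sketch of one way to enable it, assuming you can add this at the very top of train.py (the placement is my assumption, not from the repo):

    import os
    # Must run before torch touches CUDA so kernel launches become synchronous
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"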
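
For anyone hitting the same assert: Loss.cu:115 is the input-range check inside binary_cross_entropy, which requires every input to lie in [0, 1] and triggers a device-side assert otherwise, e.g. when a discriminator output goes NaN or a sigmoid is missing. A minimal sketch of the failure mode and two common workarounds (the tensors here are illustrative, not taken from models.py):

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"
    bce = nn.BCELoss()
    target = torch.ones(4, device=device)

    bad = torch.tensor([0.2, 1.5, -0.1, 0.7], device=device)  # values outside [0, 1]
    # bce(bad, target)  # -> Assertion `input_val >= zero && input_val <= one` failed

    # Workaround 1: clamp predicted probabilities into the valid range
    loss = bce(bad.clamp(0.0, 1.0), target)

    # Workaround 2: feed raw logits to BCEWithLogitsLoss, which applies the
    # sigmoid internally in a numerically stable way
    logits = torch.tensor([1.2, -3.0, 0.5, 8.0], device=device)
    loss2 = nn.BCEWithLogitsLoss()(logits, target)
    print(loss.item(), loss2.item())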