knazeri / edge-connect

EdgeConnect: Structure Guided Image Inpainting using Edge Prediction, ICCV 2019 https://arxiv.org/abs/1901.00212
http://openaccess.thecvf.com/content_ICCVW_2019/html/AIM/Nazeri_EdgeConnect_Structure_Guided_Image_Inpainting_using_Edge_Prediction_ICCVW_2019_paper.html

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. #191

kaelyavel commented 1 year ago

Hello,

I ran into this issue today while trying to train the Inpainting model and the Joint Model on Google Colab with a GPU.

I was able to train the Edge model "successfully" (I can't yet check whether the training gives correct values) thanks to #188, which included a working fix for the issue I encountered. However, that fix produces the following error when training the Inpainting (Model=2), Inpainting-Joint (Model=3) and Joint (Model=4) models. I also tried without the fix (with the vanilla models.py), but then I get the error from #188 again.

Training epoch: 1
144/168 [================>...] - ETA: 4s - epoch: 1 - iter: 18 - psnr: 31.0681 - mae: 0.0096
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [0,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [1,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [15,0,0], thread: [2,0,0] Assertion `input_val >= zero && input_val <= one` failed.
.... [SAME LINE] ....

/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [30,0,0] Assertion `input_val >= zero && input_val <= one` failed.
/pytorch/aten/src/ATen/native/cuda/Loss.cu:115: operator(): block: [28,0,0], thread: [31,0,0] Assertion `input_val >= zero && input_val <= one` failed.
Traceback (most recent call last):
  File "train.py", line 2, in <module>
    main(mode=1)
  File "/content/edge-connect/main.py", line 56, in main
    model.train()
  File "/content/edge-connect/src/edge_connect.py", line 145, in train
    outputs, gen_loss, dis_loss, logs = self.inpaint_model.process(images, outputs.detach(), masks)
  File "/content/edge-connect/src/models.py", line 239, in process
    ("l_d2", dis_loss.item()),
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
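
For context, the assertion `input_val >= zero && input_val <= one` in Loss.cu comes from PyTorch's binary cross-entropy CUDA kernel, which requires every input value to lie in [0, 1]; a NaN fails the check as well. Below is a minimal sketch of that range check in plain Python. The function name `find_bce_violations` and the sample values are hypothetical and only for illustration, and passing raw logits without a sigmoid is just one possible way to end up outside the range, not a confirmed cause of this issue.

```python
import math


def find_bce_violations(values):
    """Return (index, value) pairs that would fail the BCE range check.

    Mirrors the condition from the log: each input must satisfy
    0 <= value <= 1. NaN is reported too, since it fails both comparisons.
    """
    bad = []
    for i, v in enumerate(values):
        if math.isnan(v) or v < 0.0 or v > 1.0:
            bad.append((i, v))
    return bad


# Hypothetical discriminator outputs: raw logits can fall outside [0, 1].
logits = [-1.3, 0.2, 2.7]
# A sigmoid maps every finite logit into the open interval (0, 1).
probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]

print(find_bce_violations(logits))  # [(0, -1.3), (2, 2.7)]
print(find_bce_violations(probs))   # []
```

With a real tensor, the same check could be run on something like `tensor.detach().flatten().tolist()` just before the loss is computed. As the error message suggests, running `CUDA_LAUNCH_BLOCKING=1 python train.py` should make the traceback point at the call that actually fails, rather than at a later line such as `dis_loss.item()`.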