RuntimeError: CUDA error: device-side assert triggered

berisfu commented 5 years ago

I cropped the data from 110 to 96 and the earlier error message disappeared, but a new error occurs:

/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [65,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [66,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [80,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [81,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [83,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/SpatialClassNLLCriterion.cu:99: void cunn_SpatialClassNLLCriterion_updateOutput_kernel(T *, T *, T *, long *, T *, int, int, int, int, int, long) [with T = float, AccumT = float]: block: [0,0,0], thread: [84,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "search_cell_sim.py", line 312, in <module>
    search_network.run()
  File "search_cell_sim.py", line 212, in run
    self.train()
  File "search_cell_sim.py", line 266, in train
    self.train_loss_meter.update(train_loss.item())
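The assertion in the log says that some target value t falls outside the range 0 <= t < n_classes. A minimal sketch of the same check on the CPU is shown here; the function name `out_of_range_labels` and the sample values are hypothetical and are not taken from `search_cell_sim.py`.

```python
from collections import Counter


def out_of_range_labels(labels, n_classes):
    """Count label values that violate 0 <= t < n_classes."""
    bad = Counter()
    for t in labels:
        # Same condition the CUDA kernel asserts on each target value.
        if not (0 <= t < n_classes):
            bad[t] += 1
    return dict(bad)


# Hypothetical flattened mask: 2 classes expected, but 255 appears as a label.
labels = [0, 1, 1, 255, 0, 255]
print(out_of_range_labels(labels, n_classes=2))  # {255: 2}
```

With a PyTorch target tensor, the values could be passed as `target.flatten().tolist()`. Because CUDA kernels run asynchronously, the Python traceback may point at a later line such as `train_loss.item()` rather than at the loss call itself; running with the environment variable `CUDA_LAUNCH_BLOCKING=1` or on the CPU usually gives a more precise location.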

berisfu commented 5 years ago

When I normalize the target tensor and then convert it from double to Long, it runs without error. But if I convert the target (label) tensor to Long without normalizing it first, it triggers the error above.

berisfu commented 5 years ago

I think that after normalizing and then converting to Long, the target contains only 2 classes. If it is converted to Long without normalizing, the label values are much larger than 2, so they fall outside the range the loss expects.
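One way to get labels into the expected range is to map each distinct raw value to a contiguous class index. This is only a sketch under the assumption that the mask stores a binary label as 0 and 255; it is not necessarily the fix used in this thread, and `remap_labels` is a hypothetical helper.

```python
def remap_labels(labels):
    """Map each distinct raw value to a contiguous class index 0..K-1."""
    values = sorted(set(labels))
    index_of = {v: i for i, v in enumerate(values)}
    return [index_of[v] for v in labels], index_of


# Hypothetical binary mask stored as 0/255 (an assumption, not from the thread).
raw = [0, 255, 255, 0, 255]
mapped, index_of = remap_labels(raw)
print(mapped)    # [0, 1, 1, 0, 1]
print(index_of)  # {0: 0, 255: 1}
```

After remapping, every label satisfies 0 <= t < K, where K is the number of distinct raw values, so K must not exceed the number of output channels of the network.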

berisfu commented 5 years ago

I have fixed it.

ghost commented 5 years ago

@berisfu how did you fix it?

tianbaochou / NasUnet, issue #7