CUDA error while calculating loss

anouskashrestha commented 2 years ago

I was trying to run train_RAPFT_step1 on the Cityscapes dataset. When I run the code, I get the error "CUDA error: device-side assert triggered". I tried setting CUDA_LAUNCH_BLOCKING=1, and it shows the following error.

prachigarg23 commented 2 years ago

Hi @anouskashrestha, this error means there is a discrepancy between the dimensionality of the outputs and targets tensors. Can you print them and try to debug? Also, please share the training command and the entire error log.

anouskashrestha commented 2 years ago

I printed the shapes of the outputs and targets tensors: the outputs are (6, 20, 512, 1024) and the targets are (6, 512, 1024). I cloned this repository, changed only the dataroot path, and ran the code according to the trainer_OURS.sh script. My training command was:

CUDA_LAUNCH_BLOCKING=1 python3 train_RAPFT_step1.py --savedir Adaptations/RAP_FT_CS1 --num-epochs 150 --batch-size 6 --state "/local/scratch/a/shrest13/MDIL-SS-main/trained_models/erfnet_encoder_pretrained.pth.tar" --num-classes 20 --current_task=0 --dataset='cityscapes'

This is the training code (train_RAPFT_step1.py).

This is the error message displayed:

prachigarg23 commented 1 year ago

Hi @anouskashrestha, I'm sorry for not getting back to you on this sooner. The criterion is defined in the class CrossEntropyLoss2d; it's a torch.nn.NLLLoss2d loss. For a 2D loss, the outputs should have dimensions (N, C, H, W) (reference: https://pytorch.org/docs/stable/generated/torch.nn.functional.nll_loss.html) and the targets should have dimensions (N, H, W), which is why I pass targets[:, 0] to it.
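To make the shape contract concrete, here is a minimal CPU-only sketch. It is not the repository's CrossEntropyLoss2d; it uses torch.nn.NLLLoss directly, small stand-in sizes instead of 6 x 20 x 512 x 1024, and it assumes the data loader returns targets with a singleton channel axis, which is what targets[:, 0] would strip.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in sizes for illustration only.
N, C, H, W = 2, 20, 4, 8

logits = torch.randn(N, C, H, W)             # one score per class per pixel
targets = torch.randint(0, C, (N, 1, H, W))  # assumed layout: (N, 1, H, W), int64 labels

criterion = nn.NLLLoss()                     # accepts (N, C, H, W) inputs with (N, H, W) targets
log_probs = F.log_softmax(logits, dim=1)     # NLLLoss expects log-probabilities along dim 1
loss = criterion(log_probs, targets[:, 0])   # targets[:, 0] drops the channel axis

print(log_probs.shape, targets[:, 0].shape, loss.item())
```

If the targets already arrive as (N, H, W), indexing with [:, 0] would instead drop the height axis, so it is worth printing the shape right before the loss call.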

My guess is that the target indexing doesn't match the outputs indexing. Setting CUDA_LAUNCH_BLOCKING=1 by itself is unlikely to solve the issue. Please check the indexing and dimensions of the outputs and targets. Also, if you have used another dataset with a different number of classes, please make sure the number of classes is updated everywhere. The number of classes should be the same in outputs and targets, and the class indexing should be consistent between them.
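One way to check this before the loss call is a small helper like the sketch below. The name check_targets and its ignore_index parameter are hypothetical and not part of this repository. It moves the labels to the CPU, so a bad value shows up as a readable Python exception rather than a device-side assert.

```python
import torch

def check_targets(outputs, targets, ignore_index=None):
    """Hypothetical helper: raise ValueError if targets cannot index the class axis of outputs."""
    num_classes = outputs.shape[1]
    expected = (outputs.shape[0], *outputs.shape[2:])  # (N, H, W) for (N, C, H, W) outputs
    if tuple(targets.shape) != expected:
        raise ValueError(f"target shape {tuple(targets.shape)} != expected {expected}")

    labels = torch.unique(targets.detach().cpu())      # distinct label values in the batch
    if ignore_index is not None:
        labels = labels[labels != ignore_index]        # skip the value the loss is told to ignore
    bad = labels[(labels < 0) | (labels >= num_classes)]
    if bad.numel() > 0:
        raise ValueError(f"labels outside [0, {num_classes - 1}]: {bad.tolist()}")

# Usage with stand-in tensors; 255 is an assumed example of an out-of-range label.
outputs = torch.randn(6, 20, 8, 16)
targets = torch.randint(0, 20, (6, 8, 16))
check_targets(outputs, targets)        # valid labels, no exception

targets[0, 0, 0] = 255
try:
    check_targets(outputs, targets)
except ValueError as err:
    print(err)
```

Calling it once per batch for the first few iterations is usually enough to tell whether the label maps and --num-classes agree.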

There is also a chance that, if you used your own custom dataset, you forgot to update the weights passed to the criterion (note that this code uses a weighted cross-entropy loss).
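A quick guard for the weight vector could look like the sketch below. build_weighted_criterion is a hypothetical name, the uniform weights are placeholders rather than the values used in this code, and torch.nn.NLLLoss stands in for the repository's criterion.

```python
import torch
import torch.nn as nn

def build_weighted_criterion(weight, num_classes):
    """Hypothetical helper: build a weighted NLL loss only if there is one weight per class."""
    if weight.dim() != 1 or weight.numel() != num_classes:
        raise ValueError(
            f"expected {num_classes} class weights, got tensor of shape {tuple(weight.shape)}"
        )
    return nn.NLLLoss(weight=weight)

num_classes = 20
weight = torch.ones(num_classes)  # placeholder weights, one per class
criterion = build_weighted_criterion(weight, num_classes)
print(type(criterion).__name__, criterion.weight.shape)
```

Running the check at construction time reports a stale weight vector before any batch reaches the GPU.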

I can't understand much from your log. If this issue is still relevant to you, please reopen the issue and share the entire log file.

prachigarg23 / MDIL-SS
