matyasbohacek / spoter

Repository accompanying the "Sign Pose-based Transformer for Word-level Sign Language Recognition" paper
https://spoter.signlanguagerecognition.com
Apache License 2.0

Thanks for your work! Please comment: another error is reported when training. #2

Open showfaker66 opened 2 years ago

showfaker66 commented 2 years ago

RuntimeError: CUDA error: device-side assert triggered.

    for i, data in enumerate(dataloader):
        inputs, labels = data
        inputs, labels = Variable(inputs), Variable(labels) - 1

        inputs = inputs.squeeze(0).to(device)
        labels = labels.to(device, dtype=torch.long)

        optimizer.zero_grad()
        outputs = model(inputs).expand(1, -1, -1)

        loss = criterion(outputs[0], labels[0])

matyasbohacek commented 2 years ago

Thank you for reporting! Could you please provide the full error trace? Thank you. (It is always best to set the CUDA_LAUNCH_BLOCKING=1 flag when running, so that any low-level CUDA errors are raised at the call that caused them.)
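One way to set that flag without touching the shell is to do it at the very top of the training script. This is only a minimal sketch: it assumes the variable is set before torch is imported, so that it is in place before CUDA is initialised.

```python
import os

# Make CUDA kernel launches synchronous so that an error is reported
# at the Python call that caused it. Set this before importing torch.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch (and the training code) only after this point
```

The same effect can be had from the command line, for example `set CUDA_LAUNCH_BLOCKING=1` in a Windows command prompt before running `python train.py`.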

showfaker66 commented 2 years ago

Thank you for your reply! The complete error appears below.

    C:/cb/pytorch_1000000000000/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
    Traceback (most recent call last):
      File "train.py", line 274, in <module>
        train(args)
      File "train.py", line 174, in train
        train_loss, _, _, train_acc = train_epoch(slrt_model, train_loader, cel_criterion, sgd_optimizer, device)
      File "I:\action_recognition\spoter-main-hand-sign\spoter\utils.py", line 25, in train_epoch
        loss.backward()
      File "D:\anaconda\envs\ctpgr\lib\site-packages\torch\tensor.py", line 245, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "D:\anaconda\envs\ctpgr\lib\site-packages\torch\autograd\__init__.py", line 145, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: CUDA error: device-side assert triggered

muhammad-ahmed-ghani commented 2 years ago

@matyasbohacek Hi, have you resolved this error? I am also getting the same error.

RodGal-2020 commented 2 years ago

I'm having the same problem, is there a solution for it?

RodGal-2020 commented 2 years ago

Hey, I have found a solution!

Go to datasets/czech_slr_dataset.py, and around line 105, find the following:

label = torch.Tensor([self.labels[idx] - 1])

That -1 is the cause of our problems: in WLASL100 the labels already run from 0 to 99, so when the class CzechSLRDataset subtracts 1, a sample with label 0 comes out as something like tensor([[-1]]), and there is no class labelled -1. This explains the CUDA error and the failed assertion t >= 0 && t < n_classes.
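To make the range problem concrete, here is a small plain-Python sketch of the check that the CUDA assertion performs. The helper name `find_invalid_labels` is hypothetical and not part of the repository; it only illustrates the condition `0 <= t < n_classes`.

```python
def find_invalid_labels(labels, num_classes):
    """Return the labels that fall outside the valid range [0, num_classes)."""
    return [t for t in labels if not (0 <= t < num_classes)]


num_classes = 100
raw_labels = list(range(num_classes))   # WLASL100-style labels: 0 .. 99
shifted = [t - 1 for t in raw_labels]   # what the extra "- 1" produces

print(find_invalid_labels(raw_labels, num_classes))  # → []
print(find_invalid_labels(shifted, num_classes))     # → [-1]
```

Running a check like this over the dataset labels on the CPU gives a readable result instead of a device-side assert.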

Taking that into account, the following fix worked for me:

label = torch.Tensor([self.labels[idx]]) # Just drop the "-1"
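If the same code has to handle both 0-based and 1-based label files, a possible alternative is to derive the offset from the data instead of hard-coding it. This is only a sketch: `label_offset` and `to_zero_based` are hypothetical helpers, and it assumes the labels are contiguous integers and that the lowest class actually appears in the data.

```python
def label_offset(all_labels):
    """Offset to subtract so that the smallest label becomes 0 (assumption:
    the lowest class is present in all_labels)."""
    return min(all_labels)


def to_zero_based(label, offset):
    """Shift a single label into the 0-based range expected by the loss."""
    return label - offset


one_based = list(range(1, 101))   # labels 1 .. 100
zero_based = list(range(100))     # labels 0 .. 99

for labels in (one_based, zero_based):
    offset = label_offset(labels)
    shifted = [to_zero_based(t, offset) for t in labels]
    print(offset, min(shifted), max(shifted))
```

With this approach the offset would be computed once when the dataset is loaded and then applied to each label.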

Hope this helps! :D