RuntimeError: CUDA error: device-side assert triggered

MaarufB commented 3 years ago

I ran this command to train the ST-GCN model: mmskl configs/recognition/st_gcn/dataset_example/train.yaml

Load configuration information from configs/recognition/st_gcn/dataset_example/train.yaml
INFO:mmcv.runner.runner:Start running, host: ai-pose@aipose-X570-GAMING-X, work_dir: /home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/work_dir/recognition/st_gcn/custom_dataset
INFO:mmcv.runner.runner:workflow: [('train', 5), ('val', 1)], max: 65 epochs
/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "/home/ai-pose/anaconda3/envs/mm-test/bin/mmskl", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/tools/mmskl", line 131, in <module>
    main()
  File "/home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/tools/mmskl", line 121, in main
    call_obj(**cfg.processor_cfg)
  File "/home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/mmskeleton/utils/importer.py", line 24, in call_obj
    return import_obj(type)(**kwargs)
  File "/home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/mmskeleton/processor/recognition.py", line 120, in train
    runner.run(data_loaders, workflow, total_epochs, loss=loss)
  File "/home/ai-pose/anaconda3/envs/mm-test/lib/python3.7/site-packages/mmcv/runner/runner.py", line 359, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/ai-pose/anaconda3/envs/mm-test/lib/python3.7/site-packages/mmcv/runner/runner.py", line 263, in train
    self.model, data_batch, train_mode=True, **kwargs)
  File "/home/ai-pose/Desktop/Ma-aruf/Trials/Trial1/mmskeleton/mmskeleton/processor/recognition.py", line 135, in batch_processor
    log_vars = dict(loss=losses.item())
RuntimeError: CUDA error: device-side assert triggered

zren2 commented 3 years ago

I ran into this issue too. As far as I know, the reason is that "The category_id will be set to -1 if the category annotations miss." (see https://github.com/pytorch/pytorch/issues/1204). The target passed to the criterion must satisfy t >= 0 && t < n_classes. Maybe you can try changing the label -1 to a large number.
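To see which samples violate the condition t >= 0 && t < n_classes before training starts, a quick check along these lines may help. This is only a sketch in plain Python: the function name find_invalid_labels and the sample labels are made up for illustration, and how you collect the labels from your own annotation files is up to you.

```python
def find_invalid_labels(labels, num_class):
    """Return (position, label) pairs whose label is outside [0, num_class)."""
    return [(i, t) for i, t in enumerate(labels) if not 0 <= t < num_class]


# Made-up labels for illustration; -1 mimics a missing category annotation.
labels = [0, 2, -1, 1, 3]
bad = find_invalid_labels(labels, num_class=3)
print(bad)  # [(2, -1), (4, 3)]
```

Any entry reported here would trip the assertion in the loss kernel, so the annotations (or num_class) need fixing first.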

renlle commented 2 years ago

I got this error while following CUSTOM_DATASET.md with my own dataset, because I had not changed the 'num_class: 3' parameter in configs/recognition/st_gcn/dataset_example/train.yaml.

You may also need to change the default 'num_class: 3' in test.yaml to your actual number of classes.
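One way to find out what num_class your labels actually need is to take the largest label and add one. The sketch below assumes 0-based integer labels; required_num_class and the example values are hypothetical and not part of mmskeleton.

```python
def required_num_class(labels):
    """Smallest num_class that accepts every label, assuming 0-based labels."""
    if min(labels) < 0:
        raise ValueError("negative label found; fix the annotations first")
    return max(labels) + 1


labels = [0, 4, 2, 1, 3]  # made-up labels for a 5-class dataset
configured = 3            # the default value in the example config
needed = required_num_class(labels)
if configured < needed:
    print(f"num_class is {configured}, but the labels need at least {needed}")
```

If the message is printed, raise num_class in both train.yaml and test.yaml to at least the reported value.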

zzy0222 commented 1 year ago

The problem was solved by changing the label indices from [1, N] to [0, N-1]. After debugging, I found that the error occurred at line 284 of ./mmskeleton/mmskeleton/processor/recognition.py.

I checked the official documentation and learned that all target indices must lie in the range [0, C-1], where C is the number of classes.
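The shift from [1, N] to [0, N-1] can be sketched like this. The helper to_zero_based is hypothetical and works on a plain list of integer labels; where the labels live in your annotation files depends on your dataset.

```python
def to_zero_based(labels):
    """Shift labels from [1, N] to [0, N-1]; refuse input that is not 1-based."""
    if min(labels) < 1:
        raise ValueError("found a label below 1; input does not look 1-based")
    return [t - 1 for t in labels]


labels = [1, 3, 2, 3]         # made-up 1-based labels with N = 3
print(to_zero_based(labels))  # [0, 2, 1, 2]
```

The guard is there so that labels which are already 0-based are not shifted a second time into -1.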


open-mmlab / mmskeleton

RuntimeError: CUDA error: device-side assert triggered #400