open-mmlab / mmaction2

OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
https://mmaction2.readthedocs.io
Apache License 2.0

Training error for 2s-agcn #1450

Closed nyanmn closed 2 years ago

nyanmn commented 2 years ago

I am training 2s-AGCN. The raw skeleton data were downloaded from here and converted to mmaction2 format using gen_ntu_rgbd_raw.py, so I have two folders, xsub and xview, after conversion.

Then the following command is used to train:

python tools/train.py configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py --work-dir work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d --validate --seed 0 --deterministic

The full error output is below. What could be wrong?

/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 154, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 97, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f7c8eecd8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f7c8f11f982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7c8eeb8b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f7ccc207b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f7ccc207c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f7cf6df93d5 in /lib64/libc.so.6)

Aborted (core dumped)

kennymckormick commented 2 years ago

@gengenkai, plz check this issue

gengenkai commented 2 years ago

terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered

Please use 'CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py --work-dir work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d --validate --seed 0 --deterministic' to localize the error more precisely. Usually, this error indicates that an index is out of bounds (for a classification loss, a label outside the valid class range).
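
For example, here is a minimal sketch (an illustration only, not code from this repo) of what the `t >= 0 && t < n_classes` assertion means: `t` is a target label, and cross_entropy requires every label to lie in [0, num_classes). Running the same loss on CPU turns the device-side assert into a readable Python exception:

```python
# Minimal illustration of the `t >= 0 && t < n_classes` assertion:
# cross_entropy fails when a target label lies outside [0, num_classes).
# On CPU the failure is a readable Python exception instead of a CUDA
# device-side assert, which makes the bad label easy to spot.
import torch
import torch.nn.functional as F

num_classes = 60                         # NTU RGB+D 60
cls_score = torch.randn(4, num_classes)  # fake logits, batch of 4
labels = torch.tensor([3, 17, 59, 60])   # 60 is out of range for 60 classes

try:
    F.cross_entropy(cls_score, labels)
except Exception as e:  # IndexError / RuntimeError depending on the PyTorch version
    print(type(e).__name__, e)
```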

nyanmn commented 2 years ago

I changed the command to

CUDA_LAUNCH_BLOCKING=1 python tools/train.py configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py --work-dir work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d --validate --seed 0 --deterministic

The errors are

2022-02-17 15:16:24,803 - mmaction - INFO - workflow: [('train', 1)], max: 80 epochs
2022-02-17 15:16:24,803 - mmaction - INFO - Checkpoints will be saved to /home/sysadmin/Nyan/mmaction2/work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d by HardDiskBackend.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=115 error=710 : device-side assert triggered
Traceback (most recent call last):
  File "tools/train.py", line 205, in <module>
    main()
  File "tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/data_parallel.py", line 75, in train_step
    return self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
    losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
    loss = self.cls_head.loss(output, gt_labels)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 38, in forward
    ret = self._forward(*args, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/cross_entropy_loss.py", line 81, in _forward
    loss_cls = F.cross_entropy(cls_score, label, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/functional.py", line 2468, in cross_entropy
    return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/functional.py", line 2264, in nll_loss
    ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (710) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:115
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f69084548b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f69086a6982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f690843fb7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f694578eb7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f694578ec26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f69703803d5 in /lib64/libc.so.6)

Aborted (core dumped)

kennymckormick commented 2 years ago

@gengenkai any progress?

gengenkai commented 2 years ago

(quoting the command and full error log from the previous comment)

Hi, I have tried this config and the training process did not hit this bug. Maybe you could check the labels of your data and make sure every label is less than the total number of classes in your data.
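
For example, a quick check like the sketch below. It assumes the converted xsub/xview folders contain train_label.pkl and val_label.pkl files holding a (sample_names, labels) pair, as in the original 2s-AGCN data generation; the paths and format here are assumptions, so adjust them to whatever gen_ntu_rgbd_raw.py actually produced for you:

```python
# Hypothetical sanity check for the converted NTU-60 labels.
# Assumption: each split folder holds {train,val}_label.pkl storing a
# (sample_names, labels) pair; adjust paths/format to your own output.
import pickle

NUM_CLASSES = 60  # NTU RGB+D 60

for split in ('train', 'val'):
    path = f'data/ntu/xsub/{split}_label.pkl'  # hypothetical location
    with open(path, 'rb') as f:
        sample_names, labels = pickle.load(f)
    bad = [(n, lb) for n, lb in zip(sample_names, labels)
           if not 0 <= lb < NUM_CLASSES]
    print(f'{split}: {len(labels)} samples, '
          f'labels in [{min(labels)}, {max(labels)}], '
          f'{len(bad)} out of range')
```

If any label falls outside [0, 59] (for example, labels stored as 1-60 instead of 0-59), that would explain the assert.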

gengenkai commented 2 years ago

@gengenkai any progress?

I have tested it; no errors.

nyanmn commented 2 years ago

Weird! I keep getting the same errors. This time I tested with distributed training and got the following errors.

2022-02-19 13:44:14,937 - mmaction - INFO - workflow: [('train', 1)], max: 80 epochs
2022-02-19 13:44:14,937 - mmaction - INFO - Checkpoints will be saved to /home/sysadmin/Nyan/mmaction2/work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d by HardDiskBackend.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
Traceback (most recent call last):
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
        output = self.module.train_step(*inputs[0], **kwargs[0])output = self.module.train_step(*inputs[0], **kwargs[0])
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step

  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
    losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
        losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
        loss = self.cls_head.loss(output, gt_labels)
    loss = self.cls_head.loss(output, gt_labels)return self.forward_train(keypoint, label, **kwargs)  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss

  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
    loss = self.cls_head.loss(output, gt_labels)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
    losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
    loss = self.cls_head.loss(output, gt_labels)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)    loss_cls = self.loss_cls(cls_score, labels, **kwargs)

  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
    ret *= self.loss_weight
RuntimeError    : CUDA error: device-side assert triggered
        ret *= self.loss_weightret *= self.loss_weightret *= self.loss_weight

RuntimeError
RuntimeError: CUDA error: device-side assert triggered: CUDA error: device-side assert triggered
RuntimeError
: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
    losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
    loss = self.cls_head.loss(output, gt_labels)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
    ret *= self.loss_weight
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 154, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 97, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2f1dfd68b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2f1e228982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2f1dfc1b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f2f5b310b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f2f5b310c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x5624dd9737da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x5624dd8edfa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x5624dd8ec8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x5624dd8ec8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x5624dd8ec2d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x5624dd8ecd68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x5624dd91d327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x5624dd929e59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x5624dd99ed0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x5624dda13304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x5624dda24960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x5624dda24ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7f2f85f023d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x5624dd9c9555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7fc22fd528b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7fc22ffa4982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7fc22fd3db7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7fc26d08cb7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7fc26d08cc26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x55bb791577da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x55bb790d1fa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x55bb790d08c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x55bb790d08c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x55bb790d02d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x55bb790d0d68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x55bb790d0d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x55bb79101327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x55bb7910de59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x55bb79182d0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x55bb791f7304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x55bb79208960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x55bb79208ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7fc297c7e3d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x55bb791ad555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f652ef1e8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f652f170982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f652ef09b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f656c258b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f656c258c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x56253327f7da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x5625331f9fa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x5625331f88c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x5625331f88c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x5625331f82d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x5625331f8d68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x5625331f8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x562533229327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x562533235e59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x5625332aad0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x56253331f304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x562533330960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x562533330ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7f6596e4a3d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x5625332d5555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f716a00d8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f716a25f982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f7169ff8b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f71a7347b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f71a7347c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x556ac77bd7da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x556ac7737fa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x556ac77368c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x556ac77368c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x556ac77362d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x556ac7736d68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x556ac7736d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x556ac7767327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x556ac7773e59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x556ac77e8d0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x556ac785d304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x556ac786e960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x556ac786eccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7f71d1f393d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x556ac7813555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f611a0768b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f611a2c8982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f611a061b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f61573b0b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f61573b0c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x557f69ef57da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x557f69e6ffa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x557f69e6e8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x557f69e6e8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x557f69e6e2d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x557f69e6ed68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x557f69e6ed7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x557f69e9f327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x557f69eabe59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x557f69f20d0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x557f69f95304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x557f69fa6960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x557f69fa6ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7f6181fa23d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x557f69f4b555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f52f3b1b8b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f52f3d6d982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f52f3b06b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f5330e55b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f5330e55c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x55e644b6f7da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfa2d8 (0x55e644ae82d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfad68 (0x55e644ae8d68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x55e644ae8d7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0x12b327 (0x55e644b19327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: PyDict_SetItemString + 0x89 (0x55e644b25e59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyImport_Cleanup + 0xab (0x55e644b9ad0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: Py_FinalizeEx + 0x64 (0x55e644c0f304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: <unknown function> + 0x232960 (0x55e644c20960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: _Py_UnixMain + 0x3c (0x55e644c20ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: __libc_start_main + 0xf5 (0x7f535ba473d5 in /lib64/libc.so.6)
frame #23: <unknown function> + 0x1d7555 (0x55e644bc5555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

Traceback (most recent call last):
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/sysadmin/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=7', 'configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py', '--launcher', 'pytorch', '--validate', '--seed', '0', '--deterministic']' died with <Signals.SIGABRT: 6>.
(open-mmlab) [sysadmin@traininglab mmaction2]$

gengenkai commented 2 years ago

Wired! I have keep on getting same errors. This time tested has the following errors.

2022-02-19 13:44:14,937 - mmaction - INFO - workflow: [('train', 1)], max: 80 epochs
2022-02-19 13:44:14,937 - mmaction - INFO - Checkpoints will be saved to /home/sysadmin/Nyan/mmaction2/work_dirs/2sagcn_80e_ntu60_xsub_keypoint_3d by HardDiskBackend.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [11,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [4,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [1,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1603729006826/work/aten/src/THCUNN/ClassNLLCriterion.cu:108: cunn_ClassNLLCriterion_updateOutput_kernel: block: [0,0,0], thread: [5,0,0] Assertion `t >= 0 && t < n_classes` failed.
[Interleaved, near-identical Python tracebacks from the other distributed training ranks omitted; two representative tracebacks follow.]
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 152, in train_step
    losses = self(skeletons, label, return_loss=True)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 106, in forward
    return self.forward_train(keypoint, label, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/skeletongcn.py", line 18, in forward_train
    loss = self.cls_head.loss(output, gt_labels)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/heads/base.py", line 102, in loss
    loss_cls = self.loss_cls(cls_score, labels, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/losses/base.py", line 44, in forward
    ret *= self.loss_weight
RuntimeError: CUDA error: device-side assert triggered
Traceback (most recent call last):
  File "./tools/train.py", line 205, in <module>
    main()
  File "./tools/train.py", line 201, in main
    meta=meta)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/apis/train.py", line 204, in train_model
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs, **runner_kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 127, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 50, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 30, in run_iter
    **kwargs)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 52, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 154, in train_step
    loss, log_vars = self._parse_losses(losses)
  File "/home/sysadmin/Nyan/mmaction2/mmaction/models/skeleton_gcn/base.py", line 97, in _parse_losses
    log_vars[loss_name] = loss_value.item()
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1603729006826/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f2f1dfd68b2 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f2f1e228982 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f2f1dfc1b7d in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x5fbb7a (0x7f2f5b310b7a in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5fbc26 (0x7f2f5b310c26 in /home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x1817da (0x5624dd9737da in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #6: <unknown function> + 0xfbfa9 (0x5624dd8edfa9 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #7: <unknown function> + 0xfa8c8 (0x5624dd8ec8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #8: <unknown function> + 0xfa8c8 (0x5624dd8ec8c8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #9: <unknown function> + 0xfa2d8 (0x5624dd8ec2d8 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #10: <unknown function> + 0xfad68 (0x5624dd8ecd68 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #11: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #12: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #13: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #14: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #15: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #16: <unknown function> + 0xfad7c (0x5624dd8ecd7c in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #17: <unknown function> + 0x12b327 (0x5624dd91d327 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #18: PyDict_SetItemString + 0x89 (0x5624dd929e59 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #19: PyImport_Cleanup + 0xab (0x5624dd99ed0b in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #20: Py_FinalizeEx + 0x64 (0x5624dda13304 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #21: <unknown function> + 0x232960 (0x5624dda24960 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #22: _Py_UnixMain + 0x3c (0x5624dda24ccc in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)
frame #23: __libc_start_main + 0xf5 (0x7f2f85f023d5 in /lib64/libc.so.6)
frame #24: <unknown function> + 0x1d7555 (0x5624dd9c9555 in /home/sysadmin/anaconda3/envs/open-mmlab/bin/python)

[Identical c10::Error stack traces from the remaining ranks omitted.]

Traceback (most recent call last):
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/sysadmin/anaconda3/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/sysadmin/anaconda3/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=7', 'configs/skeleton/2s-agcn/2sagcn_80e_ntu60_xsub_keypoint_3d.py', '--launcher', 'pytorch', '--validate', '--seed', '0', '--deterministic']' died with <Signals.SIGABRT: 6>.
(open-mmlab) [sysadmin@traininglab mmaction2]$

If you are using ntu60 xsub for training, please first check the 'label' field of your data (every value should be less than 60).
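For example, a quick way to do that check from Python (a minimal sketch, not mmaction2 code; it assumes the generated train.pkl / val.pkl files are lists of annotation dicts that each carry a 'label' field, and the paths need to be adjusted to your --out-folder):

import pickle

# Sketch: verify every 'label' in the generated pkl files is a valid
# ntu60 class index, i.e. 0 <= label < 60. The paths and the
# list-of-dicts layout are assumptions about the gen_ntu_rgbd_raw.py
# output; adjust them if your files are structured differently.
for split in ('xsub', 'xview'):
    for part in ('train.pkl', 'val.pkl'):
        path = f'{split}/{part}'
        with open(path, 'rb') as f:
            annos = pickle.load(f)
        labels = [anno['label'] for anno in annos]
        print(path, len(labels), 'samples, label range:', min(labels), '-', max(labels))
        # Any label >= 60 (or < 0) is what triggers the
        # "Assertion `t >= 0 && t < n_classes` failed" error in the log above.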

nyanmn commented 2 years ago

In this folder, there is only a label file for ntu120 (label_map_ntu120.txt). In the configuration file, there is no path to a label file.

nyanmn commented 2 years ago

I used this config file with num_classes=60. Where should I check that the 'label' values in the data are less than 60?

gengenkai commented 2 years ago

In this folder, there is only a label file for ntu120 (label_map_ntu120.txt). In the configuration file, there is no path to a label file.

Please refer to our README.md in tools/data/skeleton. Use this command: python gen_ntu_rgbd_raw.py --data-path your_raw_nturgbd60_skeleton_path --ignored-sample-path NTU_RGBD_samples_with_missing_skeletons.txt --out-folder your_nturgbd60_output_path --task ntu60. Then you will get the correct data format for the ntu60 dataset.

nyanmn commented 2 years ago

Yes, I did that. Please see my original post; I reported it at the start as:

I am training 2s-agcn.
Raw skeleton data are downloaded from [here](https://github.com/shahroudy/NTURGB-D).
Converted to mmaction2 format using gen_ntu_rgbd_raw.py.
So I have two folders, xsub and xview, after conversion.

I used that command: python gen_ntu_rgbd_raw.py --data-path your_raw_nturgbd60_skeleton_path --ignored-sample-path NTU_RGBD_samples_with_missing_skeletons.txt --out-folder your_nturgbd60_output_path --task ntu60. I have two folders after the process, and each has train.pkl and val.pkl.

nyanmn commented 2 years ago

I remember something: I used all the files inside

nturgbd_skeletons_s001_to_s017.zip  
nturgbd_skeletons_s018_to_s032.zip

Is ntu60 only for the first one, nturgbd_skeletons_s001_to_s017.zip? Let me do it again.

gengenkai commented 2 years ago

I remember something: I used all the files inside

nturgbd_skeletons_s001_to_s017.zip  
nturgbd_skeletons_s018_to_s032.zip

Is ntu60 only for the first one, nturgbd_skeletons_s001_to_s017.zip? Let me do it again.

Yes. ntu60 uses the data from the first zip, while ntu120 uses both.
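Once you have regenerated the data from nturgbd_skeletons_s001_to_s017.zip only, a quick sanity check along the same lines (same assumed pkl layout as in the sketch above; the path below is a placeholder for your own --out-folder) would be:

import pickle

# Hypothetical path; point it at the train.pkl inside your ntu60 --out-folder.
with open('your_nturgbd60_output_path/xsub/train.pkl', 'rb') as f:
    annos = pickle.load(f)
classes = sorted({anno['label'] for anno in annos})
print('distinct classes:', len(classes), 'min:', classes[0], 'max:', classes[-1])
# For ntu60 this should report 60 distinct classes in the range 0-59;
# labels outside [0, 60) would reproduce the device-side assert.
assert 0 <= classes[0] and classes[-1] < 60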

nyanmn commented 2 years ago

Yeah my bad. I used both. Now it works.

gengenkai commented 2 years ago

Yeah my bad. I used both. Now it works.

Good.