Closed: rafiko1 closed this issue 1 year ago.
Can you please first run the code with CUDA_LAUNCH_BLOCKING=1? This will pinpoint the actual line of code that triggers the error.
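For reference, a minimal sketch of setting the flag (the environment variable is real; the verification command here is just an illustration, not part of ActionFormer):

```shell
# CUDA_LAUNCH_BLOCKING=1 forces synchronous kernel launches, so a
# device-side assert surfaces at the real Python call site instead of
# at a later, unrelated CUDA call.
export CUDA_LAUNCH_BLOCKING=1

# Confirm the flag is visible to child processes (where PyTorch reads it);
# then launch training as usual, e.g. python ./train.py <config>.
python3 -c 'import os; print(os.environ["CUDA_LAUNCH_BLOCKING"])'
```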
This is the error I am getting:
```
Start training model LocPointTransformer ...
[Train]: Epoch 0 started
/opt/conda/conda-bld/pytorch_1656352464346/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "./train.py", line 187, in <module>
    main(args)
  File "./train.py", line 138, in main
    print_freq=args.print_freq
  File "/root/actionformer_release/libs/utils/train_utils.py", line 277, in train_one_epoch
    losses = model(video_list)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 375, in forward
    points, gt_segments, gt_labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 448, in label_points
    concat_points, gt_segment, gt_label
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 536, in label_points_single_video
    gt_label, self.num_classes
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
  what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352464346/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aeec44477 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d4a3 (0x7f8b1c39f4a3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8b1c3a5417 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46e3e8 (0x7f8b2e9483e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8aeec27d95 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f9c5 (0x7f8b2e8399c5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6b0130 (0x7f8b2eb8a130 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7f8b2eb8a538 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f8b6179fc87 in /lib/x86_64-linux-gnu/libc.so.6)
```
I'm not sure whether you re-ran ActionFormer with the environment flag CUDA_LAUNCH_BLOCKING=1. If so, I think the problem lies in the loss-generation part: you should check that the ground-truth class labels do not exceed the pre-defined num_classes.
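Concretely, every label must lie in [0, num_classes - 1] before the one-hot/scatter step that fires the assert. A minimal sketch of the check (the tensor values here are hypothetical, not from the actual dataset):

```python
import torch
import torch.nn.functional as F

num_classes = 20
# Hypothetical ground-truth labels; valid values are 0 .. num_classes - 1.
gt_label = torch.tensor([0, 5, 19])

# Guard before one-hot encoding: an out-of-range label is exactly what
# triggers the "index out of bounds" device-side assert in the traceback.
assert gt_label.min() >= 0 and gt_label.max() < num_classes, \
    "ground-truth labels must be in [0, num_classes - 1]"

one_hot = F.one_hot(gt_label, num_classes)  # shape: (3, 20)
```

On CPU the same out-of-range label raises a plain RuntimeError with a readable message, which is another easy way to localize the bug.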
You are right. It seems the maximum ground-truth class should be num_classes - 1. This solved it, thank you!
Related to what I asked about shorter videos: are there recommended settings for my case, i.e., videos 15 seconds long with actions lasting up to 1 second? Should I modify, e.g., max_seq_len, backbone_arch, regression_range, or n_mha_win_size?
We have not tried ActionFormer on short videos, and yes, you will need to modify those model parameters to maximize performance. Your setting is somewhat similar to ours on ActivityNet; you can refer to this file when selecting the model parameters. Here are some of my comments.
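To make this concrete, here is a hypothetical sketch of the parameters named in the question (max_seq_len, backbone_arch, regression_range, n_mha_win_size), in the spirit of the ActivityNet config; every value below is only an illustrative guess to show which knobs exist, not a validated setting:

```python
# Illustrative model parameters for ~15 s videos with <= 1 s actions.
# All values are placeholders; tune them on a validation set.
model_cfg = {
    "max_seq_len": 192,          # shorter clips allow a shorter padded length
    "backbone_arch": (2, 2, 4),  # last entry controls the number of extra pyramid levels
    "regression_range": [        # per-level ranges, scaled down for short actions
        (0, 2), (2, 4), (4, 8), (8, 16), (16, 10000),
    ],
    "n_mha_win_size": 9,         # smaller local attention window
}

# Assumed consistency check: one regression range per pyramid level,
# where the level count is taken here as backbone_arch[-1] + 1.
assert len(model_cfg["regression_range"]) == model_cfg["backbone_arch"][-1] + 1
```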
Thank you for your help. Issue resolved.
Hi,
I'd like to train on shorter videos of at most 15 seconds, where an action can last up to 1 second. When using settings very similar to thumos14.json, I get the following error:
For the i3d feature extraction I also used:
I am using RGB only, so I changed the input dim to 1024; I kept the rest of the model config the same.
Can you help me solve the error? Also, will ActionFormer work well for shorter videos, and what are the recommended settings in that case?
Thanks, also for the great code/paper!