happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License
419 stars 77 forks

Training on Own Dataset - on short videos #56

Closed rafiko1 closed 1 year ago

rafiko1 commented 1 year ago

Hi,

I'd like to train on shorter videos of at most 15 seconds, where an action can last up to 1 second. When using settings very similar to thumos14.json, I get the following error:

  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 544, in label_points_single_video
    reg_targets = reg_targets[range(num_pts), min_len_inds]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'

For the i3d feature extraction I also used:

feat_stride: 4,
num_frames: 16

I am using RGB only, so I changed the input dim to 1024. The rest of the model config I kept the same.

Can you help me resolve this error? Will ActionFormer also work well on shorter videos, and what are the recommended settings in that case?

Thanks, also for the great code/paper!

tzzcl commented 1 year ago

Can you please first re-run the code with CUDA_LAUNCH_BLOCKING=1? This will pinpoint the line of code that actually triggers the error.
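For reference, the flag can either be exported in the shell (`CUDA_LAUNCH_BLOCKING=1 python ./train.py ...`) or set in Python, as long as it happens before any CUDA work. A minimal sketch:

```python
import os

# Must be set before the first CUDA kernel launch (i.e. before any tensor
# touches the GPU); otherwise PyTorch keeps reporting CUDA errors
# asynchronously and the traceback points at the wrong call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```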

rafiko1 commented 1 year ago

This is the error I am getting:

Start training model LocPointTransformer ...

[Train]: Epoch 0 started
/opt/conda/conda-bld/pytorch_1656352464346/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "./train.py", line 187, in <module>
    main(args)
  File "./train.py", line 138, in main
    print_freq=args.print_freq
  File "/root/actionformer_release/libs/utils/train_utils.py", line 277, in train_one_epoch
    losses = model(video_list)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 375, in forward
    points, gt_segments, gt_labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 448, in label_points
    concat_points, gt_segment, gt_label
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 536, in label_points_single_video
    gt_label, self.num_classes
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352464346/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aeec44477 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d4a3 (0x7f8b1c39f4a3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8b1c3a5417 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46e3e8 (0x7f8b2e9483e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8aeec27d95 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f9c5 (0x7f8b2e8399c5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6b0130 (0x7f8b2eb8a130 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7f8b2eb8a538 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f8b6179fc87 in /lib/x86_64-linux-gnu/libc.so.6)
tzzcl commented 1 year ago

I'm not sure whether you re-ran ActionFormer with the environment flag CUDA_LAUNCH_BLOCKING=1. If so, I think the problem lies in the loss generation part: you should check that the ground-truth class labels do not exceed the pre-defined num_classes.

rafiko1 commented 1 year ago

You are right. The maximum ground-truth class label must be num_classes - 1. This solved it, thank you!

Related to what I asked about shorter videos: are there recommended settings for my case, i.e. 15-second videos with actions up to 1 second? Should I modify e.g. max_seq_len, backbone_arch, regression_range, n_mha_win_size?
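For context, here is a rough sizing sketch for my clips, assuming a 30 fps source video (my actual frame rate may differ) and the I3D settings mentioned above:

```python
# Hypothetical back-of-envelope sizing for 15 s clips, assuming 30 fps
# source video and the feature settings used above (feat_stride=4).
fps = 30
clip_seconds = 15
feat_stride = 4

# Number of feature vectors per clip after striding.
num_features = (fps * clip_seconds) // feat_stride

# max_seq_len is typically padded up to a power of two that covers
# the longest feature sequence in the dataset.
max_seq_len = 1
while max_seq_len < num_features:
    max_seq_len *= 2

print(num_features, max_seq_len)  # 112 128
```

So a max_seq_len far below the THUMOS default would already cover my clips.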

happyharrycn commented 1 year ago

We have not tried ActionFormer on short videos. And yes, you will need to modify those model parameters to maximize performance. Your setting is somewhat similar to ours on ActivityNet, so you can refer to this file when selecting the model parameters. Here are some of my comments.

rafiko1 commented 1 year ago

Thank you for your help. Issue resolved.