happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License
419 stars 77 forks

Training on Own Dataset - on short videos #56

Closed rafiko1 closed 1 year ago

rafiko1 commented 1 year ago

Hi,

I'd like to train on shorter videos of at most 15 seconds, where an action can last up to 1 second. When using settings very similar to thumos14.json, I get the following error:

  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 544, in label_points_single_video
    reg_targets = reg_targets[range(num_pts), min_len_inds]
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'

For the i3d feature extraction I also used:

feat_stride: 4,
num_frames: 16

I am using RGB only, so I changed the input dim to 1024. The rest of the model config I kept the same.

Can you help me resolve this error? Will ActionFormer also work well on shorter videos, and what are the recommended settings in that case?

Thanks, also for the great code/paper!

tzzcl commented 1 year ago

Can you please first re-run the code with CUDA_LAUNCH_BLOCKING=1? This will pinpoint the line of code that actually triggers the error.
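For reference, the flag can either be exported in the shell (`CUDA_LAUNCH_BLOCKING=1 python ./train.py ...`) or set in Python, as long as it happens before any CUDA work. A minimal sketch:

```python
import os

# Must be set before the first CUDA kernel launch (i.e. before any tensor
# touches the GPU); otherwise PyTorch keeps reporting CUDA errors
# asynchronously and the traceback points at the wrong call site.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```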

rafiko1 commented 1 year ago

This is the error I am getting:

Start training model LocPointTransformer ...

[Train]: Epoch 0 started
/opt/conda/conda-bld/pytorch_1656352464346/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "./train.py", line 187, in <module>
    main(args)
  File "./train.py", line 138, in main
    print_freq=args.print_freq
  File "/root/actionformer_release/libs/utils/train_utils.py", line 277, in train_one_epoch
    losses = model(video_list)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 375, in forward
    points, gt_segments, gt_labels)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 448, in label_points
    concat_points, gt_segment, gt_label
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/root/actionformer_release/libs/modeling/meta_archs.py", line 536, in label_points_single_video
    gt_label, self.num_classes
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1656352464346/work/c10/cuda/CUDACachingAllocator.cpp:1387 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aeec44477 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1d4a3 (0x7f8b1c39f4a3 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x237 (0x7f8b1c3a5417 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x46e3e8 (0x7f8b2e9483e8 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f8aeec27d95 in /opt/conda/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #5: <unknown function> + 0x35f9c5 (0x7f8b2e8399c5 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x6b0130 (0x7f8b2eb8a130 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x308 (0x7f8b2eb8a538 in /opt/conda/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #24: __libc_start_main + 0xe7 (0x7f8b6179fc87 in /lib/x86_64-linux-gnu/libc.so.6)
tzzcl commented 1 year ago

I'm not sure whether you re-ran ActionFormer with the environment flag CUDA_LAUNCH_BLOCKING=1. If so, I think the problem lies in the loss generation part: you should check that the ground-truth class labels do not exceed the pre-defined num_classes.

rafiko1 commented 1 year ago

You are right. The maximum ground-truth class label must be num_classes - 1. This solved it, thank you!

Related to what I asked about shorter videos: are there recommended settings for my case, i.e. 15-second videos with actions up to 1 second? Should I modify e.g. max_seq_len, backbone_arch, regression_range, n_mha_win_size?
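For context, here is a rough sizing sketch for my clips, assuming a 30 fps source video (my actual frame rate may differ) and the I3D settings mentioned above:

```python
# Hypothetical back-of-envelope sizing for 15 s clips, assuming 30 fps
# source video and the feature settings used above (feat_stride=4).
fps = 30
clip_seconds = 15
feat_stride = 4

# Number of feature vectors per clip after striding.
num_features = (fps * clip_seconds) // feat_stride

# max_seq_len is typically padded up to a power of two that covers
# the longest feature sequence in the dataset.
max_seq_len = 1
while max_seq_len < num_features:
    max_seq_len *= 2

print(num_features, max_seq_len)  # 112 128
```

So a max_seq_len far below the THUMOS default would already cover my clips.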

happyharrycn commented 1 year ago

We have not tried ActionFormer on short videos. And yes, you will need to modify those model parameters to maximize performance. Your setting is somewhat similar to ours on ActivityNet, so you can refer to this file when selecting the model parameters. Here are some of my comments.

rafiko1 commented 1 year ago

Thank you for your help. Issue resolved.