Closed makecent closed 5 months ago
The maximum input sequence length is defined on the feature grid, and is thus equal to the maximum number of clips (video features). I can't recall the details in [77] and will look into this later.
@happyharrycn Can you further describe how you guys came across 2304 as the number to set for max_seq_len? Thanks for such amazing work.
Actually, we have an ablation study in Appendix Table B of ActionFormer. You can find that enlarging the max_seq_len will bring a slight performance boost.
So it is only for the training scenario, right? Also, for any given set of features, should the value be the length of the largest feature sequence available, or can it be any value?
Yes, it is just for training, and it can be any value (a smaller max_seq_len may lead to worse results).
I tried changing the value. With a very small max_seq_len, I was not able to train the model because of an assertion in LocPointGenerator (assert feat_len <= buffer_pts.shape[0], "Reached max buffer length for point generator"). I would really appreciate your feedback on this.
For a very small max_seq_len, you should increase the max_buffer_len_factor in the config.
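For anyone hitting the same assertion, here is a rough sketch of why raising max_buffer_len_factor helps. It assumes the point buffer spans max_seq_len * max_buffer_len_factor positions at the finest level and is subsampled by each pyramid level's stride. I inferred that from the parameter name and the assertion message and did not check it against the repository. The function names and the stride list below are hypothetical.

```python
import math


def buffer_lengths(max_seq_len, max_buffer_len_factor, strides):
    """Assumed number of buffered points per pyramid level.

    Assumption: the buffer covers max_seq_len * max_buffer_len_factor
    positions and each level keeps every `stride`-th position.
    """
    total = int(max_seq_len * max_buffer_len_factor)
    return [math.ceil(total / s) for s in strides]


def fits_in_buffer(feat_len, max_seq_len, max_buffer_len_factor, strides):
    """Mimic the check `feat_len <= buffer_pts.shape[0]` at every level.

    feat_len is the sequence length at the finest level; coarser levels
    are assumed to see ceil(feat_len / stride) positions.
    """
    buffers = buffer_lengths(max_seq_len, max_buffer_len_factor, strides)
    return all(math.ceil(feat_len / s) <= b for s, b in zip(strides, buffers))


# Hypothetical pyramid strides, chosen only for illustration.
STRIDES = [1, 2, 4, 8, 16, 32]

# A 1000-step sequence overflows a buffer built from max_seq_len=256 ...
too_small = fits_in_buffer(1000, 256, 1.0, STRIDES)
# ... but fits once the buffer factor is raised to 4 (256 * 4 = 1024).
large_enough = fits_in_buffer(1000, 256, 4.0, STRIDES)
```

Under these assumptions, the factor has to be at least the longest sequence the model will see divided by max_seq_len.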
Closed due to inactivity.
I am curious about the unit of the maximum input sequence length (max_seq_len). From the paper, the unit appears to be a single frame:
... When using a input sequence length of 512 (typo here, should be 576), similar to what was considered in [77] (512), our method only has a minor drop in average mAP (-1.1%) and significantly outperforms [77] ...
But in the code, the unit of max_seq_len seems to be 4 frames, because the features are extracted with stride=4. Therefore, an input feature sequence with a max_seq_len of 2304 should cover 2304 × 4 = 9216 consecutive frames. It seems unfair to directly compare 576 with 512: the 512 used in [77] represents 512 consecutive frames, while the 576 in this work represents 576 features sampled at an interval of 4 frames, so this work has a much longer temporal input.
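To make the arithmetic concrete, here is a small sketch of the frame coverage implied by a given max_seq_len. It assumes each feature corresponds to feat_stride raw frames and ignores the extra frames covered by the clip window of the feature extractor. The function name is mine, not from the code base.

```python
def frames_spanned(num_features, feat_stride):
    """Approximate number of raw frames covered by a feature sequence.

    Assumption: one feature per `feat_stride` frames; the clip window
    length of the feature extractor is ignored.
    """
    return num_features * feat_stride


full_setting = frames_spanned(2304, 4)     # default max_seq_len in this thread
reduced_setting = frames_spanned(576, 4)   # the ablation setting compared with [77]
ratio_vs_512_frames = reduced_setting / 512
```

Under this assumption, the 576-feature setting spans 2304 frames, about 4.5 times the 512 frames attributed to [77] above.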