happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

About n_mha_win_size in eval. #68

Closed: EdenGabriel closed this issue 1 year ago

EdenGabriel commented 1 year ago

Hi, first of all, thank you for your excellent work. I have a question about n_mha_win_size. When I set it to [9,9,9,19,19,19] with arch (2,6,0) for EPIC100, training succeeds but evaluation fails. In ./libs/modeling/blocks.py, inside the function _sliding_chunks_query_key_matmul(), seq_len is not 2304. Why? Looking forward to your reply. Thanks.

happyharrycn commented 1 year ago

n_mha_win_size specifies the local window size used to compute self-attention in the Transformer blocks. During training, an input video is randomly truncated to max_seq_len (e.g., 2304), which should be divisible by the local window size. During inference, a full video is first padded (so that its sequence length is divisible by the local window size) and then fed into the model. Therefore, the input sequence length at inference is determined by the video duration and is unrelated to max_seq_len (2304).
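To make the difference concrete, here is a minimal sketch of the two behaviours described above. It is not the code from this repository: the function names `truncate_for_training` and `padded_len_for_inference` and the `divisor` parameter are hypothetical and stand in for whatever the data loader and model actually use.

```python
import random


def truncate_for_training(num_steps, max_seq_len, rng=random):
    """Pick a random window of at most max_seq_len steps (hypothetical helper).

    Returns (start, end) indices into the feature sequence.
    """
    if num_steps <= max_seq_len:
        # Short videos are kept whole; padding up to max_seq_len happens later.
        return 0, num_steps
    # Random start so that the window [start, start + max_seq_len) fits.
    start = rng.randint(0, num_steps - max_seq_len)
    return start, start + max_seq_len


def padded_len_for_inference(num_steps, divisor):
    """Round num_steps up to the next multiple of divisor (hypothetical helper)."""
    return ((num_steps + divisor - 1) // divisor) * divisor


# Training: the model always sees at most max_seq_len steps.
start, end = truncate_for_training(5000, 2304, random.Random(0))
print(end - start)  # 2304

# Inference: the length follows the video, so seq_len is generally not 2304.
print(padded_len_for_inference(3001, 9))  # 3006
```

This is why a seq_len other than 2304 inside the attention code at evaluation time is expected rather than a bug.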

Your setting (arch=[2, 6, 0]) disables the FPN, so the feature maps preserve the input sequence length. This will likely lead to out-of-memory issues at inference time. If you post the error message here, we can help further.

EdenGabriel commented 1 year ago

Thanks for your patient reply, I understand now. I have a few follow-up questions. ① You said "an input video is randomly truncated to max_seq_len (e.g., 2304)". Where does the "randomly" come from in the code (def preprocessing()?)? ② If I set arch=[2, 2, 5] but set the downsample scale to 1, does that disable the FPN? ③ For EK100, I get better performance when the neck is fpn-identity. Does that mean the FPN features are not fused? Looking forward to your reply. Thank you very much. Have a nice day.

tzzcl commented 1 year ago

For your questions,

1. Please refer to the trunc_feat example in epic_kitchens.py for details. We randomly truncate the input features to max_seq_len.

2/3. ActionFormer only uses multi-layer feature maps, not the FPN structure (extra 1D conv + bottom-up interpolation and sum). We found that FPN does not bring extra performance gain to our model. Moreover, if the downsample scale is set to 1, multi-layer feature maps are disabled, because the output resolution of every feature map will be the same.
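The effect of the downsample scale on the feature-map resolutions can be illustrated with a short sketch. It is a simplified model rather than the repository's code: `level_lengths` is a hypothetical helper that assumes each level divides the previous length by the scale factor.

```python
def level_lengths(seq_len, num_levels, scale_factor):
    """Sequence length of each feature level (hypothetical, simplified model)."""
    lengths = []
    cur = seq_len
    for _ in range(num_levels):
        lengths.append(cur)
        cur //= scale_factor  # each further level is downsampled by scale_factor
    return lengths


# Six levels with a downsample scale of 2: a real multi-scale pyramid.
print(level_lengths(2304, 6, 2))  # [2304, 1152, 576, 288, 144, 72]

# Downsample scale of 1: every level keeps the full input resolution.
print(level_lengths(2304, 6, 1))  # [2304, 2304, 2304, 2304, 2304, 2304]
```

Under this model, a scale of 1 also roughly triples the total number of positions across levels (13824 versus 4536), which is consistent with the memory concern raised earlier in the thread.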

EdenGabriel commented 1 year ago

Wow, thank you for your quick reply. OK, I got it. Thank you again for your excellent work.