n_mha_win_size specifies the local window size used to compute self-attention in the Transformer blocks. During training, an input video is randomly truncated to max_seq_len (e.g., 2304), which should be divisible by the local window size. During inference, a full video is first padded so that its sequence length is divisible by the local window size, and then fed into the model. Therefore, the input sequence length at inference is determined by the video duration and is not related to max_seq_len (2304).
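In case it helps, here is a minimal sketch (not the repository's exact code) of the padding step described above: the video features are padded along the temporal axis so that their length becomes a multiple of the local window size, and a mask tracks which positions are valid. The names `pad_to_window_multiple`, `feats`, and `win_size` are illustrative.

```python
import torch
import torch.nn.functional as F

def pad_to_window_multiple(feats: torch.Tensor, win_size: int):
    """Pad (batch, channels, seq_len) features so that seq_len % win_size == 0."""
    seq_len = feats.shape[-1]
    # round the sequence length up to the nearest multiple of the window size
    padded_len = ((seq_len + win_size - 1) // win_size) * win_size
    if padded_len > seq_len:
        feats = F.pad(feats, (0, padded_len - seq_len), value=0.0)
    # boolean mask marking the original (non-padded) time steps
    mask = torch.zeros(feats.shape[0], 1, padded_len, dtype=torch.bool)
    mask[..., :seq_len] = True
    return feats, mask

# e.g. a 3000-step video with win_size=19 is padded to 3002 (= 19 * 158),
# so the inference-time sequence length depends on the video, not on max_seq_len
feats, mask = pad_to_window_multiple(torch.randn(1, 512, 3000), win_size=19)
print(feats.shape[-1], mask.sum().item())  # 3002 3000
```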
Your setting (arch=[2, 6, 0]) disables the FPN, so the feature maps preserve the input sequence length. This will likely lead to out-of-memory issues at inference time. You might want to post the error message here so we can help further.
Thanks for your patient reply, I got it. But I have some questions about your reply.
① You said "an input video is randomly truncated to max_seq_len (e.g., 2304)"; where does the "randomly" happen in the code (in preprocessing()?)
② Another question: if I set arch=[2, 2, 5] but the downsample scale is set to 1, does that disable the FPN?
③ For EK100, I find that the best performance comes from the setting where the neck is fpn-identity; does that mean the FPN features are not fused?
Looking forward to your reply. Thank you very much. Have a nice day.
For your questions,
2/3. We just use multi-layer feature maps, not the FPN structure (extra 1D convs + bottom-up interpolation and sum) in ActionFormer. We found that the FPN does not bring an extra performance gain to our model. Moreover, if the downsample scale is set to 1, the multi-layer feature maps are effectively disabled (the output resolutions of the different feature maps will be the same).
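To illustrate the last point, here is a small sketch (hypothetical code, not from this repo) showing how a downsample scale of 2 per level yields feature maps of decreasing temporal resolution, while a scale of 1 leaves every level at the input resolution, i.e., no real multi-scale pyramid.

```python
import torch
import torch.nn as nn

def build_pyramid(feats: torch.Tensor, n_levels: int, scale: int):
    """feats: (batch, channels, seq_len); returns one feature map per level."""
    pool = nn.MaxPool1d(kernel_size=scale, stride=scale) if scale > 1 else nn.Identity()
    pyramid = [feats]
    for _ in range(n_levels - 1):
        # each level reuses the previous one, downsampled by `scale`
        pyramid.append(pool(pyramid[-1]))
    return pyramid

x = torch.randn(1, 512, 2304)
print([f.shape[-1] for f in build_pyramid(x, n_levels=6, scale=2)])
# [2304, 1152, 576, 288, 144, 72]  -> distinct resolutions per level
print([f.shape[-1] for f in build_pyramid(x, n_levels=6, scale=1)])
# [2304, 2304, 2304, 2304, 2304, 2304]  -> all levels share one resolution
```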
Wow... thank you for your quick reply. OK, I got it. Thank you again for your excellent work.
Hi, first of all, thank you for your excellent work. I have a question about n_mha_win_size: when I set it to [9, 9, 9, 19, 19, 19] with arch (2, 6, 0) for EPIC100, training succeeds, but evaluation fails. In ./libs/modeling/blocks.py, in the function _sliding_chunks_query_key_matmul(), seq_len is not 2304. Why? Looking forward to your reply. Thanks.