happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)

MIT License


How to extract features of an action after transformer encoder in pyramid architecture? #89

Closed: thanhhff closed this issue 1 year ago

thanhhff commented 1 year ago

Hi! Thanks for your great work. I have a question about how to extract the features of an action after the transformer encoder in the pyramid architecture.

For example, take a video whose I3D features have shape [2048, 2301], with an action spanning frames 300:500. The I3D features of that action would then be [2048, 300:500].

In the pyramid architecture with 6 layers, the feature of the action at layer 1 will be [dim, 300:500]. Since layer 2 halves the temporal length, the feature there will be [dim, 150:250]. Is that correct?
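The index arithmetic in this question can be sketched as follows. This is a minimal illustration, assuming each pyramid layer halves the temporal length of the previous one (so layer l has stride 2^(l-1)) and that a span is widened outward to whole grid cells. The helper name `span_at_level` is hypothetical and not part of the repository.

```python
import math

def span_at_level(start, end, level):
    """Map a half-open span [start, end) on layer 1 to indices on `level`.

    Assumption: every layer downsamples by 2, so layer `level` has
    stride 2 ** (level - 1). The start is floored and the end is ceiled
    so the mapped span always covers the original one.
    """
    stride = 2 ** (level - 1)
    return start // stride, math.ceil(end / stride)

# The action from the question, traced through a 6-layer pyramid.
spans = [span_at_level(300, 500, level) for level in range(1, 7)]
for level, (s, e) in enumerate(spans, start=1):
    print(f"layer {level}: [dim, {s}:{e}]")
```

Under these assumptions layer 2 does give [dim, 150:250]. From layer 4 onward the boundaries no longer divide evenly, so the slice is only approximate.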

Thank you.

tzzcl commented 1 year ago

Regarding your question: with the current regression range limitation plus center sampling, one action will probably lie in only one layer of the pyramid rather than in multiple layers.
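A rough sketch of how a regression range combined with center sampling can restrict an action to one layer is shown below. The strides, per-layer ranges, and sampling radius are illustrative assumptions rather than values read from the repository's configs, and the function `positive_points` is a hypothetical simplification of the label assignment, not the released implementation.

```python
# Assumed per-layer settings (hypothetical, for illustration only).
STRIDES = [1, 2, 4, 8, 16, 32]
REG_RANGES = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)]
CENTER_RADIUS = 1.5  # center-sampling radius, in units of the layer stride

def positive_points(start, end, seq_len):
    """Return {layer: [time points]} that would be assigned to the action.

    A point counts as positive when it (1) lies strictly inside the action,
    (2) lies within CENTER_RADIUS * stride of the action center, and
    (3) has its larger distance to the two boundaries inside the
    regression range of its layer.
    """
    center = 0.5 * (start + end)
    result = {}
    for layer, (stride, (lo, hi)) in enumerate(zip(STRIDES, REG_RANGES), start=1):
        hits = []
        for t in range(0, seq_len, stride):
            left, right = t - start, end - t
            inside = left > 0 and right > 0
            near_center = abs(t - center) <= CENTER_RADIUS * stride
            in_range = lo <= max(left, right) <= hi
            if inside and near_center and in_range:
                hits.append(t)
        if hits:
            result[layer] = hits
    return result

assignment = positive_points(300, 500, 2301)
print(assignment)
```

With these assumed settings, the 300:500 action from the question receives positive points only on layer 6, because every point near its center is at least 100 steps from one boundary.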

thanhhff commented 1 year ago

Thank you for getting back to me. I understand that the model architecture in your Table D states that the input for Layer 2 is the previous layer (Layer 1).

> For your question, one action will probably only lie in a certain layer of the pyramid rather than multiple layers with the current regression range limitation plus center sampling.

Sorry for my misunderstanding. Does this mean that some actions may be assigned only to Layer 2 and not to Layer 1? Also, could you please explain how the regression range for each action is calculated in your code?

Thank you.

tzzcl commented 1 year ago

#84 has an example that explains the regression range and how actions are distributed across the pyramid layers. Please refer to #84 for more details.

thanhhff commented 1 year ago

Thank you for your help.