happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

Question about the choice of regression_range. #69

Closed QX-N closed 1 year ago

QX-N commented 1 year ago

I notice that you use a separate regression range for each level of the FPN features. However, I think the high-level FPN features may not carry enough information for long-range regression, which could lead to low performance on long-range videos. I wonder if you have any insight into this issue. I would appreciate it very much if you could answer.

tzzcl commented 1 year ago

I do not fully understand your question. The regression range of each FPN level is divided by the stride of that level. Combined with local self-attention and convolution-based downsampling, high-level FPN features have larger receptive fields than low-level FPN features. In fact, ActionFormer performs better than previous methods, especially on long actions.
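A minimal sketch of what dividing the regression range by the stride means. The strides and ranges below are hypothetical values chosen for illustration and may differ from the repository's configs; `assign_level` and `normalized_range` are made-up helper names, and the half-open interval check is an assumption rather than the repository's exact rule.

```python
# Hypothetical per-level settings (illustrative only; check the repo configs).
STRIDES = [1, 2, 4, 8, 16, 32]
REGRESSION_RANGES = [(0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, 10000)]


def assign_level(max_distance):
    """Pick the FPN level whose range contains `max_distance`.

    `max_distance` is the distance from a time step to the farther action
    boundary, measured in input feature-grid units.
    """
    for level, (low, high) in enumerate(REGRESSION_RANGES):
        if low <= max_distance < high:
            return level
    return None


def normalized_range(level):
    """Express a level's regression range in units of that level's stride."""
    low, high = REGRESSION_RANGES[level]
    stride = STRIDES[level]
    return low / stride, high / stride


if __name__ == "__main__":
    for level in range(len(STRIDES)):
        print(level, REGRESSION_RANGES[level], normalized_range(level))
```

Under these assumed values, the intermediate levels all regress offsets of roughly 2 to 4 steps on their own grid, so a high-level feature handles a long action with targets of about the same size as a low-level feature handling a short one.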

QX-N commented 1 year ago

Thanks for your reply! What I mean is that after local self-attention and downsampling, the high-level FPN features have a larger receptive field, but at the same time they have lost some fine-grained information that is needed for accurate localization.

tzzcl commented 1 year ago

Hi, I still do not fully understand your point. Since we use a different regression range for each layer, low-level features have small regression ranges (to handle short actions), while high-level features have larger regression ranges (to handle long actions). Because AP is evaluated at tIoU thresholds, and tIoU is a relative (percentage-based) measure, we can still make relatively accurate predictions for long actions even though some details are lost.
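The tIoU argument can be checked with a small calculation. This is a sketch with made-up segments, not the repository's evaluation code: the same one-second boundary shift is applied to a short and a long ground-truth action, and `tiou` is a helper written here for illustration.

```python
def tiou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Made-up examples: both predictions are shifted by one second.
short_gt, short_pred = (10.0, 12.0), (11.0, 13.0)   # 2-second action
long_gt, long_pred = (10.0, 70.0), (11.0, 71.0)     # 60-second action

if __name__ == "__main__":
    print("short action tIoU:", tiou(short_pred, short_gt))
    print("long action tIoU:", tiou(long_pred, long_gt))
```

With the same absolute error, the short action falls below a tIoU threshold of 0.5 while the long action stays above 0.95, which is why coarser high-level features can still score well on long actions.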