Closed: QX-N closed this issue 1 year ago
I cannot fully understand your question here. The regression targets at each FPN level are divided by that level's stride. Combined with local self-attention and convolution-based downsampling, high-level FPN features have larger receptive fields than low-level FPN features. In practice, ActionFormer performs better than previous methods, especially on long-range actions.
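To illustrate the stride normalization mentioned above, here is a minimal sketch (the function name and values are illustrative, not the actual repo code): dividing the regression targets by each level's stride puts all FPN levels on a comparable output scale.

```python
# Illustrative sketch (not the actual ActionFormer code): regression
# targets at every FPN level are normalized by that level's stride,
# so each level predicts offsets on a similar numeric scale.
def normalize_targets(left, right, stride):
    """left/right: distances (in frames) from a time step to the
    action's start/end boundary; stride: downsampling factor of the
    FPN level the time step lives on."""
    return left / stride, right / stride

# A short action seen at stride 1 and a 32x longer action seen at
# stride 32 produce the same normalized targets:
print(normalize_targets(8.0, 8.0, 1))      # (8.0, 8.0)
print(normalize_targets(256.0, 256.0, 32)) # (8.0, 8.0)
```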
Thanks for your reply! What I meant is that after local self-attention and downsampling, the high-level FPN features gain a larger receptive field, but at the same time some detailed semantic information is lost, and that information should be helpful for localization.
Hi, I still cannot fully understand your point. Since each layer has its own regression range, low-level features cover small regression ranges (to handle short actions) while high-level features cover larger regression ranges (to handle long actions). Because AP is evaluated at a tIoU threshold, which is a relative (percentage) metric, the prediction for a long action can still be relatively accurate even though some details are lost.
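The per-level regression ranges described above can be sketched as an FCOS-style level assignment (the range values below are illustrative, not necessarily the ones in the repo): a time step is assigned to the FPN level whose range covers its largest offset to an action boundary, so short actions go to low levels and long actions to high levels.

```python
# Hedged sketch of FCOS-style level assignment; the concrete range
# boundaries here are assumptions for illustration only.
REGRESSION_RANGES = [
    (0, 4), (4, 8), (8, 16), (16, 32), (32, 64), (64, float("inf")),
]

def assign_level(left, right, ranges=REGRESSION_RANGES):
    """Pick the FPN level whose regression range contains the largest
    distance from this time step to the action's start/end boundary."""
    m = max(left, right)
    for lvl, (lo, hi) in enumerate(ranges):
        if lo <= m < hi:
            return lvl
    return len(ranges) - 1

print(assign_level(2, 3))    # 0: short action -> lowest level
print(assign_level(50, 40))  # 4: long action -> a high level
```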
I noticed that you use an independent regression range for each level of the FPN features, but I think the high-level FPN features may not retain enough information for long-range regression, which could cause lower performance on long actions. I wonder whether you have any insight on this issue. I would appreciate it very much if you could answer.