happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

PointGenerator #39

Closed: QingLuogwj closed 2 years ago

QingLuogwj commented 2 years ago

Hello, I want to ask: what is the PointGenerator in the code?

tzzcl commented 2 years ago

In general, the PointGenerator just generates the real time (in seconds) for each point in the feature map at network initialization and buffers it in memory to avoid redundant calculations. If we did not compute it at the beginning, we would need to recompute it every time we run a forward pass of the network.

The source code of PointGenerator is in loc_generator.py.
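For intuition, here is a minimal sketch of the idea (not the actual loc_generator.py code): point locations are computed once per pyramid level and cached as module buffers, so the forward pass only reads them back. The names PointGeneratorSketch, max_seq_len, and fpn_strides are hypothetical:

```python
import torch
from torch import nn

class PointGeneratorSketch(nn.Module):
    """Hypothetical sketch: precompute per-level point times once, cache as buffers."""

    def __init__(self, max_seq_len, fpn_strides):
        super().__init__()
        self.num_levels = len(fpn_strides)
        for lvl, stride in enumerate(fpn_strides):
            # center of each feature cell, in input time steps
            num_pts = max_seq_len // stride
            pts = (torch.arange(num_pts, dtype=torch.float32) + 0.5) * stride
            # buffers move with .to(device) and are never recomputed at forward time
            self.register_buffer(f"points_lvl{lvl}", pts, persistent=False)

    def forward(self):
        # the forward pass only returns the cached points for every pyramid level
        return [getattr(self, f"points_lvl{lvl}") for lvl in range(self.num_levels)]
```

Dividing these cached positions by the video frame rate would then map them to real time in seconds.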

QingLuogwj commented 2 years ago

Thank you for your answer. I would like to ask another question: what is the regression range, e.g. [0, 4), [4, 8), ..., [64, +∞)?

tzzcl commented 2 years ago

Hi, since ActionFormer has a multi-layer feature pyramid, each point in the pyramid represents a time instant. If the point lies inside an action, we can compute the distances from this point to the start and end of the action. The regression range then decides whether those distances fall within the specific range assigned to this point's level: if both the start and end distances are within the range, the point is labeled as a positive example; otherwise it is negative.

You can refer to our draft, the regression head section, for details.
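As a toy illustration of that assignment rule: in FCOS-style assignment, which one-stage detectors of this kind commonly use, the check is typically done on the maximum of the two distances. The function name and arguments below are hypothetical:

```python
def is_positive(point_t, action_start, action_end, reg_range):
    """Toy sketch: decide whether a pyramid point is a positive example
    for one action segment, given this level's regression range [lo, hi)."""
    left = point_t - action_start    # distance to the action start
    right = action_end - point_t     # distance to the action end
    if left < 0 or right < 0:
        return False                 # the point lies outside the action
    lo, hi = reg_range
    # FCOS-style: the larger of the two distances must fall in this level's range
    return lo <= max(left, right) < hi

# e.g. a point at t=10s inside an action [8s, 20s): left=2, right=10
# -> negative for range [0, 4), positive for range [8, 16)
```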

QingLuogwj commented 2 years ago

Thank you very much for your answer, which has benefited me a lot. Please allow me to ask one last question: what is the temporal feature resolution, and why is the feature stride in the paper set to 4?

tzzcl commented 2 years ago

The feature stride means that we extract features with a sliding window that moves by that stride: the stride is 4 on THUMOS14 and 16 on ActivityNet.

For example, I3D takes 16 frames as input and outputs a 1024-D feature (for a single modality such as RGB). Suppose a video has 32 frames in total. If the feature stride is 4, the model will take frames 1-16 and extract a feature, then take frames 5-20 and extract the next feature, and so on. The smaller the feature stride, the denser the feature sequence we get for each video.
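That indexing can be sketched in a few lines; the helper below is hypothetical, with the window size and stride as parameters, and frames 0-indexed here rather than 1-indexed as in the prose:

```python
def sliding_windows(num_frames, window=16, stride=4):
    """List the (start, end) frame spans a clip-level feature extractor
    such as I3D would consume; end is exclusive."""
    return [(s, s + window)
            for s in range(0, num_frames - window + 1, stride)]

# 32 frames, 16-frame window, stride 4 -> one feature per window:
print(sliding_windows(32))
# [(0, 16), (4, 20), (8, 24), (12, 28), (16, 32)]
```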

In this paper, we do not extract features on our own; we directly use the features from CMCS. Thus, the feature stride is 4 on THUMOS14.

QingLuogwj commented 2 years ago

Thank you for your detailed answer. I finally understand.