happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

A question about Temporal Feature Resolution. #134

Closed miaolin968 closed 4 months ago

miaolin968 commented 4 months ago

Hi~ As you described in Appendix A, temporal feature resolution doesn't affect the performance of ActionFormer. I am trying to bridge the gap between different resolutions. How did you obtain the characteristics of different resolutions? I got the features with resolution 8 from ContextLoc. Can you provide the features with resolution 16? By the way, the quality of my self-generated features is very poor compared to the official features. Can you share the hyperparameter settings?

miaolin968 commented 4 months ago

Looking forward to your reply!(・ω< )★

happyharrycn commented 4 months ago

I am not sure I understand your questions. For example, what do you refer to as "characteristics of different resolutions?" The (temporal) resolution is specified during the feature extraction. If you mean a temporal stride of 8 by "resolution 8," features with temporal stride of 16 can be produced by subsampling the features with temporal stride of 8. For extracting video features, we point to this script from the VideoMAEv2 repo.
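A minimal sketch of the subsampling mentioned above, assuming the features are stored as a time-ordered sequence with one vector per window. The helper name `subsample_features` and the toy data are hypothetical and not part of the ActionFormer code base.

```python
def subsample_features(features, factor):
    """Keep every `factor`-th feature vector along the time axis."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return features[::factor]


# Toy stand-in for features extracted with temporal stride 8:
# six time steps, two channels each (real features are much wider).
stride8 = [[float(t), float(t) + 0.5] for t in range(6)]

# Stride 16 is twice stride 8, so keep every second vector.
stride16 = subsample_features(stride8, 2)
```

The same slice works on a NumPy array of shape `(T, C)`, since slicing the first axis keeps whole feature vectors.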

miaolin968 commented 4 months ago

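Perhaps I misunderstood the intent of your paper. In the "temporal feature resolution" part of Appendix A, does the resolution "stride=16" mean that the optical flow maps used for feature extraction are computed every 16 frames?

If so, subsampling does not seem applicable, because for optical flow features, the flow computed every 4 frames differs from the flow computed every 16 frames.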

tzzcl commented 4 months ago

The temporal stride is defined at the raw-frame level. Suppose we have 32 frames and the model needs 16 frames to output one feature vector. The frames look like this:

|0 -- 4 -- 8  -- 12 -- 16 -- 20 -- 24 -- 28 -- 32|

If the stride equals 4, the model takes frames 0-16, 4-20, etc. to generate feature vectors. If the stride equals 8, the model takes frames 0-16, 8-24, 16-32, etc. If the stride equals 16, the model takes frames 0-16, 16-32, etc.

Every window used at stride 8 or 16 is also a window at stride 4, so we can directly downsample the features here.
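The window layout described above can be checked with a short sketch. The function `window_starts` is a hypothetical helper written for illustration. It assumes each window covers `window` consecutive frames starting at the listed index, and that incomplete trailing windows are dropped.

```python
def window_starts(num_frames, window, stride):
    """Start indices of all full windows of `window` frames."""
    return list(range(0, num_frames - window + 1, stride))


# 32 frames, 16 frames per feature vector, as in the example above.
starts4 = window_starts(32, 16, 4)    # windows 0-16, 4-20, ..., 16-32
starts8 = window_starts(32, 16, 8)    # windows 0-16, 8-24, 16-32
starts16 = window_starts(32, 16, 16)  # windows 0-16, 16-32

# Coarser strides reuse a subset of the stride-4 windows, so their
# features can be obtained by subsampling the stride-4 features.
same_as_stride8 = starts4[::2] == starts8
same_as_stride16 = starts4[::4] == starts16
```

Because the windows coincide exactly, the feature at a coarser stride is the same vector the model already produced at the finer stride. No re-extraction is needed.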

miaolin968 commented 4 months ago

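For optical flow, is it computed from 2 consecutive frames? For example, do we obtain 32 optical flow maps from frames 0-32, and then feed them into the I3D network to extract features as you described above?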

miaolin968 commented 4 months ago

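For example, the video test_0000004 has 1012 frames. Do we first compute the corresponding 1012 optical flow frames and then extract I3D features in the way you described?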

tzzcl commented 4 months ago

Yes

miaolin968 commented 4 months ago

Thank you for your patient answer. My problem is solved.