happyharrycn / actionformer_release

Code release for ActionFormer (ECCV 2022)
MIT License

Question about such strong I3D feature #60

Closed yangmin666 closed 1 year ago

yangmin666 commented 1 year ago

The I3D features you provide are much stronger than the public I3D features (e.g., those used by TadTR and RTD-Net). Also, your features are 4x downsampled, while the public features are 8x downsampled. How did you generate them? Another question: do you have results for I3D without the flow branch? Thanks!

tzzcl commented 1 year ago

For the feature extraction part, you can refer to the original CMCS repo for more details. For RGB-only results, you can just reduce the feature channels to 1024 (we directly concatenate the RGB/Flow features) and re-run the experiments.

happyharrycn commented 1 year ago

I am not totally sure what you are referring to as stronger I3D features. The features are extracted from the same I3D model pre-trained on Kinetics that was also used by previous works (e.g., CMCS). Yes, the features are extracted every 4 frames, as we described in the paper. If one takes the features every 8 frames and re-trains our model, the performance is roughly the same (see our ablation in Table C in the Appendix).

As Chenlin commented, taking the first half of the feature channels will give you the RGB features.
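Both suggestions in this comment can be sketched as follows. This is a minimal sketch, assuming each video's features are a NumPy array of shape (T, 2048) with time along the first axis and the RGB channels first; the function names and the array layout are illustrative assumptions, not code from the repo. Taking every other feature vector only approximates re-extracting features every 8 frames.

```python
import numpy as np

def rgb_only(feats: np.ndarray) -> np.ndarray:
    """Keep the first half of the channels (assumed to be the RGB stream)."""
    num_channels = feats.shape[1]
    return feats[:, : num_channels // 2]

def every_other_step(feats: np.ndarray) -> np.ndarray:
    """Keep every second time step: stride-4 features become roughly stride-8."""
    return feats[::2]

# Dummy features standing in for one video (100 time steps, 2048 channels).
feats = np.random.rand(100, 2048).astype(np.float32)
print(rgb_only(feats).shape)
print(every_other_step(feats).shape)
```

The model's expected input dimension has to be changed to match whenever channels are dropped.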

happyharrycn commented 1 year ago

Closed due to inactivity. Let us know if you have further questions.

csrhddlam commented 1 year ago

> for the feature extraction part, you can refer to the original CMCS repo for more details. For RGB-only results, you can just reduce the feature channel to 1024 (we directly concatenate the RGB/Flow features) and re-run the experiments.

Following the suggestion, I tried using only the first N channels of the provided I3D features on THUMOS and ran exactly the command provided. Here is the mAP I got: first 512 channels, 55.1; first 1024 channels (i.e., RGB only), 51.7; first 1536 channels, 64.2; first 2048 channels (all of the channels), 66.5.

Does it make sense? The last 1024 dimensions (i.e. flow features) seem to contribute A LOT.
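To make the size of the gap concrete, the differences between the reported numbers can be computed directly. This is only arithmetic on the four mAP values quoted above; the function and variable names are illustrative.

```python
# mAP reported above, keyed by the number of leading channels kept.
reported_map = {512: 55.1, 1024: 51.7, 1536: 64.2, 2048: 66.5}

def marginal_gains(results):
    """Return the mAP change between consecutive channel counts."""
    widths = sorted(results)
    return {
        (prev, cur): round(results[cur] - results[prev], 1)
        for prev, cur in zip(widths, widths[1:])
    }

gains = marginal_gains(reported_map)
# Change from RGB only (1024 channels) to RGB + flow (2048 channels).
flow_gain = round(reported_map[2048] - reported_map[1024], 1)
print(gains)
print(flow_gain)
```

Under these numbers, adding the flow half raises mAP by 14.8 points over the RGB half alone, while going from 512 to 1024 channels lowers it by 3.4 points.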

happyharrycn commented 1 year ago

We did not explore using RGB features alone on THUMOS14, and I won't comment on these numbers. There is a caveat though. When feature channels are reduced, the model size (e.g., embedding / FPN / head dims) often has to be shrunk accordingly (see our config in anet_tsp.yaml) in order to achieve the best performance.

I would suggest taking the first 1024 channels (RGB features) and reducing the model size by half (setting the embedding / FPN / head dims to 256). The training hyperparameters might need to be tweaked if overfitting / underfitting is observed.
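The suggested setting can be sketched as a small config override. This is a hedged sketch on a plain Python dict: the key names (`input_dim`, `embd_dim`, `fpn_dim`, `head_dim`) and the starting value of 512 are assumptions for illustration and should be checked against the repo's actual YAML configs (e.g., anet_tsp.yaml).

```python
import copy

def halve_model_for_rgb(cfg):
    """Return a copy of cfg set up for 1024-channel (RGB-only) input."""
    new_cfg = copy.deepcopy(cfg)
    # Only the first 1024 feature channels are fed to the model.
    new_cfg["dataset"]["input_dim"] = 1024
    # Shrink the embedding / FPN / head dims to 256, as suggested above.
    for key in ("embd_dim", "fpn_dim", "head_dim"):
        new_cfg["model"][key] = 256
    return new_cfg

# Hypothetical base config; key names and values are assumptions.
base_cfg = {
    "dataset": {"input_dim": 2048},
    "model": {"embd_dim": 512, "fpn_dim": 512, "head_dim": 512},
}
rgb_cfg = halve_model_for_rgb(base_cfg)
print(rgb_cfg)
```

The features themselves still have to be truncated to the first 1024 channels when they are loaded; changing the config alone is not enough.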

tzzcl commented 1 year ago

In addition to Yin's answers, I've verified that the first 1024 channels are the RGB features and the remaining 1024 channels are the flow features (average of the original 10-crop features).