taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0

pretrained model #19

Open dmmSJTU opened 1 year ago

dmmSJTU commented 1 year ago

Hello, I really appreciate your work. May I ask where I can download the ImageNet pre-trained ViT model?

taoyang1122 commented 1 year ago

Hi, you can download the ImageNet pre-trained ViT from timm.
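For readers unfamiliar with timm, here is a minimal sketch of fetching such weights. The model name `vit_base_patch16_224` and the helper `load_imagenet_vit` are assumptions chosen for illustration; the thread does not say which ViT variant or checkpoint this repository expects, nor how to plug the weights into its configs.

```python
def load_imagenet_vit(model_name="vit_base_patch16_224", pretrained=True):
    """Create a ViT through timm (hypothetical helper, not part of this repo).

    With pretrained=True, timm downloads the ImageNet weights for the
    given model name on first use and caches them locally.
    """
    import timm  # imported lazily so this file loads even without timm installed

    return timm.create_model(model_name, pretrained=pretrained)
```

Calling `load_imagenet_vit()` should return a `torch.nn.Module`; its `state_dict()` can then be saved with `torch.save` if a checkpoint file is needed.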

dmmSJTU commented 1 year ago

Got it. Thank you! What does "Views = #frames × #temporal × #spatial" mean? Does it correspond to "clip_len, frame_interval, num_clips" used in training?

taoyang1122 commented 1 year ago

We usually do multi-view testing. So that is 'number of frames in one clip' x 'number of clips sampled by temporal cropping' x 'number of clips sampled by spatial cropping'.
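As a rough illustration of the arithmetic, the sketch below counts how many clips and frames one video contributes at test time. The function name `view_counts` and the example numbers are hypothetical and are not taken from the paper's tables.

```python
def view_counts(num_frames, num_temporal, num_spatial):
    """Return (clips per video, frames per video) for a views setting."""
    clips = num_temporal * num_spatial   # one forward pass per clip (view)
    total_frames = num_frames * clips    # frames processed for one video
    return clips, total_frames


# Hypothetical setting: 8 frames per clip, 1 temporal crop, 3 spatial crops.
clips, total_frames = view_counts(8, 1, 3)
```

With these numbers each video is evaluated as 3 clips covering 24 frames in total. A setting of 8 x 3 x 1 gives the same counts but varies the temporal position instead of the spatial one.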

dmmSJTU commented 1 year ago

OK, thank you. Why is the running environment on the server affected after I modify https://github.com/taoyang1122/adapt-image-models/blob/main/mmaction/datasets/base.py? How can I modify base.py without affecting the running environment?

taoyang1122 commented 1 year ago

Hi, I am not sure what you mean by the environment being affected. Could you please explain in more detail?

DavidYanAnDe commented 1 year ago

> We usually do multi-view testing. So that is 'num_frame in one clip' x 'number of clip sampled by temporal crop' x 'number of clips sampled by spatial crop'.

Could you please explain what these entries mean? I understand that #frames is the number of frames in a single clip, but how should I interpret 'number of clips sampled by temporal crop' and 'number of clips sampled by spatial crop', and how exactly are they obtained? Additionally, some tables in the paper use frames x 1 x 3 and others use frames x 3 x 1. Why is there such a difference?

taoyang1122 commented 1 year ago

How they are obtained can differ across methods and datasets. For example, three temporal crops can be obtained by sampling from the first part, middle part, and last part of the video. Three spatial crops could be obtained by cropping the upper-left corner, the center, and the lower-right corner. The final prediction is the ensemble of the different crops (i.e., views). So the numbers mean frames x num_temporal_crops x num_spatial_crops during testing.
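The description above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the temporal crops are spaced evenly from start to end, and the ensemble is taken to be a simple average of per-view class scores, which the thread does not specify. It is not the sampling code used in this repository.

```python
def temporal_starts(total_frames, clip_span, num_crops):
    """Start frame of each temporal crop, spread from the first to the last part."""
    last = max(total_frames - clip_span, 0)  # latest valid start index
    if num_crops == 1:
        return [last // 2]                   # single crop: take the middle
    return [round(i * last / (num_crops - 1)) for i in range(num_crops)]


def spatial_corners(width, height, crop, num_crops=3):
    """(left, top) of each square crop: upper-left, center, lower-right."""
    max_x, max_y = width - crop, height - crop
    corners = [(0, 0), (max_x // 2, max_y // 2), (max_x, max_y)]
    if num_crops == 3:
        return corners
    if num_crops == 1:
        return [corners[1]]                  # single crop: center only
    raise ValueError("this sketch only handles 1 or 3 spatial crops")


def ensemble(view_scores):
    """Average class scores over views (assumed form of the ensemble)."""
    n = len(view_scores)
    return [sum(per_class) / n for per_class in zip(*view_scores)]
```

For a 100-frame video and clips spanning 16 frames, `temporal_starts(100, 16, 3)` gives `[0, 42, 84]`. For a 320 x 256 frame and a 224-pixel crop, `spatial_corners(320, 256, 224)` gives `[(0, 0), (48, 16), (96, 32)]`.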