Open dmmSJTU opened 1 year ago
Hi, you can download the ImageNet pre-trained ViT from timm.
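For readers who have not used timm before, here is a minimal sketch of fetching an ImageNet pre-trained ViT. The model name `vit_base_patch16_224` and the helper `load_imagenet_vit` are assumptions for illustration; the thread does not say which ViT variant or checkpoint format the repo's configs expect, so check the config you plan to run.

```python
def load_imagenet_vit(model_name: str = "vit_base_patch16_224"):
    """Return a timm ViT with pre-trained weights (downloaded on first use).

    The default model name is an assumption; pick the variant that matches
    the backbone in your config.
    """
    import timm  # third-party: pip install timm

    # pretrained=True makes timm download and load the released weights.
    model = timm.create_model(model_name, pretrained=True)
    return model.eval()


# Usage (requires network access on the first call):
#   model = load_imagenet_vit()
#   state_dict = model.state_dict()  # weights as a regular PyTorch state dict
```

The resulting state dict can be saved with `torch.save` if a local checkpoint file is needed.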
Got it. Thank you! What does "Views = #frames × #temporal × #spatial" mean? Does it correspond to "clip_len, frame_interval, num_clips" in training?
We usually do multi-view testing. So that is 'number of frames in one clip' x 'number of clips sampled by temporal crop' x 'number of clips sampled by spatial crop'.
OK, thank you. Why is the running environment on the server affected after I modify https://github.com/taoyang1122/adapt-image-models/blob/main/mmaction/datasets/base.py? How can I modify base.py without affecting the running environment?
Hi, I am not sure what you mean by the environment being affected. Could you please explain in more detail?
Could you please explain what these entries mean? I understand that #frames is the number of frames in a single clip, but how should I understand 'number of clips sampled by temporal crop' and 'number of clips sampled by spatial crop', and how exactly are they obtained? Additionally, some tables in the paper use frames x 1 x 3 and others use frames x 3 x 1. Why is there such a difference?
The way they are obtained can differ between methods and datasets. For example, three temporal crops can be obtained by sampling clips from the first, middle, and last parts of the video. Three spatial crops could be obtained by cropping the upper-left corner, the center, and the lower-right corner. The final prediction is the ensemble of the predictions from the different crops (i.e., views). So the numbers mean frames x num_temporal_crops x num_spatial_crops during testing.
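To make the view arithmetic concrete, here is a rough sketch of the scheme described above: evenly spaced temporal crops, spatial crops along the diagonal, and score averaging as the ensemble. All function names, the even spacing, and plain averaging are assumptions for illustration; this is not the repo's actual test pipeline, and as noted, methods and datasets may sample differently.

```python
from itertools import product
from typing import List, Tuple


def temporal_clip_starts(num_video_frames: int, clip_len: int,
                         frame_interval: int, num_temporal_crops: int) -> List[int]:
    """Start frames of clips spread from the beginning to the end of the video."""
    span = (clip_len - 1) * frame_interval + 1  # frames covered by one clip
    max_start = max(num_video_frames - span, 0)
    if num_temporal_crops == 1:
        return [max_start // 2]  # a single clip from the middle
    return [i * max_start // (num_temporal_crops - 1)
            for i in range(num_temporal_crops)]


def spatial_crop_offsets(height: int, width: int, crop_size: int,
                         num_spatial_crops: int) -> List[Tuple[int, int]]:
    """(top, left) offsets: upper-left, center, lower-right when three crops are used."""
    max_top, max_left = height - crop_size, width - crop_size
    if num_spatial_crops == 1:
        return [(max_top // 2, max_left // 2)]  # center crop only
    return [(i * max_top // (num_spatial_crops - 1),
             i * max_left // (num_spatial_crops - 1))
            for i in range(num_spatial_crops)]


def enumerate_views(starts: List[int], offsets: List[Tuple[int, int]]):
    """Every (temporal crop, spatial crop) pair is one view of the video."""
    return list(product(starts, offsets))


def ensemble_scores(view_scores: List[List[float]]) -> List[float]:
    """Average per-class scores over all views to get the final prediction."""
    num_views = len(view_scores)
    return [sum(per_class) / num_views for per_class in zip(*view_scores)]


# "8 x 1 x 3": 8 frames per clip, 1 temporal crop, 3 spatial crops -> 3 views.
views_1x3 = enumerate_views(temporal_clip_starts(100, 8, 4, 1),
                            spatial_crop_offsets(256, 340, 224, 3))
# "8 x 3 x 1": 8 frames per clip, 3 temporal crops, 1 spatial crop -> 3 views.
views_3x1 = enumerate_views(temporal_clip_starts(100, 8, 4, 3),
                            spatial_crop_offsets(256, 340, 224, 1))
```

Both settings produce three views per video; they differ only in whether the extra views come from different time segments or from different image regions.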
Hello, I really appreciate your work. May I ask where I can download the ImageNet pre-trained ViT model?