taoyang1122 / adapt-image-models

[ICLR'23] AIM: Adapting Image Models for Efficient Video Action Recognition
Apache License 2.0
278 stars 21 forks

About input frames and sampling interval #6

Closed BinhuiXie closed 1 year ago

BinhuiXie commented 1 year ago

Thank you for your excellent work! I would like to ask about clip_len and frame_interval for Kinetics. Appendix A.1 says, "We evaluate the model on 8, 16, 32 frames and the sampling interval is 16, 8, 4, respectively." Does this mean that for Kinetics-400/700 the data pipelines (train, val, test) should all use these settings? For example, in configs/recognition/vit/vit_imagenet_k400.py, the data pipeline does match the paper.

i.e., clip_len=8, frame_interval=16 for the train/val/test pipelines, as the paper describes.

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L19-L21 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L32-L39 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L49-L56

However, the CLIP-pretrained configs are confusing:

  1. vitclip_base_k400 uses clip_len=32, frame_interval=16 for the train pipeline, but clip_len=32, frame_interval=8 for the val/test pipelines. If clip_len=32, shouldn't frame_interval be 4?

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L19-L21 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L32-L39 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L49-L56

  2. vitclip_large_k400 uses clip_len=16, frame_interval=16 for the train/val/test pipelines. If clip_len=16, shouldn't frame_interval be 8?

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L19-L21 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L32-L39 https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L49-L56
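For reference, here is a small sketch of how I read the Appendix A.1 pairing. The `PAPER_SETTINGS` mapping and the 128-frame temporal span are my own inference from the quoted sentence, not something stated explicitly in the paper or configs:

```python
# My reading of Appendix A.1: clip lengths 8, 16, 32 pair with
# sampling intervals 16, 8, 4 respectively.
# Note that every pair covers the same temporal span of
# 8 * 16 = 16 * 8 = 32 * 4 = 128 raw frames (my inference).
PAPER_SETTINGS = {8: 16, 16: 8, 32: 4}  # clip_len -> frame_interval

def frame_interval_for(clip_len: int) -> int:
    """Return the sampling interval the paper pairs with clip_len."""
    return PAPER_SETTINGS[clip_len]

# Sanity check: each pairing spans the same 128-frame window.
for clip_len, interval in PAPER_SETTINGS.items():
    assert clip_len * interval == 128
```

Under this reading, vitclip_base_k400 (clip_len=32) would use frame_interval=4 and vitclip_large_k400 (clip_len=16) would use frame_interval=8, which is what prompted the question above.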

Thank you.

taoyang1122 commented 1 year ago

Hi @BinhuiXie, thanks for your interest in our work. You can safely follow the settings described in the paper. I will update the code.