Thank you for your excellent work! I have a question about `clip_len` and `frame_interval` for Kinetics. Appendix A.1 says: "We evaluate the model on 8, 16, 32 frames and the sampling interval is 16, 8, 4, respectively." (i.e., `clip_len * frame_interval = 128` in every case). Does this mean that for Kinetics-400/700 the data pipelines (train, val, test) should all use the same settings?

For example, `configs/recognition/vit/vit_imagenet_k400.py` matches the paper: `clip_len=8, frame_interval=16` for the train/val/test pipelines.

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L19-L21
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L32-L39
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vit_imagenet_k400.py#L49-L56
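For concreteness, this is what I would expect the `SampleFrames` step to look like under the paper's scheme; `num_clips=1` is only my assumption for illustration, not taken from the released configs:

```python
# Sampling settings implied by Appendix A.1: clip_len * frame_interval = 128,
# so the temporal span covered by one clip stays fixed across frame counts.
# num_clips=1 is an assumption for illustration.
expected_sample_frames = {
    8:  dict(type='SampleFrames', clip_len=8,  frame_interval=16, num_clips=1),
    16: dict(type='SampleFrames', clip_len=16, frame_interval=8,  num_clips=1),
    32: dict(type='SampleFrames', clip_len=32, frame_interval=4,  num_clips=1),
}
```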
But for the CLIP-pretrained models, the configs are confusing.

In `vitclip_base_k400`, the train pipeline uses `clip_len=32, frame_interval=16`, while the val/test pipelines use `clip_len=32, frame_interval=8`. Shouldn't `frame_interval` be 4 when `clip_len=32`?

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L19-L21
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L32-L39
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_base_k400.py#L49-L56
In `vitclip_large_k400`, the train/val/test pipelines all use `clip_len=16, frame_interval=16`. Shouldn't `frame_interval` be 8 when `clip_len=16`?

https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L19-L21
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L32-L39
https://github.com/taoyang1122/adapt-image-models/blob/392647ea000c9bda8e1123c4efb6d61cd398025c/configs/recognition/vit/vitclip_large_k400.py#L49-L56
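In other words, following the same rule, I would have expected the two CLIP configs to read (again, `num_clips=1` only for illustration):

```python
# vitclip_base_k400: clip_len=32 -> expected frame_interval=4 (32 * 4 = 128)
dict(type='SampleFrames', clip_len=32, frame_interval=4, num_clips=1)
# vitclip_large_k400: clip_len=16 -> expected frame_interval=8 (16 * 8 = 128)
dict(type='SampleFrames', clip_len=16, frame_interval=8, num_clips=1)
```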
Thank you.