Comparison between frame-wise and clip-wise feature extraction in terms of computation time

Hi Vladimir!

Thank you so much for your efforts on this project! This has been really helpful for my research. :)

I have a question not related to this repo, but directly related to the project. I apologize in advance if this is the wrong place to ask.

For a paper I'm currently working on, I would like to compare frame-wise feature extraction (CLIP, ResNet) with clip-wise feature extraction (C3D, S3D) in terms of training and inference time. My intuition and some quick experiments suggest that frame-wise extraction is faster for both training and inference, but so far I haven't found any references to support this, so I thought I'd check with you as well. Have you come across any references that compare the computation time of frame-wise and clip-wise feature extraction?
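
For anyone who wants to reproduce such a comparison, a minimal timing sketch might look like the one below. It is only an illustration under stated assumptions: `time_per_frame` and the two dummy extractors are hypothetical stand-ins, not functions from video_features, and the 16-frame clip length is an arbitrary choice. The idea is to feed the same frames either one at a time (frame-wise) or in fixed-size clips (clip-wise) and report the median wall-clock time per frame.

```python
import time
from statistics import median
from typing import Callable, Sequence


def time_per_frame(extract: Callable[[Sequence], object],
                   frames: Sequence,
                   unit_size: int,
                   n_warmup: int = 2,
                   n_runs: int = 5) -> float:
    """Median wall-clock seconds per frame for one extractor.

    unit_size=1 feeds single frames (frame-wise); a larger unit_size
    feeds fixed-length clips (clip-wise). Trailing frames that do not
    fill a whole unit are dropped, so both settings are normalised by
    the number of frames actually processed.
    """
    units = [frames[i:i + unit_size]
             for i in range(0, len(frames) - unit_size + 1, unit_size)]
    if not units:
        raise ValueError("not enough frames for one unit")
    n_frames = len(units) * unit_size

    def one_pass() -> float:
        start = time.perf_counter()
        for unit in units:
            extract(unit)
        return time.perf_counter() - start

    for _ in range(n_warmup):  # warm-up passes; their timings are discarded
        one_pass()
    return median(one_pass() for _ in range(n_runs)) / n_frames


if __name__ == "__main__":
    frames = list(range(64))               # placeholder for decoded frames
    dummy_2d = lambda unit: sum(unit)      # stand-in for a frame-wise model
    dummy_3d = lambda unit: sum(unit) * 2  # stand-in for a clip-wise model
    print("frame-wise s/frame:", time_per_frame(dummy_2d, frames, unit_size=1))
    print("clip-wise  s/frame:", time_per_frame(dummy_3d, frames, unit_size=16))
```

With real models on a GPU, the device would also need to be synchronized before each clock read (for example with `torch.cuda.synchronize()` in PyTorch); otherwise asynchronous kernel launches can make the measured times misleadingly small.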

Best, Noga

v-iashin / video_features, issue #103