I have a few questions about your training process.
(1) Did you fix the number of frames (clips) at 24? In every config file the clip length is consistently 24. Does that imply that any number larger or smaller than 24 performs worse than 24?
(2) During training, do you shuffle the order of the frames (clips)? My feeling is that shuffling would be improper, since the frame-level attention layers also learn the order of the frames.
Hi @HyeonHo99, thank you for your interest in our work. Below are some comments regarding your questions:
We set the number of frames to 24 so that the code can run on a 24GB GPU. Feel free to explore other choices of video length.
In our experiments we did not shuffle the order of the frames. If you want to try video-image co-training, you can disable the temporal components when training on images.
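To illustrate the idea, here is a minimal plain-Python sketch (not the repo's actual API; all function names are illustrative) of how a forward pass can bypass the temporal components when a batch contains single images rather than video clips:

```python
# Sketch of video-image co-training: spatial layers run on every frame,
# while temporal layers (which mix information across frames) are skipped
# when the input is a single image. Names here are hypothetical.

def spatial_block(frames):
    """Stand-in for a spatial layer: applied to each frame independently."""
    return [f + 1 for f in frames]

def temporal_block(frames):
    """Stand-in for a temporal layer: mixes information across frames,
    so it is only meaningful when there is more than one frame."""
    mean = sum(frames) / len(frames)
    return [f + mean for f in frames]

def forward(frames, use_temporal=True):
    """Always apply spatial layers; apply temporal layers only for videos."""
    out = spatial_block(frames)
    if use_temporal and len(out) > 1:
        out = temporal_block(out)
    return out

video = [0.0, 1.0, 2.0]    # a 3-frame clip: temporal mixing is applied
image = [0.0]              # a single image: temporal block is skipped
```

In a real implementation the same effect is usually achieved by freezing or short-circuiting the temporal attention/convolution modules for image batches, so both data sources share the spatial weights.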