Open yoesak opened 3 months ago
PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for several academic video data sets, for example MSR-VTT https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/msrvtt_cap.py
However, there are no fine-tuned checkpoints available, and some extra work is required to set up data loading for fine-tuning. Please see the video configs for details.
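The fixed-stride subsampling mentioned above can be sketched roughly as follows. This is not the actual big_vision preprocessing op; it is a minimal illustration, assuming the video has already been decoded into a `(T, H, W, C)` NumPy array, and the function name `subsample_frames` and its parameters are hypothetical:

```python
import numpy as np

def subsample_frames(frames: np.ndarray, stride: int, max_frames: int) -> np.ndarray:
    """Take every `stride`-th frame from a (T, H, W, C) array, capped at max_frames.

    The resulting stack of frames is what would be fed to the image encoder
    as a sequence, without any architecture modifications.
    """
    return frames[::stride][:max_frames]

# Example: a dummy 30-frame "video" of 224x224 RGB frames.
video = np.zeros((30, 224, 224, 3), dtype=np.uint8)
clip = subsample_frames(video, stride=4, max_frames=8)
# clip is a (8, 224, 224, 3) stack: frames 0, 4, 8, ..., 28.
```

The real preprocessing ops in the linked configs handle decoding and resizing as well; see the MSR-VTT config above for the exact pipeline.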
Wow, great thank you for the guidance. 🙏
@mitscha Could you please share some example code showing how one could do short video captioning with a base pretrained model? I'm very interested in this.
Are the paligemma-mix models also fine-tuned for video captioning?
Sorry, what kind of extra work is needed?
I read in the README file that PaliGemma can caption a short video. Can anyone guide me on how to do that?
Does it extract every frame of the video? And does the PaliGemma tokenizer support video directly, or do I need to convert my video to a NumPy array?