Open yoesak opened 3 months ago
PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for several academic video data sets, for example MSR-VTT https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/paligemma/transfers/msrvtt_cap.py
However, there are no fine-tuned checkpoints available, and some extra work is required to set up data loading for fine-tuning. Please see the video configs for details.
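The fixed-stride subsampling mentioned above can be sketched roughly as follows. This is not the actual big_vision preprocessing op; it is a minimal illustration, assuming the video has already been decoded into a `(T, H, W, C)` NumPy array, and the function name `subsample_frames` and its parameters are hypothetical:

```python
import numpy as np

def subsample_frames(frames: np.ndarray, stride: int, max_frames: int) -> np.ndarray:
    """Take every `stride`-th frame from a (T, H, W, C) array, capped at max_frames.

    The resulting stack of frames is what would be fed to the image encoder
    as a sequence, without any architecture modifications.
    """
    return frames[::stride][:max_frames]

# Example: a dummy 30-frame "video" of 224x224 RGB frames.
video = np.zeros((30, 224, 224, 3), dtype=np.uint8)
clip = subsample_frames(video, stride=4, max_frames=8)
# clip is a (8, 224, 224, 3) stack: frames 0, 4, 8, ..., 28.
```

The real preprocessing ops in the linked configs handle decoding and resizing as well; see the MSR-VTT config above for the exact pipeline.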
Wow, great thank you for the guidance. 🙏
@mitscha Could you please share some example code showing how one could do short video captioning with a base pretrained model? I'm very interested in this.
Are the paligemma-mix models also fine-tuned for video captioning?
Sorry, what kind of extra work is needed?
I read in the README file that PaliGemma can caption a short video. Can anyone guide me on how to do that?
Does it extract every frame of the video? And does the PaliGemma tokenizer support video directly, or do I need to convert my video to a NumPy array?