modelscope / ms-swift

Use PEFT or Full-parameter to finetune 350+ LLMs or 90+ MLLMs. (Qwen2.5, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V-2.6, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0
3.49k stars 299 forks

InternVL2 customized fine-tuning for video-to-text #1418

Closed BillChan226 closed 2 months ago

BillChan226 commented 2 months ago

Hi! Great work! I'm wondering whether swift will soon support fine-tuning InternVL2 on a customized video-to-text dataset? Thanks!

hjh0119 commented 2 months ago

You can specify the absolute video path with the `videos` key, as follows:

{"query": "Describe this video in detail. Don't repeat", "response": "xxxxxxxxx", "history": [], "videos": ["video_path"]}
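For reference, a minimal sketch (not from the thread) of writing samples in this format to a JSONL training file using only Python's standard library; the output filename and video path are illustrative placeholders:

```python
import json

# Each line of the JSONL file is one training sample; the "videos" key
# holds a list of absolute paths to the video files.
samples = [
    {
        "query": "Describe this video in detail. Don't repeat",
        "response": "xxxxxxxxx",
        "history": [],
        "videos": ["/abs/path/to/video1.mp4"],  # hypothetical path
    },
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```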
BillChan226 commented 2 months ago

Hi, thanks for the reply! It works! However, I'm wondering how to set the number of frames sampled per video when fine-tuning InternVL2?

hjh0119 commented 2 months ago

Modify `num_segments` in https://github.com/modelscope/swift/blob/main/swift/llm/utils/vision_utils.py#L116
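For context, `num_segments` controls how many frames are sampled from each video. A minimal sketch of the evenly spaced sampling scheme commonly used for this (an illustrative stand-alone function, not the exact swift implementation):

```python
def sample_frame_indices(total_frames: int, num_segments: int = 8) -> list:
    """Return num_segments frame indices spread evenly across the video."""
    # Split the clip into num_segments equal bins and take the middle
    # frame of each bin, so the samples cover the whole duration.
    seg_size = total_frames / num_segments
    return [
        min(int(seg_size * (i + 0.5)), total_frames - 1)
        for i in range(num_segments)
    ]
```

For a 100-frame video with `num_segments=8`, this yields indices `[6, 18, 31, 43, 56, 68, 81, 93]`; raising `num_segments` trades GPU memory for denser temporal coverage.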