modelscope / ms-swift

Use PEFT or Full-parameter to finetune 400+ LLMs or 100+ MLLMs. (LLM: Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, Gemma2, ...; MLLM: Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, InternVL2, MiniCPM-V-2.6, GLM4v, Xcomposer2.5, Yi-VL, DeepSeek-VL, Phi3.5-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/Instruction/index.html
Apache License 2.0

Grounding + VQA multi-turn conversations format #2511


gaussiangit commented 3 days ago

I would like to format a dataset for both VQA and grounding (object detection). How should I format the dataset for finetuning? Should I generate JSON like the following?

{"query": "How many apples ?", "response": "There are 4 apples", "images": ["abc.jpg"]} {"query": "Find Apple", "response": " [bbox coordinates]", "images": ["/co01507.jpg"], "objects": "[{\"caption\": \"apples on table\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }

JHL328 commented 3 days ago

You should put all of the information or output you expect from the model in "response".
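For example, you could sanity-check your JSONL with something like this (just a sketch, not part of ms-swift; field names follow the example in your question): every record should have a non-empty "response" containing the full expected output, and any "objects" string should parse into entries with the keys shown above.

```python
import json

REQUIRED_OBJECT_KEYS = {"caption", "bbox", "bbox_type", "image"}

def check_record(line: str) -> None:
    record = json.loads(line)
    # Everything the model is expected to output must live in "response".
    assert record.get("response"), "empty or missing 'response'"
    # "objects" is a JSON-encoded string in the example above.
    if "objects" in record:
        for obj in json.loads(record["objects"]):
            missing = REQUIRED_OBJECT_KEYS - obj.keys()
            assert not missing, f"object missing keys: {missing}"

with open("train.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            check_record(line)
        except (AssertionError, json.JSONDecodeError) as err:
            print(f"line {lineno}: {err}")
```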