I would like to fine-tune on both VQA and grounding (object detection) data. How should I format the dataset for fine-tuning? Should I generate JSON lines like the following?
{"query": "How many apples ?", "response": "There are 4 apples", "images": ["abc.jpg"]}
{"query": "Find Apple", "response": " [bbox coordinates]", "images": ["/co01507.jpg"], "objects": "[{\"caption\": \"apples on table\", \"bbox\": [138, 136, 235, 359], \"bbox_type\": \"real\", \"image\": 0}]" }