haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Clarification on JSON Format for Object Detection Task #1512

Open makemecker opened 5 months ago

makemecker commented 5 months ago

Question

Hi,

Thank you for the comprehensive guidelines on fine-tuning LLaVA on custom datasets! The provided example for generating tag-style captions for Stable Diffusion has been very helpful.

I am currently working on fine-tuning LLaVA for object detection tasks and have created the following JSON format:

{
  "image": "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "Find objects of the classes vehicle, human, swimming pool, landfill, building, tree, satellite antenna and street light in the image. Provide the coordinates of the bounding boxes in the format class: [bounding box coordinates]."
    },
    {
      "from": "gpt",
      "value": "vehicle: [0.85, 0.139, 0.024, 0.049], vehicle: [0.603, 0.158, 0.029, 0.055], human: [0.945, 0.018, 0.014, 0.036], human: [0.83, 0.028, 0.017, 0.041], human: [0.85, 0.123, 0.024, 0.067]."
    }
  ]
}
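
For reference, records of this shape could be generated from annotations with something like the minimal sketch below. The helper names (`format_answer`, `build_record`) are hypothetical and not part of LLaVA. The `id` field and the `<image>` placeholder at the start of the human turn are assumptions taken from the example in the custom-data fine-tuning guide; neither appears in the snippet above. The sketch makes no assumption about what the four numbers in each box mean.

```python
import json

# Prompt text copied from the example above.
PROMPT = (
    "Find objects of the classes vehicle, human, swimming pool, landfill, "
    "building, tree, satellite antenna and street light in the image. "
    "Provide the coordinates of the bounding boxes in the format "
    "class: [bounding box coordinates]."
)


def format_answer(boxes):
    """Render [(class_name, [a, b, c, d]), ...] as 'class: [a, b, c, d], ...'."""
    if not boxes:
        # An image without objects needs its own answer text;
        # this sketch deliberately does not choose one.
        raise ValueError("no boxes given")
    parts = []
    for name, coords in boxes:
        numbers = ", ".join(str(v) for v in coords)
        parts.append(f"{name}: [{numbers}]")
    return ", ".join(parts) + "."


def build_record(sample_id, image_path, boxes):
    """Build one training record (hypothetical helper, not a LLaVA API)."""
    return {
        # Assumption: "id" and the "<image>\n" prefix follow the guide's example.
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": "<image>\n" + PROMPT},
            {"from": "gpt", "value": format_answer(boxes)},
        ],
    }


record = build_record(
    "997bb945-628d-4724-b370-b84de974a19f",
    "part-000001/997bb945-628d-4724-b370-b84de974a19f.jpg",
    [
        ("vehicle", [0.85, 0.139, 0.024, 0.049]),
        ("human", [0.945, 0.018, 0.014, 0.036]),
    ],
)

# The training file would be a JSON list of such records.
dataset_json = json.dumps([record], indent=2)
```

Generating the answer string from one function keeps the spacing and punctuation identical across all samples.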

Could you please confirm if this JSON structure is correct for fine-tuning LLaVA on object detection tasks? Specifically, I would like to know:

  1. Is the structure of the JSON file appropriate for object detection?
  2. Are the metadata fields correctly defined?
  3. Is the format for bounding box coordinates accurate?
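
To make question 3 concrete, here is a rough sketch of how the answer strings could be parsed back and sanity-checked before training. It assumes every box has exactly four values normalized to the range [0, 1]. That range is my assumption about the intended convention, not something LLaVA requires. The names `parse_answer` and `check_boxes` are hypothetical.

```python
import re

# Class names taken from the prompt in the example.
CLASSES = {
    "vehicle", "human", "swimming pool", "landfill",
    "building", "tree", "satellite antenna", "street light",
}

# Matches "class name: [comma-separated numbers]".
BOX_PATTERN = re.compile(r"([A-Za-z][A-Za-z ]*?):\s*\[([^\]]*)\]")


def parse_answer(text):
    """Return [(class_name, [floats]), ...]; raises ValueError on a bad number."""
    boxes = []
    for name, coords in BOX_PATTERN.findall(text):
        values = [float(v) for v in coords.split(",")]
        boxes.append((name.strip(), values))
    return boxes


def check_boxes(boxes, allowed_classes=CLASSES):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    for index, (name, values) in enumerate(boxes):
        if name not in allowed_classes:
            problems.append(f"box {index}: unknown class {name!r}")
        if len(values) != 4:
            problems.append(f"box {index}: expected 4 values, got {len(values)}")
        elif not all(0.0 <= v <= 1.0 for v in values):
            # Assumption: coordinates are normalized to [0, 1].
            problems.append(f"box {index}: value outside [0, 1]")
    return problems


answer = (
    "vehicle: [0.85, 0.139, 0.024, 0.049], vehicle: [0.603, 0.158, 0.029, 0.055], "
    "human: [0.945, 0.018, 0.014, 0.036], human: [0.83, 0.028, 0.017, 0.041], "
    "human: [0.85, 0.123, 0.024, 0.067]."
)
parsed = parse_answer(answer)
issues = check_boxes(parsed)
```

The same parser could be reused on model outputs at evaluation time, since the model is trained to answer in this format.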

Any additional insights or corrections would be greatly appreciated.

Thank you for your assistance!

kangISU commented 2 months ago

How's the result of this object detection task?

makemecker commented 2 months ago

> How's the result of this object detection task?

I used the information from this guide. However, it doesn't specifically cover object detection tasks, unfortunately.