[Usage] Question about fine-tuning Custom_Data

haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

https://llava.hliu.cc

Apache License 2.0

20.16k stars 2.22k forks source link

[Usage] Question about fine-tuning Custom_Data #1306

Open gapjialin opened 7 months ago

gapjialin commented 7 months ago

Describe the issue

Hello, there is currently a self built dataset for 80K object detection, which is used to detect the position of objects in the image. The image size is 1920x1080. When I use this data for fine-tuning, it is difficult to detect the position of objects in the image. What is the problem? The machine I am using is 8xA40, and Lora is used for fine-tuning. The other training parameters remain unchanged. My dataset example is as follows:
human: Verify if there is a presence of people in the image. gpt: There are 1 people. human: Pinpoint and describe the exact spots where each person can be found in this picture. gpt: person 1's bounding box coordinate of the region is [0.32, 0.65, 0.34, 0.75].

tctrautman commented 7 months ago

It sounds like you might need to resize your images -- take a look at "Increasing the input image resolution" improvement point in the LLaVA NeXT blog post: https://llava-vl.github.io/blog/2024-01-30-llava-next/

gapjialin commented 7 months ago

It sounds like you might need to resize your images -- take a look at "Increasing the input image resolution" improvement point in the LLaVA NeXT blog post: https://llava-vl.github.io/blog/2024-01-30-llava-next/

Thank you! I noticed that during fine-tuning, there is a description of the coordinates in the image in the dataset. Does this coordinate correspond to the image before or after compression resolution? I really want to know this question.

Linziyang1999 commented 7 months ago

It sounds like you might need to resize your images -- take a look at "Increasing the input image resolution" improvement point in the LLaVA NeXT blog post: https://llava-vl.github.io/blog/2024-01-30-llava-next/

Thank you! I noticed that during fine-tuning, there is a description of the coordinates in the image in the dataset. Does this coordinate correspond to the image before or after compression resolution? I really want to know this question.

I think it is no different because you using range 0~1， just make true you using image_radio = spatial unpad.

moaldeen commented 7 months ago

It sounds like you might need to resize your images -- take a look at "Increasing the input image resolution" improvement point in the LLaVA NeXT blog post: https://llava-vl.github.io/blog/2024-01-30-llava-next/

Thank you! I noticed that during fine-tuning, there is a description of the coordinates in the image in the dataset. Does this coordinate correspond to the image before or after compression resolution? I really want to know this question.

I want to ask you what image size or resolution did you end up doing and did it work.