haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Recognition Error #1094

Open nhw649 opened 9 months ago

nhw649 commented 9 months ago

Question

Hello, I found that LLaVA, like CLIP, works at a very coarse granularity. For example, an image cropped from a ground-truth (GT) box in the COCO dataset is recognized by LLaVA as "Tennis racket", but the ground truth is "person". How can I solve this? Maybe my prompt design is not good enough.

haotian-liu commented 9 months ago

Hi, thank you for your interest. This is a known weakness of LLaVA-like models (especially when you give them the 1000 ImageNet classes). However, with a small tweak to the prompt, it seems to work much better in LLaVA-1.6:

Please select a category from the list. ['Person', 'Bicycle', 'Car', 'Motorcycle', 'Airplane', 'Bus', 'Train', 'Truck', 'Boat', 'Traffic light', 'Fire hydrant', 'Stop sign', 'Parking meter', 'Bench', 'Bird', 'Cat', 'Dog', 'Horse', 'Sheep', 'Cow', 'Elephant', 'Bear', 'Zebra', 'Giraffe', 'Backpack', 'Umbrella', 'Handbag', 'Tie', 'Suitcase', 'Frisbee', 'Skis', 'Snowboard', 'Sports ball', 'Kite', 'Baseball bat', 'Baseball glove', 'Skateboard', 'Surfboard', 'Tennis racket', 'Bottle', 'Wine glass', 'Cup', 'Fork', 'Knife', 'Spoon', 'Bowl', 'Banana', 'Apple', 'Sandwich', 'Orange', 'Broccoli', 'Carrot', 'Hot dog', 'Pizza', 'Donut', 'Cake', 'Chair', 'Couch', 'Potted plant', 'Bed', 'Dining table', 'Toilet', 'TV', 'Laptop', 'Mouse', 'Remote', 'Keyboard', 'Cell phone', 'Microwave', 'Oven', 'Toaster', 'Sink', 'Refrigerator', 'Book', 'Clock', 'Vase', 'Scissors', 'Teddy bear', 'Hair drier', 'Toothbrush'] Answer the question with the category directly without explanation.
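For anyone who wants to reuse this prompt over many crops, here is a minimal sketch of how it might be assembled and how a reply might be mapped back to a label. The helper names `build_prompt` and `match_category` and the shortened `CATEGORIES` list are hypothetical, not part of the LLaVA codebase, and the model call itself is deliberately left out.

```python
from typing import Optional, Sequence

# Shortened, illustrative subset; the full 80-class COCO list is in the prompt above.
CATEGORIES = ["Person", "Bicycle", "Car", "Sports ball", "Tennis racket"]


def build_prompt(categories: Sequence[str]) -> str:
    """Format a closed-set classification prompt in the style quoted above."""
    return (
        "Please select a category from the list. "
        f"{list(categories)} "
        "Answer the question with the category directly without explanation."
    )


def match_category(answer: str, categories: Sequence[str]) -> Optional[str]:
    """Map a model reply to one of the listed categories, or None if it is off-list."""
    # Drop surrounding whitespace, quotes, and a trailing period before comparing.
    cleaned = answer.strip().strip(".'\"").lower()
    for category in categories:
        if cleaned == category.lower():
            return category
    return None


if __name__ == "__main__":
    print(build_prompt(CATEGORIES))
    print(match_category(" person.\n", CATEGORIES))
```

Returning `None` for off-list replies makes it easy to count how often the model ignores the instruction, rather than silently scoring those replies as wrong classes.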

nhw649 commented 9 months ago

Oh, thanks.