[Question] Recognition Error

nhw649 commented 9 months ago

Question

Hello, I found that LLaVA, like CLIP, recognizes images at a very coarse granularity. For example, an image cropped from a GT box in the COCO dataset is recognized by LLaVA as "Tennis racket", but the ground truth is "person". How can I solve this? Maybe my prompt design is not good enough.

haotian-liu commented 9 months ago

Hi, thank you for the interest. This is a known weakness of LLaVA-like models (especially when you give them the 1000 ImageNet classes). However, with a small tweak to the prompt, it seems to work much better in LLaVA-1.6.

Please select a category from the list. ['Person', 'Bicycle', 'Car', 'Motorcycle', 'Airplane', 'Bus', 'Train', 'Truck', 'Boat', 'Traffic light', 'Fire hydrant', 'Stop sign', 'Parking meter', 'Bench', 'Bird', 'Cat', 'Dog', 'Horse', 'Sheep', 'Cow', 'Elephant', 'Bear', 'Zebra', 'Giraffe', 'Backpack', 'Umbrella', 'Handbag', 'Tie', 'Suitcase', 'Frisbee', 'Skis', 'Snowboard', 'Sports ball', 'Kite', 'Baseball bat', 'Baseball glove', 'Skateboard', 'Surfboard', 'Tennis racket', 'Bottle', 'Wine glass', 'Cup', 'Fork', 'Knife', 'Spoon', 'Bowl', 'Banana', 'Apple', 'Sandwich', 'Orange', 'Broccoli', 'Carrot', 'Hot dog', 'Pizza', 'Donut', 'Cake', 'Chair', 'Couch', 'Potted plant', 'Bed', 'Dining table', 'Toilet', 'TV', 'Laptop', 'Mouse', 'Remote', 'Keyboard', 'Cell phone', 'Microwave', 'Oven', 'Toaster', 'Sink', 'Refrigerator', 'Book', 'Clock', 'Vase', 'Scissors', 'Teddy bear', 'Hair drier', 'Toothbrush'] Answer the question with the category directly without explanation.
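
As a rough sketch, a prompt like the one above could be assembled from a category list, and the model's reply checked against that list, along these lines. The helper names `build_prompt` and `match_category` are hypothetical and not part of LLaVA, and the model call itself is left out.

```python
from typing import Optional, Sequence


def build_prompt(categories: Sequence[str]) -> str:
    """Assemble a closed-set classification prompt (hypothetical helper)."""
    # list(...) renders as ['Person', 'Bicycle', ...], matching the prompt above.
    return (
        "Please select a category from the list. "
        f"{list(categories)} "
        "Answer the question with the category directly without explanation."
    )


def match_category(answer: str, categories: Sequence[str]) -> Optional[str]:
    """Map a raw model reply back to one of the listed categories.

    Returns None when the reply is not an exact (case-insensitive) match,
    so that off-list answers can be counted separately.
    """
    # Drop surrounding whitespace, quotes, and a trailing period.
    cleaned = answer.strip().strip("'\".").strip().lower()
    for name in categories:
        if name.lower() == cleaned:
            return name
    return None


if __name__ == "__main__":
    # Small subset for illustration; the full COCO list has 80 entries.
    subset = ["Person", "Tennis racket", "Sports ball"]
    print(build_prompt(subset))
    print(match_category(" person.\n", subset))
```

Matching the reply against the list, rather than trusting it as-is, makes it easy to spot cases where the model answers with something outside the given categories.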

nhw649 commented 9 months ago

oh, thanks.

haotian-liu / LLaVA

[Question] Recognition Error #1094
