microsoft / JARVIS

JARVIS, a system to connect LLMs with the ML community. Paper: https://arxiv.org/pdf/2303.17580.pdf
MIT License

Incorrect answer? #100

Open yiboji opened 1 year ago

yiboji commented 1 year ago

I asked Jarvis "how many people in image /examples/g.jpg", and it answered: "Based on the inference results, there are 5 people in the image /examples/g.jpg. My workflow for your request is as follows: First, I used the model nlpconnect/vit-gpt2-image-captioning to generate a text description of the image. The generated text is 'a soccer player kicking a soccer ball'. Then, I used the model facebook/detr-resnet-50 to detect objects in the image. The inference result is an image with predicted boxes and labels, and the predicted labels are 'sports ball', 'person' (5 times). Finally, I used the model dandelin/vilt-b32-finetuned-vqa to answer the question 'how many people in image /examples/g.jpg'. The inference result is a list of answers with scores, and the answer with the highest score is '2'. There are no generated files of images, audios or videos in the inference results. I hope this answer is helpful to you. Please let me know if you have any other questions."

It appears that only 4 people are actually detected in the image... and one ball...
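For what it's worth, both steps can be reproduced directly with the Hugging Face `transformers` pipelines, which makes the disagreement between the detector and the VQA model easy to see. A minimal sketch, assuming the image path from the example above (the 0.9 score threshold is an arbitrary choice, and output fields may differ slightly across `transformers` versions):

```python
from transformers import pipeline

image = "examples/g.jpg"  # path from the example above, relative to the repo root

# Object detection: count how many boxes DETR labels as "person".
detector = pipeline("object-detection", model="facebook/detr-resnet-50")
detections = detector(image)
person_count = sum(1 for d in detections if d["label"] == "person" and d["score"] > 0.9)
print(f"DETR 'person' boxes: {person_count}")

# VQA: ask the same counting question directly.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answers = vqa(image=image, question="How many people are in the image?")
print(f"ViLT top answer: {answers[0]['answer']} (score {answers[0]['score']:.2f})")
```

Comparing the two makes the mismatch explicit: the final summary echoes the detection count (5) while the VQA model's top answer was '2'.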

LvJC commented 1 year ago

Same here, it seems the VQA model sucks...

[ User ]: Given a collection of image A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these picture?
[ Jarvis ]:  Based on the inference results, there are two zebras in the picture.

To answer your request, I used a workflow that consists of three tasks: image-to-text, object-detection, and visual-question-answering.

For the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning for all three images. This model is best suited for the task of image-to-text as it is specifically designed for image captioning and has the highest number of likes.

For the object-detection task, I used the model facebook/detr-resnet-50 for all three images. This model is best suited for the task of object detection as it has a ResNet-50 backbone and is specifically designed for this task. It also has a local inference endpoint which makes it faster and more stable.

For the visual-question-answering task, I used the model dandelin/vilt-b32-finetuned-vqa. This model is a Vision-and-Language Transformer (ViLT) model that has been fine-tuned on VQAv2. It is suitable for the task of visual-question-answering and has a local inference endpoint for speed and stability.

The inference results for the object-detection task are as follows:

For image A: /examples/a.jpg, the predicted objects are a potted plant and a cat.

For image B: /examples/b.jpg, the predicted object is a zebra.

For image C: /examples/c.jpg, the predicted objects are three zebras.

The inference results for the visual-question-answering task are as follows:

For image C: /examples/c.jpg, the answer to the question "how many zebras in these picture?" is "2".
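Since the VQA model was only run against image C, one workaround for counting questions is to skip VQA entirely and sum the object-detection results over all of the images. A rough sketch along those lines, reusing the same detector and the images from the log above (paths are relative here, and the score threshold is an arbitrary choice):

```python
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

images = {
    "A": "examples/a.jpg",
    "B": "examples/b.jpg",
    "C": "examples/c.jpg",
}

total_zebras = 0
for name, path in images.items():
    detections = detector(path)
    count = sum(1 for d in detections if d["label"] == "zebra" and d["score"] > 0.9)
    print(f"Image {name}: {count} zebra(s)")
    total_zebras += count

print(f"Total zebras across all images: {total_zebras}")
```

Based on the detection results reported above (one zebra in B, three in C), this would give 4 rather than the 2 that the VQA step returned.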

yiboji commented 1 year ago

It seems the visual-question-answering step only works on a single image. My hunch is that GPT itself is always the best choice for the question-answering task, and its dependent inputs should be text-based.
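Here is a rough sketch of that idea: serialize the per-image detection results into plain text and let the controller LLM answer the counting question over that text, instead of routing it to a single-image VQA model. The `ask_llm` call is a placeholder (not the actual JARVIS interface), and the scores below are made up for illustration:

```python
def detections_to_text(results):
    """Turn per-image detection results into a plain-text summary an LLM can count over."""
    lines = []
    for image_path, detections in results.items():
        labels = [d["label"] for d in detections if d["score"] > 0.9]
        lines.append(f"{image_path}: " + (", ".join(labels) if labels else "no objects detected"))
    return "\n".join(lines)

# Detection output shaped like the DETR pipeline's results; labels match the log above,
# scores are hypothetical.
results = {
    "/examples/a.jpg": [{"label": "potted plant", "score": 0.98}, {"label": "cat", "score": 0.99}],
    "/examples/b.jpg": [{"label": "zebra", "score": 0.99}],
    "/examples/c.jpg": [{"label": "zebra", "score": 0.98},
                        {"label": "zebra", "score": 0.97},
                        {"label": "zebra", "score": 0.95}],
}

prompt = (
    "Objects detected in each image:\n"
    + detections_to_text(results)
    + "\n\nQuestion: how many zebras are in these pictures?"
)
# answer = ask_llm(prompt)  # placeholder for whatever text-only LLM call the controller uses
print(prompt)
```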

YellowPig-zp commented 1 year ago

Same issue here. It also seems like the model does not generalize well to user queries beyond those already written in the demo prompts.