yiboji opened this issue 1 year ago
Same here, it seems the VQA model sucks...
[ User ]: Given a collection of image A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these picture?
[ Jarvis ]: Based on the inference results, there are two zebras in the picture.
To answer your request, I used a workflow that consists of three tasks: image-to-text, object-detection, and visual-question-answering.
For the image-to-text task, I used the model nlpconnect/vit-gpt2-image-captioning for all three images. This model is best suited for the task of image-to-text as it is specifically designed for image captioning and has the highest number of likes.
For the object-detection task, I used the model facebook/detr-resnet-50 for all three images. This model is best suited for the task of object detection as it has a ResNet-50 backbone and is specifically designed for this task. It also has a local inference endpoint which makes it faster and more stable.
For the visual-question-answering task, I used the model dandelin/vilt-b32-finetuned-vqa. This model is a Vision-and-Language Transformer (ViLT) model that has been fine-tuned on VQAv2. It is suitable for the task of visual-question-answering and has a local inference endpoint for speed and stability.
The inference results for the object-detection task are as follows:
For image A: /examples/a.jpg, the predicted objects are a potted plant and a cat.
For image B: /examples/b.jpg, the predicted object is a zebra.
For image C: /examples/c.jpg, the predicted objects are three zebras.
The inference results for the visual-question-answering task are as follows:
For image C: /examples/c.jpg, the answer to the question "how many zebras in these picture?" is "2".
It seems the visual-question-answering model only works on a single image at a time. My hunch is that GPT itself is the better choice for the question-answering step, with its dependent inputs passed to it as text.
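For what it's worth, this is roughly how that ViLT model gets invoked (a minimal sketch using the plain transformers API, not the actual JARVIS dispatch code; the question string is mine, the paths are just the ones from the session above). It accepts exactly one image per call, so any cross-image total has to be assembled outside the model:

```python
# Minimal sketch: dandelin/vilt-b32-finetuned-vqa answers one (image, question)
# pair at a time, so counting across a collection must happen outside the model.
from PIL import Image
from transformers import ViltProcessor, ViltForQuestionAnswering

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-finetuned-vqa")
model = ViltForQuestionAnswering.from_pretrained("dandelin/vilt-b32-finetuned-vqa")

question = "How many zebras are in this picture?"
per_image_answers = {}
for path in ["/examples/a.jpg", "/examples/b.jpg", "/examples/c.jpg"]:
    image = Image.open(path).convert("RGB")
    encoding = processor(image, question, return_tensors="pt")
    logits = model(**encoding).logits
    # The model classifies over a fixed VQAv2 answer vocabulary.
    per_image_answers[path] = model.config.id2label[logits.argmax(-1).item()]

print(per_image_answers)  # one textual answer per image, nothing aggregated
```

If those per-image answers (or the detection counts) were serialized into the prompt as text and GPT did the arithmetic itself, the cross-image total would probably be more reliable than asking the VQA model about "these pictures" in one shot.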
Same issue here. It also seems like the model does not generalize well to user queries beyond the ones already written into the demo prompts.
I asked Jarvis "how many people in image /examples/g.jpg", and it answered: "Based on the inference results, there are 5 people in the image /examples/g.jpg. My workflow for your request is as follows: First, I used the model nlpconnect/vit-gpt2-image-captioning to generate a text description of the image. The generated text is 'a soccer player kicking a soccer ball'. Then, I used the model facebook/detr-resnet-50 to detect objects in the image. The inference result is an image with predicted boxes and labels, and the predicted labels are 'sports ball', 'person' (5 times). Finally, I used the model dandelin/vilt-b32-finetuned-vqa to answer the question 'how many people in image /examples/g.jpg'. The inference result is a list of answers with scores, and the answer with the highest score is '2'. There are no generated files of images, audios or videos in the inference results. I hope this answer is helpful to you. Please let me know if you have any other questions."
It appears there are actually only 4 people in the image... and one ball...
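If anyone wants to cross-check, counting the detector's "person" boxes directly is straightforward (rough sketch with the transformers object-detection pipeline; the two thresholds are my own guess, not whatever JARVIS uses internally):

```python
# Rough sketch: count "person" detections from facebook/detr-resnet-50 at a
# couple of confidence thresholds instead of trusting the VQA answer.
from transformers import pipeline

detector = pipeline("object-detection", model="facebook/detr-resnet-50")

for threshold in (0.5, 0.9):
    detections = detector("/examples/g.jpg", threshold=threshold)
    people = [d for d in detections if d["label"] == "person"]
    print(f"threshold={threshold}: {len(people)} person boxes")
```

A borderline fifth box just above or below the cutoff would explain the 5-vs-4 discrepancy, and either count seems more trustworthy for "how many" questions than the VQA model's "2".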