unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
https://unum-cloud.github.io/uform/
Apache License 2.0

Caption for Driver's License is incorrect #61

Closed (rxjx closed 9 months ago)

rxjx commented 9 months ago

If I run the example captioning code on the first image at https://en.wikipedia.org/wiki/Driver's_licenses_in_the_United_States with max_new_tokens=1024, I get results like:

'A woman in a red jacket stands in front of a map, posing for a picture with a passport in her hand. The passport is on the left side of the image, and the woman is in the center. The map is in the background, providing context for the location.<|im_end|>'

OR

"The image features a postage stamp with a woman's face on it, depicting a woman in a red jacket. The stamp is from the United States and has a denomination of $10. The woman's face is prominently displayed on the stamp, making it a unique and eye-catching design. The stamp is placed in the center of the image, taking up a significant portion of the frame.<|im_end|>"

This is probably attributable to the training dataset, but it is disappointing nevertheless. Also, is there a prompt I could use to extract all the text from an image? Or would I need to fine-tune for that?
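
Side note on the captions quoted in this report: both end with a literal `<|im_end|>` marker. A minimal sketch of trimming it after decoding, assuming the marker arrives as a plain string in the decoded text. `clean_caption` is a hypothetical helper, not part of uform, and it only tidies the string; it does nothing about the incorrect caption itself.

```python
# Hypothetical post-processing helper; not part of the uform API.
END_MARKER = "<|im_end|>"  # marker seen at the end of the captions quoted in this report


def clean_caption(decoded: str, end_marker: str = END_MARKER) -> str:
    """Return the text before the first end marker, with surrounding whitespace removed."""
    # partition() splits on the first occurrence; text without a marker passes through whole.
    text, _, _ = decoded.partition(end_marker)
    return text.strip()


raw = "The map is in the background, providing context for the location.<|im_end|>"
print(clean_caption(raw))
```

Anything after the first marker is dropped, so the helper also discards stray text a model might emit past the end of its answer.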

kimihailv commented 9 months ago

Hello. Hallucination is still a major problem that we are trying to solve. As for text extraction, our models are not suitable for OCR tasks for now.