Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️
If I run the example captioning code on the first image at https://en.wikipedia.org/wiki/Driver's_licenses_in_the_United_States with max_new_tokens=1024, I get results like:
'A woman in a red jacket stands in front of a map, posing for a picture with a passport in her hand. The passport is on the left side of the image, and the woman is in the center. The map is in the background, providing context for the location.<|im_end|>'
OR
"The image features a postage stamp with a woman's face on it, depicting a woman in a red jacket. The stamp is from the United States and has a denomination of $10. The woman's face is prominently displayed on the stamp, making it a unique and eye-catching design. The stamp is placed in the center of the image, taking up a significant portion of the frame.<|im_end|>"
That is probably attributable to the training dataset, but it is disappointing nevertheless. Also, is there a prompt I could use to extract all the text from an image, or would I need to fine-tune for that?
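Side note: both captions come back with a literal <|im_end|> marker at the end. A minimal sketch of trimming it in post-processing, assuming the decoded caption is a plain Python string. strip_end_marker is a hypothetical helper written for illustration, not something the library provides:

```python
END_MARKER = "<|im_end|>"  # end-of-turn marker seen at the end of the captions


def strip_end_marker(caption: str, marker: str = END_MARKER) -> str:
    """Keep the text before the first end marker and trim surrounding whitespace."""
    # Hypothetical helper for illustration; not part of the captioning library.
    head, _, _ = caption.partition(marker)
    return head.strip()


if __name__ == "__main__":
    raw = "The map is in the background, providing context for the location.<|im_end|>"
    print(strip_end_marker(raw))
```

Cutting at the first marker also drops anything the model might emit after it, which seems like the safer default with a large max_new_tokens.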