Rajaniraiyn opened this issue 1 month ago (status: Open)
Sounds like a good enhancement, especially useful for indexing Blender/Photoshop/visual-based tasks.
Highly encourage you to take a crack at implementing it!
Sounds great! I’ll try it out and reach out if I have any questions.
Microsoft's Florence-2 might be a good option: https://huggingface.co/microsoft/Florence-2-large. Other tools are using https://moondream.ai/
Yeah, I've played with moondream, and when I did, it performed quite poorly on screenshots. I had a short interaction with the creator, and it sounded like he was considering trying to tackle screenshots, but at the time the project was focused on scenes (photographs etc.).
I've been keeping an eye out... Closed models (OpenAI / Anthropic) are able to look at a screenshot and build an HTML page from it to some degree, which tells me they have a pretty good understanding of screenshots and would likely perform well here.
Maybe it would be possible to fine-tune moondream on screenshots using a larger model.
A few hours ago Hugging Face released an article on how to fine-tune Florence: https://x.com/mervenoyann/status/1805265942487675139
Is there a plan to incorporate image embeddings along with OCR and metadata-based retrieval? Utilizing the CLIP model from Candle to generate image embeddings could provide clearer context and improve the accuracy of xrem’s results. If performance is a concern, downscaling images before embedding could be a viable solution.
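To make the suggestion above a bit more concrete, here is a rough sketch of the two pieces that don't depend on any particular model: picking a downscaled size before embedding, and ranking stored frames by cosine similarity against a query embedding. It assumes the embeddings were already produced elsewhere (for example by a CLIP model). The function names, the `max_side=224` default, and the toy vectors are all hypothetical and are not taken from the project.

```python
import math


def downscaled_size(width, height, max_side=224):
    """Return (w, h) with the longer side capped at max_side, keeping aspect ratio.

    max_side=224 is an assumed default, not a project setting.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_frames(query_embedding, frame_embeddings, top_k=3):
    """Return the top_k (frame_id, score) pairs, highest similarity first."""
    scored = [
        (frame_id, cosine_similarity(query_embedding, emb))
        for frame_id, emb in frame_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real image embeddings.
    frames = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(downscaled_size(1920, 1080))
    print(rank_frames([1.0, 0.0], frames, top_k=2))
```

In practice the similarity score could be combined with the existing OCR and metadata matches rather than replacing them, but how to weight the two is an open question here.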