Multi-Modal Support for Enhanced Retrieval

jasonjmcghee / xrem

(Cross-Platform) An open source approach to locally record and enable searching everything you view on any computer.

239 stars, 14 forks

Multi-Modal Support for Enhanced Retrieval #12

Status: Open. Opened by Rajaniraiyn 1 month ago

Rajaniraiyn commented 1 month ago

Is there a plan to incorporate image embeddings along with OCR and metadata-based retrieval? Utilizing the CLIP model from Candle to generate image embeddings could provide clearer context and improve the accuracy of xrem’s results. If performance is a concern, downscaling images before embedding could be a viable solution.
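To make the suggestion concrete, here is a minimal sketch in Rust of the two pieces that do not depend on any particular model: computing a downscaled size before embedding, and ranking stored image embeddings against a query by cosine similarity. It is not xrem's code. The function names (`downscaled_size`, `cosine_similarity`, `top_k`) and the `(String, Vec<f32>)` index layout are hypothetical, and the embeddings are assumed to come from a CLIP model loaded elsewhere (for example via Candle), which is not shown.

```rust
/// Hypothetical helper: size that fits inside `max_side` on the longest edge
/// while preserving aspect ratio. Images already small enough are unchanged.
fn downscaled_size(width: u32, height: u32, max_side: u32) -> (u32, u32) {
    let longest = width.max(height);
    if longest == 0 || longest <= max_side {
        return (width, height);
    }
    // Integer scaling (rounds down), clamped so no side collapses to zero.
    let scale = |side: u32| -> u32 {
        let scaled = (side as u64 * max_side as u64) / longest as u64;
        (scaled as u32).max(1)
    };
    (scale(width), scale(height))
}

/// Cosine similarity between two embeddings.
/// Returns 0.0 for mismatched lengths or zero-length vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Rank stored (frame name, embedding) pairs against a query embedding
/// and return the `k` best matches, highest similarity first.
/// The index layout is an assumption made for this sketch.
fn top_k<'a>(
    query: &[f32],
    index: &'a [(String, Vec<f32>)],
    k: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = index
        .iter()
        .map(|(name, emb)| (name.as_str(), cosine_similarity(query, emb)))
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

For a 1920x1080 screenshot, `downscaled_size(1920, 1080, 224)` yields a 224x126 target. The value 224 is only illustrative (it matches the input resolution of common CLIP variants); the right trade-off between speed and legibility of on-screen text would need to be measured.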

jasonjmcghee commented 1 month ago

Sounds like a good enhancement, especially useful for indexing Blender/Photoshop/visual-based tasks.

Highly encourage you to take a crack at implementing it!

Rajaniraiyn commented 1 month ago

Sounds great! I’ll try it out and reach out if I have any questions.

Thawab8 commented 2 weeks ago

Microsoft's Florence-2 might be a good option: https://huggingface.co/microsoft/Florence-2-large. Other tools are using https://moondream.ai/

jasonjmcghee commented 2 weeks ago

Yeah, I've played with moondream and (when I did) it performed quite poorly on screenshots. I had a short interaction with the creator, and it sounded like he was considering trying to tackle screenshots, but the project was focused on scenes at the time (photographs, etc.).

I've been keeping an eye out... Closed models (OpenAI / Anthropic) are able to look at a screenshot and build an HTML page to some degree, which tells me they have a pretty good understanding of screenshots and would perform well.

Maybe a fine-tune of moondream on screenshots, using a larger model, would be possible.

Thawab8 commented 2 weeks ago

A few hours ago, HF released an article on how to fine-tune Florence: https://x.com/mervenoyann/status/1805265942487675139