manticoresoftware / manticoresearch

Easy to use open source fast database for search | Good alternative to Elasticsearch now | Drop-in replacement for E in the ELK soon
https://manticoresearch.com
GNU General Public License v3.0
8.93k stars 495 forks

Automatic embeddings generation #1778

Open sanikolaev opened 8 months ago

sanikolaev commented 8 months ago

As discussed in https://forum.manticoresearch.com/t/search-for-similar-documents/1799/2, it's quite complicated to generate embeddings outside of Manticore Search. It would be great if Manticore could do it automatically. It's worth checking whether Manticore Search can be integrated with https://github.com/microsoft/onnxruntime/ or a similar library.


Checklist

To be completed by the assignee. Check off tasks that have been completed or are not applicable.

- [x] Task estimated
- [x] Bug reproduced
- [ ] Specification created, reviewed and approved
- [ ] Implementation completed
- [ ] Tests developed
- [ ] Documentation updated
- [ ] Documentation proofread
- [ ] Changelog updated
- [ ] OpenAPI YAML updated and issue created to rebuild clients

virtadpt commented 8 months ago

So much this. This is what I'm working on right now.

nickchomey commented 8 months ago

This would be very cool to have built-in.

Though I'd just like to put in a good word for txtai, a Python library/application for all sorts of NLP, vector database work, etc.: https://neuml.github.io/txtai/

It's very easy to use and quite flexible, whereas some of its "competitors" are much more opinionated about how things need to be done.

I'll likely have it set up as an NLP/embeddings processor/server, and then its output will be stored in Manticore and some other data stores.

jhparker88 commented 8 months ago

> great if Manticore could do it automatically. It's worth checking if Manticore Search can be integrated with https://github.com/microsoft/onnxruntime/ or another similar library.

You might want to look at Typesense (a similar search engine provider) to see how they have integrated embedding generation [1] with their search, including model selection etc.

[1] https://typesense.org/docs/0.25.0/api/vector-search.html#option-b-auto-embedding-generation-within-typesense

sanikolaev commented 5 months ago

@donhardman as discussed on today's call, pls write down the suggested architecture of this functionality.

donhardman commented 5 months ago

As an idea, I think a good approach is a .so library written in Rust, reusing what we learned when introducing this in our GitHub issue search demo, with the C code calling a function from that library.

In that case, we'll have a function implemented in Rust and shipped the same way we ship columnar, and the C code of the daemon will call this function whenever it needs to generate embeddings automatically.

It sounds easy to implement since we already have everything we need.

The goal is to adapt the Candle ML framework from Hugging Face, keeping it flexible and customizable so users can choose the best model for their needs.
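
To make the idea concrete, here is a minimal sketch of what a C-callable function exported from such a Rust library might look like. Everything in it is an assumption for illustration: the name `example_embed_text`, its signature, the return codes, and `EMBEDDING_DIM` are hypothetical, and the body uses a toy hashed bag-of-words in place of a real model (it does not call Candle).

```rust
use std::ffi::CStr;
use std::os::raw::{c_char, c_float, c_int};

/// Hypothetical embedding width; a real model would dictate this.
pub const EMBEDDING_DIM: usize = 8;

/// Placeholder "model": hashes each whitespace-separated token into a bucket
/// and L2-normalizes the counts. A real build would run a model here instead.
fn embed_text(text: &str) -> [f32; EMBEDDING_DIM] {
    let mut v = [0.0f32; EMBEDDING_DIM];
    for token in text.split_whitespace() {
        // FNV-1a hash of the lowercased token bytes.
        let mut h: u32 = 2166136261;
        for b in token.to_lowercase().bytes() {
            h ^= b as u32;
            h = h.wrapping_mul(16777619);
        }
        v[(h as usize) % EMBEDDING_DIM] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
    v
}

/// C-callable entry point (hypothetical name and signature).
/// Returns 0 on success, -1 on a null pointer or a too-small buffer,
/// and -2 if the input is not valid UTF-8.
///
/// # Safety
/// `text` must be a NUL-terminated string and `out` must point to
/// at least `out_len` writable floats.
// Note: Rust edition 2024 spells this attribute `#[unsafe(no_mangle)]`.
#[no_mangle]
pub unsafe extern "C" fn example_embed_text(
    text: *const c_char,
    out: *mut c_float,
    out_len: usize,
) -> c_int {
    if text.is_null() || out.is_null() || out_len < EMBEDDING_DIM {
        return -1;
    }
    let s = match CStr::from_ptr(text).to_str() {
        Ok(s) => s,
        Err(_) => return -2,
    };
    let v = embed_text(s);
    // Copy the fixed-size vector into the caller-owned buffer.
    std::ptr::copy_nonoverlapping(v.as_ptr(), out, EMBEDDING_DIM);
    0
}
```

To produce a shared library from this, the crate would set `crate-type = ["cdylib"]` in `Cargo.toml`. The caller owns the output buffer, so no memory allocated by Rust has to be freed on the C side; that is one possible design choice, not necessarily the one the daemon would use.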

sanikolaev commented 5 months ago

The next sub-task is to prepare a syntax specification for the task.

Also, the related issue is https://github.com/manticoresoftware/manticoresearch/issues/2074

sanikolaev commented 4 months ago

Blocked by https://github.com/manticoresoftware/manticoresearch/issues/2074