simon987 / sist2

Lightning-fast file system indexer and search tool
GNU General Public License v3.0

Support RAG with #452

Open prdubois opened 8 months ago

prdubois commented 8 months ago

Similar to #425, it would be nice to support OpenAI embeddings to enable the RAG use case.

simon987 commented 8 months ago

Could you add more details or links? I imagine this can't be self-hosted, and the user would have to put in their OpenAI API key? If so, anyone with access to the sist2 app would be able to drain your account (the query embeddings have to be computed client-side).

prdubois commented 8 months ago

I'm new to Elasticsearch and RAG, so I might not have the best explanation, but I'm basically looking to achieve what is described here: https://cookbook.openai.com/examples/vector_databases/elasticsearch/elasticsearch-retrieval-augmented-generation

A more complete example project is available here as well: https://github.com/elastic/elasticsearch-labs/tree/main/example-apps/chatbot-rag-app

But these two examples use very simple data ingestion pipelines, and I would like to replace that part with sist2. Ideally, the embedding and the splitting of documents into chunks would happen at indexing time (which would require sist2 changes). Perhaps it is also possible to reprocess the index created by sist2 and add the embeddings afterwards, but I have not managed to do it yet.
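To make the idea concrete, here is a minimal sketch of the reprocessing step I have in mind, assuming Elasticsearch 8.x, the OpenAI embeddings API, and text that has already been split into chunks. The index name, mapping and model are placeholders I picked for illustration, not anything sist2-specific:

```python
# Sketch: embed pre-chunked text with the OpenAI API and index it into
# Elasticsearch as a dense_vector field, so it can be retrieved with kNN
# search for RAG. Index name, mapping and model are assumptions.
from openai import OpenAI
from elasticsearch import Elasticsearch

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
es = Elasticsearch("http://localhost:9200")

INDEX = "sist2-rag"               # hypothetical index name
MODEL = "text-embedding-3-small"  # produces 1536-dimensional vectors

if not es.indices.exists(index=INDEX):
    es.indices.create(
        index=INDEX,
        mappings={
            "properties": {
                "path": {"type": "keyword"},
                "content": {"type": "text"},
                "embedding": {
                    "type": "dense_vector",
                    "dims": 1536,
                    "index": True,
                    "similarity": "cosine",
                },
            }
        },
    )

def index_chunks(path, chunks):
    """Embed each text chunk and store it alongside its source path."""
    resp = openai_client.embeddings.create(model=MODEL, input=chunks)
    for chunk, item in zip(chunks, resp.data):
        es.index(index=INDEX, document={
            "path": path,
            "content": chunk,
            "embedding": item.embedding,
        })

def retrieve(question, k=5):
    """kNN search over the embeddings; the hits become the RAG context."""
    q = openai_client.embeddings.create(model=MODEL, input=[question]).data[0].embedding
    return es.search(index=INDEX, knn={
        "field": "embedding",
        "query_vector": q,
        "k": k,
        "num_candidates": 50,
    })
```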

Reprocessing the sist2 index would be my preferred approach, because I envision having indexes from other data sources as well (e.g. Jira), so it would generalize better. Still, adding RAG capabilities directly to sist2 might be worthwhile, which is why I created this feature request.

dpieski commented 8 months ago

FWIW, I am experimenting with exactly that, using the sist2 index produced by the scan step. Since it is just a SQLite database, it is easy to work with in Python, and all the document content has already been extracted; see the sketch below.
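Something like the following is enough to pull the extracted text back out. Note that the table and column names ("document", "json_data", "content") are guesses from memory rather than the actual sist2 schema, so dump the schema first and adjust:

```python
# Minimal sketch for reading extracted text out of a sist2 SQLite index.
# The table/column names below are assumptions; check the real schema first.
import json
import sqlite3

con = sqlite3.connect("my_index.sist2")  # path to the index created by sist2 scan

# Dump the actual schema before trusting any of the names below.
for (ddl,) in con.execute("SELECT sql FROM sqlite_master WHERE type='table'"):
    print(ddl)

# Hypothetical: assumes a "document" table whose "json_data" column holds the
# parsed metadata and extracted text as JSON.
for (raw,) in con.execute("SELECT json_data FROM document"):
    doc = json.loads(raw)
    text = doc.get("content", "")
    # ...split `text` into chunks and hand them to the embedding step...
```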

When sending text to the vector db, there are a lot of optimizations and tweaks you can make in how it is split up (chunk lengths, overlaps, section-aware breaking, etc.).
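A trivial sketch of just the length/overlap knobs (real splitters usually also break on sections or sentences rather than raw character offsets):

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks with a sliding overlap."""
    assert 0 <= overlap < size
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]
```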

Then you also have plenty of vector databases to choose from, each with its own pros and cons depending on the situation.

Also, to avoid relying on third-party services, I am using local embedding models rather than OpenAI.
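For example, with sentence-transformers everything stays local; the model name below is just one common lightweight choice, not a recommendation specific to sist2:

```python
# Local embeddings with sentence-transformers; no API key needed.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # one common lightweight model
chunks = ["first chunk of document text", "second chunk of document text"]
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 384) for this model
```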