Index Loading not working if dumped from different process

sobelio / llm-chain

`llm-chain` is a powerful Rust crate for building chains in large language models, allowing you to summarise text and complete complex tasks.

https://llm-chain.xyz

MIT License

1.3k stars, 128 forks

Index Loading not working if dumped from different process #154

Open KCaverly opened 1 year ago

KCaverly commented 1 year ago

Running the example works fine if you generate, dump, and then load the index in the same process. However, if you generate and dump the index, you cannot reload it in a new process without adding the documents again. Running a query on the loaded index leads to missing-document errors.

Do you have to call `add_documents` again after loading? I believe `add_documents` generates the embeddings itself, so would this not lead to redundant calls to OpenAI, regenerating the same embeddings a second time on load?

Pablo1785 commented 1 year ago

Hi, thank you for filing the issue.

Obviously, I cannot see your code, but I'm assuming you are using the defaults from the example. The issue, it seems to me, is that we do not have a persistent DocumentStore implementation - we only have an InMemoryDocumentStore. So what effectively happens is that the HNSW index itself (the embedded vectors) does get saved to the file, but the documents (the contents) only ever live in your original process and are not saved when the index is dumped.
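To make the split described above concrete, here is a minimal, self-contained sketch. It does not use llm-chain's API: `ToyIndex`, `process_a`, and `process_b` are hypothetical stand-ins, and a brute-force nearest-neighbour search takes the place of HNSW. It only illustrates how dumping the vectors while keeping the document text in an in-memory map produces a missing-document error after a reload.

```rust
use std::collections::HashMap;

/// Toy stand-in for the vector index (hypothetical, not llm-chain's type).
/// It stores (id, embedded vector) pairs and nothing else.
#[derive(Debug, Default, PartialEq)]
struct ToyIndex {
    vectors: Vec<(usize, Vec<f32>)>,
}

impl ToyIndex {
    /// Serialise ids and vectors as `id:x,y,...`, one entry per line.
    fn dump(&self) -> String {
        self.vectors
            .iter()
            .map(|(id, v)| {
                let coords: Vec<String> = v.iter().map(|x| x.to_string()).collect();
                format!("{}:{}", id, coords.join(","))
            })
            .collect::<Vec<_>>()
            .join("\n")
    }

    /// Parse the format written by `dump`; returns `None` on malformed input.
    fn load(s: &str) -> Option<Self> {
        let mut vectors = Vec::new();
        for line in s.lines() {
            let (id, coords) = line.split_once(':')?;
            let id = id.parse().ok()?;
            let v = coords
                .split(',')
                .map(|c| c.parse::<f32>().ok())
                .collect::<Option<Vec<f32>>>()?;
            vectors.push((id, v));
        }
        Some(ToyIndex { vectors })
    }

    /// Id of the stored vector closest to `query` (squared Euclidean distance).
    fn nearest(&self, query: &[f32]) -> Option<usize> {
        self.vectors
            .iter()
            .map(|(id, v)| {
                let d: f32 = v.iter().zip(query).map(|(a, b)| (a - b) * (a - b)).sum();
                (*id, d)
            })
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(id, _)| id)
    }
}

/// "Process A": adds two documents, then dumps only the index.
/// The in-memory document map is dropped when the function returns.
fn process_a() -> String {
    let mut index = ToyIndex::default();
    let mut docs: HashMap<usize, String> = HashMap::new();
    docs.insert(0, "a document about cats".to_string());
    index.vectors.push((0, vec![1.0, 0.0]));
    docs.insert(1, "a document about dogs".to_string());
    index.vectors.push((1, vec![0.0, 1.0]));
    index.dump()
}

/// "Process B": reloads the index but starts with an empty document map,
/// so the nearest id cannot be resolved to any text.
fn process_b(dumped: &str, query: &[f32]) -> Result<String, String> {
    let index = ToyIndex::load(dumped).ok_or("corrupt index")?;
    let docs: HashMap<usize, String> = HashMap::new();
    let id = index.nearest(query).ok_or("empty index")?;
    docs.get(&id)
        .cloned()
        .ok_or_else(|| format!("document {} missing from store", id))
}
```

Calling `process_b(&process_a(), &[0.9, 0.1])` finds the nearest vector in the reloaded index but returns an error, because the text for that id was never written anywhere.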

For now, if you want persistence quickly, I would recommend the QdrantVectorStore.

KCaverly commented 1 year ago

Thanks for the response.

Given this, would it be possible (and would you be open to a PR) to load documents into a vector store without embedding them? This assumes, of course, that the vector store already has the embeddings/index from the loaded .hnsw files.

HNSW indexes are great for a POC: they are lightweight and do not require a full vector DB solution, so I'm hesitant to move to Qdrant at this time.

Pablo1785 commented 1 year ago

Sure, we are always open to new PRs, and this is definitely a blind spot.

I'm not entirely sure that a method which simply adds a doc to the vector store without embedding it would be sound. It could lead to invalid states if the user misuses the API.

I think it might be better to have some dump_docs()/load_docs() type of methods implemented specifically for HNSWVectorStore with InMemoryDocumentStore. This is my first idea, but it is definitely not the only solution.

I think the most important thing is that it should never be possible to have a VectorStore with a Document without an embedded vector, or vice versa - a vector without a corresponding Document.
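As a rough illustration of that idea, here is a standalone sketch of a `dump_docs`/`load_docs` pair that refuses to load unless the document ids match the ids already present in the index. Everything here is hypothetical: the function signatures, the `LoadError` type, the tab-separated text format, and the use of `usize` ids are assumptions for the example, not llm-chain's actual API.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Ways a load can violate the "every document has a vector, and vice versa" rule.
#[derive(Debug, PartialEq)]
enum LoadError {
    /// The line with this 1-based number could not be parsed.
    Malformed(usize),
    /// A dumped document has no vector with this id in the index.
    DocWithoutVector(usize),
    /// The index holds a vector with this id but no document was dumped for it.
    VectorWithoutDoc(usize),
}

/// Escape backslashes and newlines so each document fits on one line.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('\n', "\\n")
}

/// Inverse of `escape`.
fn unescape(s: &str) -> String {
    let mut out = String::new();
    let mut chars = s.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some(other) => out.push(other),
                None => out.push('\\'),
            }
        } else {
            out.push(c);
        }
    }
    out
}

/// Write each document as `id<TAB>escaped text`, one per line.
fn dump_docs(docs: &BTreeMap<usize, String>) -> String {
    docs.iter()
        .map(|(id, text)| format!("{}\t{}", id, escape(text)))
        .collect::<Vec<_>>()
        .join("\n")
}

/// Read documents back, checking them against the ids found in the loaded index.
/// No embedding is computed here; the vectors are assumed to come from the index dump.
fn load_docs(
    dumped: &str,
    index_ids: &BTreeSet<usize>,
) -> Result<BTreeMap<usize, String>, LoadError> {
    let mut docs = BTreeMap::new();
    for (n, line) in dumped.split('\n').enumerate() {
        if line.is_empty() {
            continue; // tolerate blank lines such as a trailing newline
        }
        let (id, text) = line.split_once('\t').ok_or(LoadError::Malformed(n + 1))?;
        let id: usize = id.parse().map_err(|_| LoadError::Malformed(n + 1))?;
        if !index_ids.contains(&id) {
            return Err(LoadError::DocWithoutVector(id));
        }
        docs.insert(id, unescape(text));
    }
    // Every vector in the index must have received a document.
    if let Some(&id) = index_ids.iter().find(|id| !docs.contains_key(*id)) {
        return Err(LoadError::VectorWithoutDoc(id));
    }
    Ok(docs)
}
```

Because the check runs in both directions, a caller gets either a document set that matches the index exactly or an error, and never a half-populated store. A real implementation would also need to decide how the index exposes its ids and which serialisation format to use.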