langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Remove duplication when creating and updating FAISS vector store #3896

Closed (sunson2k closed this 10 months ago)

sunson2k commented 1 year ago

FAISS.add_texts and FAISS.merge_from don't check for duplicate document contents; they always add the contents to the vector store.

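from langchain.vectorstores import FAISS

# `embeddings` is assumed to be an already constructed LangChain embeddings object
test_db = FAISS.from_texts(['text 2'], embeddings)
test_db.add_texts(['text 1', 'text 2', 'text 1'])
print(test_db.index_to_docstore_id)
test_db.docstore._dict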

Note that 'text 1' and 'text 2' each end up stored twice, under different indices.

{0: '12a6a477-db74-4d90-b843-4cd872e070a0', 1: 'a3171e0e-f12a-418f-9994-5625550de73e', 2: '543f8fcf-bf84-4d9e-a6a9-f87fda0afcc3', 3: 'ed320a75-775f-4ec2-ae0b-fef8fa8d0bfe'}
{'12a6a477-db74-4d90-b843-4cd872e070a0': Document(page_content='text 2', lookup_str='', metadata={}, lookup_index=0),
 'a3171e0e-f12a-418f-9994-5625550de73e': Document(page_content='text 1', lookup_str='', metadata={}, lookup_index=0),
 '543f8fcf-bf84-4d9e-a6a9-f87fda0afcc3': Document(page_content='text 2', lookup_str='', metadata={}, lookup_index=0),
 'ed320a75-775f-4ec2-ae0b-fef8fa8d0bfe': Document(page_content='text 1', lookup_str='', metadata={}, lookup_index=0)}

The embeddings of the duplicated entries are the same as well. For example, comparing indices 0 and 2 (both 'text 2'):

np.dot(test_db.index.reconstruct(0), test_db.index.reconstruct(2))
1.0000001

Expected behavior: similar to a database upsert. Create a new index entry if the key (content or embedding) doesn't exist; otherwise update the value (the document metadata, in this case).
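To make the requested upsert semantics concrete, here is a minimal sketch in plain Python. It is not the FAISS or LangChain API: the dict `store` stands in for the docstore, and `content_key` and `upsert_texts` are hypothetical helper names. It also assumes duplicates are detected by exact text match, not by embedding similarity.

```python
import hashlib


def content_key(text: str) -> str:
    """Derive a stable key from the text (assumption: exact-match dedup)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_texts(store: dict, texts, metadatas=None):
    """Insert unseen texts; for texts already present, only update metadata.

    `store` is a plain dict standing in for the docstore (hypothetical).
    Returns the keys that were newly inserted, in insertion order.
    """
    metadatas = metadatas or [{} for _ in texts]
    inserted = []
    for text, metadata in zip(texts, metadatas):
        key = content_key(text)
        if key in store:
            # Key exists: update the value (metadata) instead of adding a copy.
            store[key]["metadata"].update(metadata)
        else:
            # Key is new: create the entry.
            store[key] = {"page_content": text, "metadata": dict(metadata)}
            inserted.append(key)
    return inserted


store = {}
upsert_texts(store, ["text 2"])
new_keys = upsert_texts(
    store,
    ["text 1", "text 2", "text 1"],
    [{"v": 1}, {"v": 2}, {"v": 3}],
)
print(len(store), len(new_keys))
```

With the same inputs as the report above, the store ends up with two entries instead of four, and only 'text 1' counts as newly inserted. A real implementation would also need to skip embedding the duplicates and keep the FAISS index in sync with the docstore, which this sketch does not attempt.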

I'm pretty new to LangChain, so if I'm missing something or doing it wrong, apologies, and please suggest the best practice for dealing with duplication in LangChain FAISS. Otherwise, I hope this is useful feedback, thanks!

atisharma commented 1 year ago

I'd also like this. With the chromadb interface I can specify ids on insert, which gives me some protection against duplication. However, langchain just sets the id of each entry to a random uuid4, which means you can never detect duplicates.
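One way to picture the difference between random and caller-supplied ids: if the id is derived from the content (here with `uuid.uuid5` from the standard library), the same text always maps to the same id, so duplicates can be filtered before insertion. This is only a sketch; `deterministic_id` and `filter_new` are hypothetical names, and the choice of `uuid.NAMESPACE_OID` as the namespace is arbitrary.

```python
import uuid


def deterministic_id(text: str) -> str:
    """Same text always yields the same id (unlike uuid4, which is random)."""
    return str(uuid.uuid5(uuid.NAMESPACE_OID, text))


def filter_new(texts, existing_ids):
    """Return (texts, ids) for entries not already stored or seen in this batch."""
    seen = set(existing_ids)
    fresh_texts, fresh_ids = [], []
    for text in texts:
        doc_id = deterministic_id(text)
        if doc_id in seen:
            continue  # duplicate of a stored entry or of an earlier batch item
        seen.add(doc_id)
        fresh_texts.append(text)
        fresh_ids.append(doc_id)
    return fresh_texts, fresh_ids


# The store already holds 'text 2'; the batch repeats 'text 1' and 'text 2'.
existing = [deterministic_id("text 2")]
texts, ids = filter_new(["text 1", "text 2", "text 1"], existing)
print(texts)
```

The filtered texts and ids could then be handed to any store that accepts caller-supplied ids on insert. Whether a given vector store's add method takes such an argument depends on the store and the library version, so that needs checking.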

eyurtsev commented 1 year ago

https://github.com/langchain-ai/langchain/pull/9614 -- will solve deduplication for vectorstores implementing add and delete interfaces

dosubot[bot] commented 11 months ago

Hi, @sunson2k! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding, the issue you reported is about the duplication of document contents and embeddings when using FAISS.add_texts and FAISS.merge_from. It seems that the expected behavior is to create a new index entry if the content or embedding doesn't exist, and update the value (document metadata) if it does. atisharma and neldivad also expressed their interest in having this feature. Additionally, eyurtsev mentioned that a pull request (https://github.com/langchain-ai/langchain/pull/9614) will solve the deduplication issue for vectorstores implementing add and delete interfaces.

If this issue is still relevant to the latest version of the LangChain repository, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain project! Let me know if there's anything else I can assist you with.