langchain-ai / rag-from-scratch


Duplicated records in Chroma vectorstore after multiple cell executions #1

Open labdmitriy opened 7 months ago

labdmitriy commented 7 months ago

Hi @rlancemartin,

First of all, thanks a lot for this series of lessons!

This is probably a known fact, but it was not clear to me when I first ran into it: if we run the cell with this code from your Jupyter notebook for Lessons 1-4 multiple (for example, k) times:

vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()

Then there will be k duplicated records for each original record, because this method adds documents even if the collection already exists. We can check this with, for example, the following code:

vectorstore_data = vectorstore.get()
print(len(vectorstore_data['documents']))
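The total count only shows that the collection has grown. To see which records are repeated, the returned texts can be counted. This is a minimal sketch: `count_duplicates` is a hypothetical helper, and `sample_documents` stands in for `vectorstore_data['documents']`, assuming that value is a list of strings.

```python
from collections import Counter


def count_duplicates(documents):
    """Return {text: count} for every text that appears more than once."""
    counts = Counter(documents)
    return {text: n for text, n in counts.items() if n > 1}


# Stand-in for vectorstore_data['documents'] after running the cell twice.
sample_documents = ["chunk A", "chunk B", "chunk A", "chunk B"]

duplicates = count_duplicates(sample_documents)
print(duplicates)  # {'chunk A': 2, 'chunk B': 2}
```

An empty result means every stored text is unique.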

As far as I remember, I saw similar behavior with the LangChain wrapper for the Weaviate database.

So, as a quick workaround, we can remove the default collection (which is named "langchain") before we add documents:

collection_name = 'langchain'
Chroma(collection_name=collection_name).delete_collection()
vectorstore = Chroma.from_documents(documents=splits, 
                                    embedding=embeddings)
retriever = vectorstore.as_retriever()
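A possible alternative that keeps the existing collection is to give each chunk a deterministic ID derived from its text, so that a re-run writes to the same IDs instead of creating new records. The sketch below only computes the IDs. `content_ids` is a hypothetical helper, and it is an assumption (not checked against the notebook's library versions) that `Chroma.from_documents` accepts an `ids` argument and overwrites records whose IDs already exist. Note that two chunks with identical text would get the same ID, so this assumes chunk texts are distinct.

```python
import hashlib


def content_ids(texts):
    """Derive a stable ID from each chunk's text so re-runs yield the same IDs."""
    return [hashlib.sha256(text.encode("utf-8")).hexdigest() for text in texts]


# In the notebook this would be: texts = [doc.page_content for doc in splits]
texts = ["chunk A", "chunk B"]
ids = content_ids(texts)

# Assumed usage (not executed here):
# vectorstore = Chroma.from_documents(documents=splits,
#                                     embedding=embeddings,
#                                     ids=ids)
```

Because the IDs depend only on the text, running the cell again produces exactly the same list.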

Since there are no warnings or errors about the existing collection, this behavior may not be noticed immediately, so I hope this will be useful to someone.

P.S. I also noticed that in Part 4 here, 4 documents are retrieved, and 2 of them are duplicates of the others.

Thank you.

rlancemartin commented 7 months ago

Yes! This is a good call out.

I will add a note in the notebooks on this.