langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

Bug in Indexing Function Causes Inconsistent Document Deletion #22135

Open ericvaillancourt opened 4 months ago

ericvaillancourt commented 4 months ago


Example Code

Here is an example that demonstrates the problem:

If I change the batch_size in api.py to a value larger than the number of elements in my list, everything works fine. By default, batch_size is set to 100, and only the first 100 elements are handled correctly. (A workaround sketch follows the example.)

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain.indexes import SQLRecordManager, index

embeddings = OpenAIEmbeddings()

# Build 200 small documents that all share the same source ("test.txt").
documents = []
for i in range(1, 201):
    page_content = f"data {i}"
    metadata = {"source": "test.txt"}
    document = Document(page_content=page_content, metadata=metadata)
    documents.append(document)

collection_name = "test_index"

vectorstore = Chroma(
    collection_name=collection_name,
    persist_directory="emb",
    embedding_function=embeddings,
)

namespace = f"chroma/{collection_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)

record_manager.create_schema()

idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# for the first run
# should be : {'num_added': 200, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
# and that's what we get.
print(idx)
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)
# for the second run
# should be : {'num_added': 0, 'num_updated': 0, 'num_skipped': 200, 'num_deleted': 0}
# but we get : {'num_added': 100, 'num_updated': 0, 'num_skipped': 100, 'num_deleted': 100}
print(idx)
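
As a workaround, passing a batch_size larger than the number of documents for any single source keeps them all in one batch, so the second run skips everything as expected. This is a minimal sketch built on the example above; batch_size is a keyword argument of index() (default 100) in the versions listed under System Info:

# Workaround sketch: keep all documents for "test.txt" in a single batch
# by raising batch_size above the total document count.
idx = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
    batch_size=len(documents),
)
# Second run now reports:
# {'num_added': 0, 'num_updated': 0, 'num_skipped': 200, 'num_deleted': 0}
print(idx)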

Error Message and Stack Trace (if applicable)

No response

Description

I've encountered a bug in LangChain's index function when processing documents. The function behaves inconsistently across multiple runs, leading to unexpected deletions of documents. Specifically, when running the function twice in a row without any changes to the data, the first run indexes all documents as expected. On the second run, however, only the first batch of documents (batch_size=100) is correctly identified as already indexed and skipped, while the remaining documents are mistakenly deleted and re-indexed.

System Info

langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.7
langchain-postgres==0.0.4
langchain-text-splitters==0.0.2
langgraph==0.0.32
langsmith==0.1.59

Python 3.11.7

Platform: Windows 11

eyurtsev commented 4 months ago

Apologies, the documentation is out of date on this. For the indexing function to be able to completely avoid redundant work, all the docs corresponding to a particular source need to be in the same batch. I'll try to update the documentation.

If that criterion isn't met, it'll end up doing some redundant work, but it should still result in the correct end state. The indexing logic optimizes for minimizing the amount of time that duplicated content exists in the index.
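
One way to satisfy that constraint without raising batch_size globally is to group the documents by source and index each group on its own, sized so the whole group fits in a single batch. This is only a sketch, not LangChain API: index_by_source is a hypothetical helper, and it assumes each document stores its source under metadata[source_id_key]:

from collections import defaultdict

from langchain.indexes import index

def index_by_source(docs, record_manager, vectorstore, source_id_key="source"):
    # Group documents so that every doc for a given source lands in one batch.
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.metadata[source_id_key]].append(doc)
    results = []
    for source, group in groups.items():
        results.append(
            index(
                group,
                record_manager,
                vectorstore,
                cleanup="incremental",
                source_id_key=source_id_key,
                batch_size=max(len(group), 100),  # whole source in a single batch
            )
        )
    return results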

ericvaillancourt commented 4 months ago

OK, but the batch size is set to 100. What if one source has more than 100 docs? The end result is still OK, but does it re-calculate the embeddings?

magaton commented 4 months ago

Hello, I am also hitting this problem. If I do not increase batch_size in the indexer to be greater than the number of documents, I get deletes and adds even though I did not change anything in the directory I am loading. If the batch size is greater than the number of loaded documents, everything is skipped and works fine.

So something seems not to be right here.

ericvaillancourt commented 3 months ago

I created my own indexing system to solve the problem. It is a bit more sophisticated because it is meant to be used with a multi-vector retriever. I have written an article about it on Medium.

You can find the code on my GitHub,

and watch the video on YouTube.

federico-pisanu commented 1 month ago

Hi! I also ran into this problem and worked on a solution for this issue in this PR. @eyurtsev, I hope this can be helpful.