Open ericvaillancourt opened 4 months ago
Apologies, the documentation is out of date on this. For the indexing function to be able to completely avoid redundant work, all the docs corresponding to a particular source need to be in the same batch. I'll try to update the documentation.
If that criterion isn't met, the function will end up doing some redundant work, but it should still reach the correct end state. The indexing logic is optimized to minimize the amount of time that duplicated content exists in the index.
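To check whether a given document list meets the "same batch" condition described above, a small diagnostic can help. This is only a sketch: the helper name `sources_split_across_batches` is hypothetical (it is not part of LangChain), and it assumes the indexer cuts batches purely by position in the input order.

```python
from collections import defaultdict


def sources_split_across_batches(source_ids, batch_size):
    """Return {source_id: [batch numbers]} for sources that span more than one batch.

    Assumption: documents are batched by position, i.e. document i lands in
    batch i // batch_size.
    """
    batches_per_source = defaultdict(set)
    for position, source_id in enumerate(source_ids):
        batches_per_source[source_id].add(position // batch_size)
    return {
        source_id: sorted(batches)
        for source_id, batches in batches_per_source.items()
        if len(batches) > 1
    }


# Illustrative data: 150 chunks from one file, 30 from another.
source_ids = ["a.txt"] * 150 + ["b.txt"] * 30

split = sources_split_across_batches(source_ids, batch_size=100)
print(split)  # {'a.txt': [0, 1]}

# With a batch size covering the whole list, no source is split.
print(sources_split_across_batches(source_ids, batch_size=200))  # {}
```

An empty result suggests the batch size is large enough for the list as ordered; a non-empty result names the sources that are likely to trigger redundant work.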
OK, but the batch size is set to 100. What if one source has more than 100 docs? The end result is still OK, but does it recalculate the embeddings?
Hello, I am also hitting this problem. If I do not increase batch_size in the indexer to be greater than the number of documents, I get deletes and adds even though I did not change anything in the directory I am loading. If the batch size is greater than the number of loaded documents, the skip happens and everything is fine.
So, something seems not to be right here.
Hi! I also ran into this problem and worked on a solution for this issue in this PR. @eyurtsev, I hope this can be helpful.
Example Code
Here is an example that demonstrates the problem:
If I change the batch_size in api.py to a value that is larger than the number of elements in my list, everything works fine. By default, the batch_size is set to 100, and only the first 100 elements are handled correctly.
Error Message and Stack Trace (if applicable)
No response
Description
I've encountered a bug in the index function of LangChain when processing documents. The function behaves inconsistently across multiple runs, leading to unexpected deletions of documents. Specifically, when running the function twice in a row without any changes to the data, the first run indexes all documents as expected. However, on the second run, only the first batch of documents (batch_size=100) is correctly identified as already indexed and skipped, while the remaining documents are mistakenly deleted and re-indexed.
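The behavior described here can be reproduced with a toy model of per-batch cleanup. This is a simplified sketch, not LangChain's actual implementation: `simulate_index`, the `records` dict, and the integer `run_id` standing in for a timestamp are all hypothetical, and the model assumes that after each batch the indexer deletes every record from that batch's sources that has not yet been refreshed in the current run.

```python
def simulate_index(docs, records, run_id, batch_size=100):
    """Toy model of indexing with per-batch cleanup of stale records.

    docs:    list of (uid, source_id) tuples
    records: dict mapping uid -> (source_id, last_run_seen); mutated in place
    """
    stats = {"added": 0, "skipped": 0, "deleted": 0}
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]

        # Skip documents already recorded; add (and embed) the rest.
        for uid, source_id in batch:
            if uid in records:
                stats["skipped"] += 1
            else:
                stats["added"] += 1
            records[uid] = (source_id, run_id)

        # Cleanup: drop records from this batch's sources that were last
        # seen in an earlier run, even if they appear in a later batch.
        batch_sources = {source_id for _, source_id in batch}
        stale = [
            uid
            for uid, (source_id, last_seen) in records.items()
            if source_id in batch_sources and last_seen < run_id
        ]
        for uid in stale:
            del records[uid]
        stats["deleted"] += len(stale)
    return stats


# 250 unchanged documents that all share one source.
docs = [(f"doc-{i}", "folder/report.txt") for i in range(250)]

records = {}
first = simulate_index(docs, records, run_id=1)
second = simulate_index(docs, records, run_id=2)
print(first)   # {'added': 250, 'skipped': 0, 'deleted': 0}
print(second)  # {'added': 150, 'skipped': 100, 'deleted': 150}

# Same data with a batch size that covers every document.
records_big = {}
simulate_index(docs, records_big, run_id=1, batch_size=250)
rerun = simulate_index(docs, records_big, run_id=2, batch_size=250)
print(rerun)   # {'added': 0, 'skipped': 250, 'deleted': 0}
```

In this model the second run skips the first 100 documents, deletes the other 150 as stale before their batches are reached, and then adds them again, which matches the reported symptom. The final state is still correct, but the re-added documents would need to be embedded again.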
System Info
langchain==0.1.20
langchain-community==0.0.38
langchain-core==0.1.52
langchain-openai==0.1.7
langchain-postgres==0.0.4
langchain-text-splitters==0.0.2
langgraph==0.0.32
langsmith==0.1.59
Python 3.11.7
Platform : Windows 11