@MohammedShokr thanks for reporting.
There isn't enough context here to determine whether the bug is in the SQLRecordManager, in another LangChain component, or in user code.
If you're able to isolate the problem, please provide a minimal reproducible script. It should contain all relevant imports and data to index.
Hi @eyurtsev, I understand your concern about isolating the issue. Unfortunately, I'm unable to share the data as it's confidential. However, I've taken measures to ensure consistency: I saved the docs before indexing as a pickle file from different runs and compared them, both page content and metadata, and they are identical.
Update: I've identified the root cause of the issue. It stems from the order of the document list and the batch size. Since the summaries are appended to the end of the document list, they end up being indexed in a separate batch, which triggers updates. To address this, I've increased the batch size to cover all my documents. However, I'm still wondering how to ensure that all chunks of a document are consistently included in the same batch. Any insights or suggestions on how to achieve this would be greatly appreciated.
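For illustration, one way to keep each document's chunks and its appended summary adjacent is to sort the combined list by source id before indexing (chunks and summaries here are hypothetical names, not the actual pipeline):

all_docs = chunks + summaries  # summaries were appended at the end
# Sorting by source id places each summary next to its document's chunks,
# making it far more likely they land in the same batch.
all_docs.sort(key=lambda d: str(d.metadata["source"]))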
@MohammedShokr for a minimal reproducible example, it's enough to seed this with a test case involving fake data.
documents = [
    Document(page_content='hello', metadata={'source': 1}),
    Document(page_content='goodbye', metadata={'source': 2}),
    Document(page_content='meow', metadata={'source': 3}),
    Document(page_content='woof', metadata={'source': 1}),
]
Is the claim that indexing this with a batch size of 2 creates incorrect results? If you're able to create a test case like that, together with information about what you see vs. what you expect to see, that's very helpful for us to fix the issue.
Here's a script to reproduce the issue. Run this script twice and you will see that the record manager re-ingests the first document because one of its chunks falls into a different batch.
from langchain.indexes import SQLRecordManager, index
from langchain.docstore.document import Document
from langchain_weaviate import WeaviateVectorStore
from langchain_openai import OpenAIEmbeddings
from dotenv import load_dotenv
import os
import weaviate
# Load env variables
_ = load_dotenv()
WEAVIATE_API_KEY = os.environ["WEAVIATE_API_KEY"]
WEAVIATE_HTTP_HOST = os.environ["WEAVIATE_HTTP_HOST"]
WEAVIATE_HTTP_PORT = os.environ["WEAVIATE_HTTP_PORT"]
WEAVIATE_GRPC_HOST = os.environ["WEAVIATE_GRPC_HOST"]
WEAVIATE_GRPC_PORT = os.environ["WEAVIATE_GRPC_PORT"]
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
# Create a Weaviate client and connect to it
client = weaviate.client.WeaviateClient(
    connection_params=weaviate.connect.ConnectionParams.from_params(
        http_host=WEAVIATE_HTTP_HOST, http_port=WEAVIATE_HTTP_PORT, http_secure=False,
        grpc_host=WEAVIATE_GRPC_HOST, grpc_port=WEAVIATE_GRPC_PORT, grpc_secure=False,
    ),
    auth_client_secret=weaviate.auth.AuthApiKey(api_key=WEAVIATE_API_KEY),
)
client.connect()
# Create a Weaviate vectorstore
index_name = "Test_index"
vectorstore = WeaviateVectorStore(
    client,
    index_name=index_name,
    embedding=OpenAIEmbeddings(),
    text_key="text",
)
# Create the record manager
namespace = f"weaviate/{index_name}"
record_manager = SQLRecordManager(
    namespace, db_url="sqlite:///record_manager_cache.sql"
)
record_manager.create_schema()
# Define the documents
documents = [
    Document(page_content='hello', metadata={'source': 1}),
    Document(page_content='goodbye', metadata={'source': 2}),
    Document(page_content='meow', metadata={'source': 3}),
    Document(page_content='woof', metadata={'source': 1}),
]
# Index the documents
results = index(
    documents,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
    batch_size=2,
)
client.close()
print(results)
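For reference, and assuming a fresh index and no concurrent writers (these numbers are my expectation, not output captured in the thread), the two runs should report something like:

First run:  {'num_added': 4, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Second run: {'num_added': 1, 'num_updated': 0, 'num_skipped': 3, 'num_deleted': 1}

On the second run, the cleanup that follows the first batch deletes the previously indexed 'woof' document, which the second batch then has to re-add.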
OK, I recreated the issue.
incremental mode was written to support continuous cleanup (i.e., to minimize the amount of time during which duplicated content exists in the index). This only really works if all content derived from the same document is present in the same batch.
If this condition is not met, the indexing code cannot avoid some redundant work (i.e., it will end up forcefully re-indexing content that it should have skipped). The end state of the index is still correct (as long as there was no network failure in the middle, etc.).
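To make this concrete, here is a small illustration of my own (reusing the documents list from the reproduction script; the cleanup annotations reflect a second run against an already-populated index):

batch_size = 2
batches = [documents[i:i + batch_size] for i in range(0, len(documents), batch_size)]
for n, batch in enumerate(batches, start=1):
    print(f"batch {n}:", [(d.page_content, d.metadata["source"]) for d in batch])
# batch 1: [('hello', 1), ('goodbye', 2)] -> incremental cleanup for sources {1, 2}
#          deletes the previously indexed 'woof' (source 1), absent from this batch
# batch 2: [('meow', 3), ('woof', 1)]     -> 'woof' has to be re-added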
What I need to do is add a mode that sits between incremental and full mode. This mode will not be able to optimize for duplicated content present in the index between batches.
Thank you for clarifying the situation, @eyurtsev. I'm considering increasing the batch_size to cover all my documents (I need to re-index around 5k docs every day). However, I'm hesitant due to potential consequences. Could you please confirm whether increasing the batch size will effectively mitigate the redundant work, at the cost of slowing down the cleanup process?
That's correct: it will decrease redundant work, but increase the window during which duplicates might exist.
You can handle the issue entirely on your side by grouping documents that share the same source id into the same batch, and controlling the batch size dynamically.
I haven't checked, but I hope that the indexing API works without a batch size; if that's the case, you should be able to fully control the indexing behavior without the oddity of having to dynamically calculate a batch size.
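A minimal sketch of that grouping approach (my own illustration, reusing documents, record_manager, and vectorstore from the script above):

from collections import defaultdict

by_source = defaultdict(list)
for doc in documents:
    by_source[doc.metadata["source"]].append(doc)

for source_id, docs in by_source.items():
    # All chunks sharing a source id go through index() in a single call,
    # so the incremental cleanup never sees a partially present document.
    index(
        docs,
        record_manager,
        vectorstore,
        cleanup="incremental",
        source_id_key="source",
        batch_size=len(docs),
    )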
Thank you for the insight! Closing the issue now.
Description
When indexing a list of documents with the record manager in incremental deletion mode, with each document assigned a unique identifier (UUID) as its source, I encounter unexpected behavior: the record manager deletes and re-indexes a subset of documents even though those documents have not changed. Rerunning the same code with identical documents outputs
{'num_added': 80, 'num_updated': 0, 'num_skipped': 525, 'num_deleted': 80}.
Furthermore, I am using a recursive text splitter to segment the documents; I also generate a summary for each document and set the summary's metadata source to that of the original document, so the summary is treated as a chunk of the original document.
Finally, please note that I tried the same code on different sets of documents, and the issue does not reproduce consistently.