Closed zhugexinxin closed 11 months ago
Answer generated by a 🤖
I understand that you're experiencing a memory overflow issue when processing large documents using the `Qdrant.from_documents` method in your LangChain application. This issue seems to occur before the embeddings interface is called, and you've observed it across different queue tools, suggesting that it's not specific to any particular tool.
The issue appears to be related to the size of the documents you're processing. When the `Qdrant.from_documents` method tries to process all the documents at once, it can lead to high memory usage and eventually a memory overflow, especially with large documents.
One potential solution to this problem is to process the documents in smaller batches. This can be done by modifying the call to `Qdrant.from_documents` so that it handles a subset of the documents at a time, which should help reduce the memory usage.
Here's how you can modify your code to process the documents in batches:
```python
def handle_embeddings(payload):
    loader = UnstructuredPDFLoader(payload["filepath"])
    documents = loader.load()
    text_splitter = SpacyTextSplitter(pipeline=payload["language"], chunk_size=1536, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)
    embeddings = OpenAIEmbeddings(**payload)

    # Process documents in batches
    batch_size = 10000  # Adjust this value based on your system's memory capacity
    for i in range(0, len(docs), batch_size):
        batch_docs = docs[i:i + batch_size]
        qdrant = Qdrant.from_documents(
            batch_docs, embeddings,
            path=path,
            collection_name=collection_name,
        )
```
In this modified version, the documents are divided into batches of a size that you can specify based on your system's memory capacity, and each batch is processed separately. This approach should significantly reduce the memory usage and help prevent the memory overflow issue.
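As an aside, the batching loop above can be factored into a small reusable helper. This is just an illustrative sketch (the name `batched` is mine, and Python 3.12's `itertools.batched` offers similar behavior for arbitrary iterables):

```python
def batched(items, batch_size):
    """Yield successive slices of `items`, each at most `batch_size` elements long."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each slice can then be passed to `Qdrant.from_documents` (or, if your version supports it, to `add_documents` on an existing store) so that only one batch's worth of documents is resident at a time.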
Please note that this is just one potential solution, and there might be other factors at play. If you're using an older version of LangChain, I would also recommend upgrading to the latest version to ensure you have all the recent optimizations and bug fixes.
If this solution works for you and you believe it's a bug that could impact other users, I encourage you to make a pull request with the proposed changes. Your contribution could help improve the LangChain framework for everyone.
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.
Sources:
- `langchain/vectorstores/qdrant.py`
Can this solution add to the same `collection_name`? It seems to re-embed everything every time, which wastes tokens.
```python
docs = text_splitter.split_documents(documents)
```
Memory also overflows and crashes here when splitting files that are too large.
Yes, I've just opened the following PR, which adds the ability to add vectors to an existing collection in an automated way: #7334
Thank you very much, but I have a question.
"I have a scenario where I am dividing a PDF book into 319 chunks. I anticipate that `Qdrant.from_documents` will initiate 319 embeddings API calls. However, in every experiment, I've observed that it only initiates the embeddings API call once with approximately 1000 tokens, and then sends the remainder all at once. This can cause issues due to size limits. I have looked at the source code and my expectation is that it should send the chunks in separate API calls. I'm unable to understand why this is happening. Can you please explain?"
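For what it's worth, `OpenAIEmbeddings` batches texts client-side via its `chunk_size` parameter (if I recall correctly, the default is 1000 texts per request), which could explain seeing far fewer API calls than chunks. A rough sketch of the resulting request count; the helper name and the default of 1000 are my assumptions, not something stated in this thread:

```python
import math

def num_embedding_batches(n_texts, chunk_size=1000):
    """Estimate how many embedding requests OpenAIEmbeddings would send,
    assuming it groups `chunk_size` texts per request (assumed default: 1000)."""
    return math.ceil(n_texts / chunk_size)
```

Under that assumption, 319 chunks fit into a single request; lowering `chunk_size` when constructing `OpenAIEmbeddings` should spread them over more, smaller calls.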
mh.. not sure if I can help you with that. To me it's unclear if your issue is caused by one or multiple of the following:
- batch size
- memory limits
- api rate limits exceeded (RPM/TPM)
- issues with the file itself
- issues with the host machine (e.g. high CPU load, low memory)

So I guess you need to dig a bit deeper and debug. I would propose starting with a smaller file, e.g. manually splitting the file into multiple files and feeding the smaller files (piece by piece) into the system. If that works, start increasing the file size (by file size I mean the amount of characters/pages).

In my case I am using a different approach, with the following code as described here: https://python.langchain.com/docs/modules/data_connection/retrievers PS: I've indexed PDF files with up to 150 pages without an issue. Maybe worth a try? idk

```python
VectorstoreIndexCreator(
    vectorstore_cls=Qdrant,
    embedding=OpenAIEmbeddings(),
    text_splitter=SpacyTextSplitter(pipeline=payload["language"], chunk_size=1536, chunk_overlap=0),
    vectorstore_kwargs=dict(
        host=qdrant,
        port=6333,
        grpc_port=6334,
        prefer_grpc=True,
        collection_name=collection_name,
        recreate_collection=False,  # <-- this is new, and only available after my PR was committed to main
        vector_size=1536,  # <-- this is new, and only available after my PR was committed to main
    ),
).from_documents(documents)
```
I've attempted to consider the various factors that might be causing my issue.
Here is the code I'm executing:
```python
loader = UnstructuredPDFLoader(filepath)
documents = loader.load()
text_splitter = TokenTextSplitter(chunk_size=1536, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# len(docs) == 319
embeddings = OpenAIEmbeddings(**payload)
qdrant = Qdrant.from_documents(
    docs, embeddings,
    path=vectorstore_path,
    collection_name=collection_name,
)
```
Additionally, I would like to highlight that there seems to be a memory overflow issue with `SpacyTextSplitter` when you have a very large document and you're using the Chinese language model: memory increases dramatically until it overflows and the program exits.
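If the crash happens inside the splitter itself, one possible workaround is to pre-split the raw text on paragraph boundaries before handing the pieces to the spaCy pipeline, so the Chinese model never sees the whole document at once. `presplit` below is an illustrative helper of mine, not a LangChain API:

```python
def presplit(text, max_chars=100_000):
    """Group paragraphs into pieces of roughly at most max_chars characters.
    A single paragraph longer than max_chars is kept whole."""
    parts, buf, size = [], [], 0
    for para in text.split("\n\n"):
        if buf and size + len(para) > max_chars:
            parts.append("\n\n".join(buf))
            buf, size = [], 0
        buf.append(para)
        size += len(para) + 2  # +2 accounts for the "\n\n" separator
    if buf:
        parts.append("\n\n".join(buf))
    return parts
```

Each returned piece could then be run through `SpacyTextSplitter` separately, keeping spaCy's working set bounded.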
After executing, the result I get from the official API is as follows:
```
2023/7/8 17:28:21 | text-embedding-ada-002 Model Request, total tokens: 344317
2023/7/8 17:28:15 | text-embedding-ada-002 Model Request, total tokens: 813
```
@rlancemartin any ideas if this could be a memory leak?
Hi, @zhugexinxin! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
Based on my understanding of the current state of the issue, you are experiencing a memory overflow problem when using the `Qdrant.from_documents` method with large documents. There has been a suggestion to process the documents in smaller batches to reduce memory usage. Additionally, there is a discussion about adding the ability to add vectors to an existing collection, and a suggestion to try using the `VectorstoreIndexCreator` class as an alternative approach.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your understanding and contribution to the LangChain project!
Issue you'd like to raise.
Due to the time-consuming nature of computing embeddings and storing them in a vector database, I opted for asynchronous queue tasks to handle them. However, I noticed that when processing documents of size 30 MB, the memory usage of the queue task kept increasing until it eventually crashed due to overflow. At this point, I investigated and found that the memory overflow occurred even before the embeddings interface was called, indicating that the issue lies in the `Qdrant.from_documents` method. I have been searching for the root cause for a while but haven't found it yet.

Suggestion:
I hope someone who is familiar with the `Qdrant.from_documents` method, or has knowledge of other possible causes, can help me resolve this issue. The document, at approximately 30 MB, corresponds to approximately 560,000 tokens. During the process, I tried using Dramatiq, Celery, and RQ, and encountered the same issue with all of them. Therefore, we can exclude the possibility of the issue being specific to these queue tools.
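To narrow down which step allocates the memory before the embeddings call (loading, splitting, or the vector store itself), a small stdlib-only profiling wrapper might help. This is a sketch; `measure_peak` is a name I made up for illustration:

```python
import tracemalloc

def measure_peak(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, peak_bytes_allocated_during_call)."""
    tracemalloc.start()
    try:
        result = fn(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

Wrapping `loader.load()`, `text_splitter.split_documents(documents)`, and `Qdrant.from_documents(...)` individually with this helper should show which stage's peak grows out of proportion with the 30 MB input.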