langchain-ai / langchain

🦜🔗 Build context-aware reasoning applications
https://python.langchain.com
MIT License

'Qdrant.from_documents' Memory overflow #7272

Closed zhugexinxin closed 11 months ago

zhugexinxin commented 1 year ago

Issue you'd like to raise.

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import SpacyTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant


def handle_embeddings(payload):
    loader = UnstructuredPDFLoader(payload["filepath"])
    documents = loader.load()

    text_splitter = SpacyTextSplitter(pipeline=payload["language"], chunk_size=1536, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings(**payload)
    # path and collection_name come from module-level config (not shown)
    qdrant = Qdrant.from_documents(
        docs, embeddings,
        path=path,
        collection_name=collection_name,
    )

import dramatiq
from dramatiq.brokers.redis import RedisBroker
from tasks import handle_embeddings

redis_broker = RedisBroker(url='redis://redis.helloreader.docker/10')
dramatiq.set_broker(redis_broker)

@dramatiq.actor(max_retries=0)
def handle_embeddings_task(payload):
    result = handle_embeddings(payload)
    return result

Because generating embeddings and writing them to a vector database is time-consuming, I handle this work in asynchronous queue tasks. However, when processing a document of about 30 MB, the memory usage of the queue task keeps climbing until the worker crashes with an out-of-memory error. Investigating, I found that the overflow happens even before the embeddings API is called, which points at the Qdrant.from_documents call. I have been searching for the root cause for a while but haven't found it yet.
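
A quick way to narrow down where the growth happens is a tracemalloc check along these lines (purely illustrative; it reuses payload and text_splitter from the function above):

import tracemalloc

tracemalloc.start()

loader = UnstructuredPDFLoader(payload["filepath"])
documents = loader.load()
snap_after_load = tracemalloc.take_snapshot()

docs = text_splitter.split_documents(documents)
snap_after_split = tracemalloc.take_snapshot()

# Top 10 allocation sites between the two snapshots
for stat in snap_after_split.compare_to(snap_after_load, "lineno")[:10]:
    print(stat)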

Suggestion:

I hope someone familiar with the Qdrant.from_documents method, or with other possible causes, can help me resolve this. The ~30 MB document corresponds to roughly 560,000 tokens. I tried Dramatiq, Celery, and RQ and hit the same issue with all three, so we can rule out the problem being specific to any of these queue tools.

dosubot[bot] commented 1 year ago

Answer generated by a 🤖

Answer

I understand that you're experiencing a memory overflow issue when processing large documents using the Qdrant.from_documents method in your LangChain application. This issue seems to occur before the embeddings interface is called, and you've observed it across different queue tools, suggesting that it's not specific to any particular tool.

The issue appears to be related to the size of the documents you're processing. When the Qdrant.from_documents method tries to process all the documents at once, it can lead to high memory usage and eventually a memory overflow, especially with large documents.

One potential solution to this problem is to process the documents in smaller batches. This can be done by modifying the Qdrant.from_documents method to process a subset of the documents at a time, which should help reduce the memory usage.

Here's how you can modify your code to process the documents in batches:

def handle_embeddings(payload):
    loader = UnstructuredPDFLoader(payload["filepath"])
    documents = loader.load()

    text_splitter = SpacyTextSplitter(pipeline=payload["language"], chunk_size=1536, chunk_overlap=0)
    docs = text_splitter.split_documents(documents)

    embeddings = OpenAIEmbeddings(**payload)

    # Process documents in batches
    batch_size = 10000  # Adjust this value based on your system's memory capacity
    for i in range(0, len(docs), batch_size):
        batch_docs = docs[i:i+batch_size]
        qdrant = Qdrant.from_documents(
            batch_docs, embeddings,
            path=path,
            collection_name=collection_name,
        )

In this modified version, the documents are divided into batches of a size that you can specify based on your system's memory capacity, and each batch is processed separately. This approach should significantly reduce the memory usage and help prevent the memory overflow issue.
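
Note that calling Qdrant.from_documents once per batch sets the collection up again on every call. If the goal is a single collection, a sketch of an alternative (relying on add_documents, which LangChain's VectorStore base class provides) is to create the store from the first batch and append the rest:

batch_size = 100  # tune to available memory
qdrant = None
for i in range(0, len(docs), batch_size):
    batch_docs = docs[i:i + batch_size]
    if qdrant is None:
        # First batch creates the collection
        qdrant = Qdrant.from_documents(
            batch_docs, embeddings,
            path=path,
            collection_name=collection_name,
        )
    else:
        # Later batches are appended to the same collection
        qdrant.add_documents(batch_docs)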

Please note that this is just one potential solution, and there might be other factors at play. If you're using an older version of LangChain, I would also recommend upgrading to the latest version to ensure you have all the recent optimizations and bug fixes.

If this solution works for you and you believe it's a bug that could impact other users, I encourage you to make a pull request with the proposed changes. Your contribution could help improve the LangChain framework for everyone.

Sources

  • langchain/vectorstores/qdrant.py

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

zhugexinxin commented 1 year ago

Can this solution write into the same collection_name? It also seems to re-embed everything every time, which wastes a lot of tokens.

docs = text_splitter.split_documents(documents)

Splitting itself also overflows memory and crashes when the file is too large.

hopkins385 commented 1 year ago

yes, I've just opened the following PR, which adds the ability to add vectors to an existing collection in an automated way: #7334

zhugexinxin commented 1 year ago

#7334

Thank you very much, but I have a question.

"I have a scenario where I am dividing a PDF book into 319 chunks. I anticipate that Qdrant.from_documents will initiate 319 embeddings API calls. However, in every experiment, I've observed that it only initiates the embeddings API call once with approximately 1000 tokens, and then sends the remainder all at once. This can cause issues due to size limits.

I have looked at the source code and my expectation is that it should send the chunks in separate API calls. I'm unable to understand why this is happening. Can you please explain?"

hopkins385 commented 1 year ago

mh.. not sure if I can help you with that. To me it's unclear if your issue is caused by one or multiple of the following:

  • batch size
  • memory limits
  • api rate limits exceeded (RPM/TPM)
  • issues with the file itself
  • issues with the host machine (e.g. high cpu load, low memory)

So I guess you need to dig a bit deeper and debug. I would propose starting with a smaller file, e.g. splitting the file manually into multiple smaller files and feeding them into the system piece by piece. If that works, start increasing the file size (by file size I mean the amount of characters/pages).
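
For the manual split, a throwaway sketch with pypdf (just one possible tool; the file names here are made up):

from pypdf import PdfReader, PdfWriter

reader = PdfReader("book.pdf")
pages_per_part = 25
num_pages = len(reader.pages)

for part, start in enumerate(range(0, num_pages, pages_per_part)):
    writer = PdfWriter()
    for i in range(start, min(start + pages_per_part, num_pages)):
        writer.add_page(reader.pages[i])
    with open(f"book_part_{part}.pdf", "wb") as f:
        writer.write(f)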

In my case I am using a different approach, with the following code as described here: https://python.langchain.com/docs/modules/data_connection/retrievers. PS: I've indexed PDF files with up to 150 pages without an issue. Maybe worth trying out? idk

VectorstoreIndexCreator(
    vectorstore_cls=Qdrant,
    embedding=OpenAIEmbeddings(),
    text_splitter=SpacyTextSplitter(pipeline=payload["language"], chunk_size=1536, chunk_overlap=0),
    vectorstore_kwargs=dict(
        host=qdrant,
        port=6333,
        grpc_port=6334,
        prefer_grpc=True,
        collection_name=collection_name,
        recreate_collection=False,  # <-- this is new, and only available after my PR was committed to main
        vector_size=1536,  # <-- this is new, and only available after my PR was committed to main
    ),
).from_documents(documents)

zhugexinxin commented 1 year ago


I've attempted to consider the various factors that might be causing my issue:

  1. Batch size (I've already divided the document into 319 chunks)
  2. Memory limit (I have 8 GB of RAM)
  3. Exceeding the API rate limit (RPM/TPM): the limits are 350,000 tokens and 3,500 requests per minute, and I'm sure I haven't exceeded them
  4. Issues with the file itself (ruled out)
  5. Host issues, such as high CPU load or low memory (ruled out)
  6. Type of text (I've tried multiple texts, of 300,000 and 600,000 tokens)

Here is the code I'm executing:

from langchain.document_loaders import UnstructuredPDFLoader
from langchain.text_splitter import TokenTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

loader = UnstructuredPDFLoader(filepath)
documents = loader.load()

text_splitter = TokenTextSplitter(chunk_size=1536, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
# len(docs) == 319

embeddings = OpenAIEmbeddings(**payload)
qdrant = Qdrant.from_documents(
    docs, embeddings,
    path=vectorstore_path,
    collection_name=collection_name,
)

Additionally, I'd like to highlight that SpacyTextSplitter itself appears to have a memory overflow issue when the document is very large and the Chinese language model is used: memory climbs dramatically until the program exits with an overflow.
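
A workaround sketch I'm considering (my assumption, not verified against the splitter's internals): pre-chunk the raw text with a cheap character-based splitter so spaCy never has to process the whole book as one document, then run SpacyTextSplitter over the coarse pieces:

from langchain.text_splitter import CharacterTextSplitter, SpacyTextSplitter

# Coarse pass: cheap character-based split into ~50k-character pieces
coarse_splitter = CharacterTextSplitter(separator="\n", chunk_size=50000, chunk_overlap=0)
coarse_docs = coarse_splitter.split_documents(documents)

# Fine pass: sentence-aware split, one manageable piece at a time
# (zh_core_web_sm is spaCy's Chinese pipeline)
fine_splitter = SpacyTextSplitter(pipeline="zh_core_web_sm", chunk_size=1536, chunk_overlap=0)
docs = fine_splitter.split_documents(coarse_docs)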

After executing, the result I get from the official API is as follows:

2023/7/8 17:28:21 | text-embedding-ada-002 Model Request, total tokens: 344317
2023/7/8 17:28:15 | text-embedding-ada-002 Model Request, total tokens: 813
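
Looking at those numbers, one possibility (assuming OpenAIEmbeddings batches texts per request via its chunk_size field, which defaults to 1000): all 319 chunks fit within one batch, which would explain the single huge request. If that is the cause, lowering chunk_size should spread the chunks over several smaller calls, e.g.:

# chunk_size counts texts per embeddings API request, not tokens
embeddings = OpenAIEmbeddings(chunk_size=32)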

hopkins385 commented 1 year ago

@rlancemartin any ideas if this could be a memory leak?

dosubot[bot] commented 11 months ago

Hi, @zhugexinxin! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

Based on my understanding of the current state of the issue, you are experiencing a memory overflow problem when using the Qdrant.from_documents method with large documents. There has been a suggestion to process the documents in smaller batches to reduce memory usage. Additionally, there is a discussion about adding the ability to add vectors to an existing collection and a suggestion to try using the VectorstoreIndexCreator class as an alternative approach.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!