langchain-ai / langchain


Parsing an MS Word file and then splitting it into chunks causes OOM #24115

Open jawar-cn opened 1 month ago

jawar-cn commented 1 month ago

Example Code

# 1st key snippet
loader = UnstructuredWordDocumentLoader(
    self.file_path, mode="paged", strategy="fast", infer_table_structure=EXTRACT_TABLES
)

# 2nd key snippet: TextSplitter.create_documents (imports added here for context)
import copy
from typing import List, Optional

from langchain_core.documents import Document
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = 0
            previous_chunk_len = 0
            for chunk in self.split_text(text):
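                # NOTE: deepcopy duplicates the entire metadata dict, including any
                # large text_as_html entry, once per chunk; this is the memory
                # blow-up described below.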
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    offset = index + previous_chunk_len - self._chunk_overlap
                    index = text.find(chunk, max(0, offset))
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

Error Message and Stack Trace (if applicable)

No response

Description

Using from langchain_community.document_loaders import UnstructuredWordDocumentLoader to parse a Word file that contains a big table yields Document objects. If infer_table_structure=True, which is the default, each document's metadata contains a text_as_html entry, which is a large object. When a TextSplitter then splits the documents into chunks, each chunk's Document deep-copies the metadata once. With many chunks, memory usage grows sharply until the process hits OOM and terminates.
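
For illustration, here is a minimal sketch (not from the issue) of why a per-chunk deepcopy blows up memory. The nested text_as_html stand-in is an assumption about the real object's shape:

    import copy

    # Hypothetical stand-in: metadata whose text_as_html entry is a large
    # nested structure (an assumption about the real object's shape).
    big_metadata = {"text_as_html": [["cell"] * 50 for _ in range(10_000)]}

    # Mirrors create_documents above: every chunk deep-copies the whole
    # metadata, so the large structure is duplicated once per chunk.
    chunks = [f"chunk {i}" for i in range(500)]
    copies = [copy.deepcopy(big_metadata) for _ in chunks]  # roughly 500x the memory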

System Info

langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
langchain-text-splitters==0.2.1

radheradhe01 commented 1 month ago

The issue is likely due to the text_as_html property in the metadata, which is a large object that is being deep-copied unnecessarily.

To fix this, try setting infer_table_structure=False when creating the UnstructuredWordDocumentLoader, like this:

loader = UnstructuredWordDocumentLoader(self.file_path, mode="paged", strategy="fast", infer_table_structure=False)

This will prevent the text_as_html property from being generated and reduce the memory usage.
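
For context, here is a minimal end-to-end sketch with the flag disabled; the demo.docx path and the chunk sizes are assumptions for illustration:

    from langchain_community.document_loaders import UnstructuredWordDocumentLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Assumption: demo.docx is a local file containing a large table.
    loader = UnstructuredWordDocumentLoader(
        "demo.docx", mode="paged", strategy="fast", infer_table_structure=False
    )
    docs = loader.load()

    # Split without the heavy text_as_html metadata in play.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)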

If the issue persists, comment on this issue again and I will look into it further.

jawar-cn commented 1 month ago

Thanks. Yes, that's what I tried, and it fixed the problem. I also think that when deep-copying metadata, some cases may need to selectively ignore certain attributes, for situations where infer_table_structure=True and text splitting are needed at the same time. A sketch of that idea follows.
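
As a hedged illustration of that suggestion (not an existing langchain API), one could subclass a splitter and drop the heavy key before copying. The LeanMetadataSplitter class and EXCLUDED_KEYS set are hypothetical names, and the sketch omits the start_index bookkeeping from the original method for brevity:

    import copy
    from typing import List, Optional

    from langchain_core.documents import Document
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Hypothetical: metadata keys to leave out of each chunk's copy.
    EXCLUDED_KEYS = {"text_as_html"}

    class LeanMetadataSplitter(RecursiveCharacterTextSplitter):
        def create_documents(
            self, texts: List[str], metadatas: Optional[List[dict]] = None
        ) -> List[Document]:
            _metadatas = metadatas or [{}] * len(texts)
            documents = []
            for i, text in enumerate(texts):
                # Strip the heavy keys once per source text, so the per-chunk
                # deepcopy only touches the small remaining dict.
                slim = {k: v for k, v in _metadatas[i].items() if k not in EXCLUDED_KEYS}
                for chunk in self.split_text(text):
                    documents.append(
                        Document(page_content=chunk, metadata=copy.deepcopy(slim))
                    )
            return documents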

radheradhe01 commented 1 month ago

Glad to hear that 😊