langchain-ai / langchain


Parsing an MS Word file and then splitting it into chunks causes OOM #24115

Open jawar-cn opened 1 month ago

jawar-cn commented 1 month ago

Example Code

# 1st key snippet
loader = UnstructuredWordDocumentLoader(
    self.file_path, mode="paged", strategy="fast", infer_table_structure=EXTRACT_TABLES
)

# 2nd key snippet: TextSplitter.create_documents (imports added here for context)
import copy
from typing import List, Optional

from langchain_core.documents import Document
    def create_documents(
        self, texts: List[str], metadatas: Optional[List[dict]] = None
    ) -> List[Document]:
        """Create documents from a list of texts."""
        _metadatas = metadatas or [{}] * len(texts)
        documents = []
        for i, text in enumerate(texts):
            index = 0
            previous_chunk_len = 0
            for chunk in self.split_text(text):
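                # NOTE: deepcopy duplicates the entire metadata dict, including any
                # large text_as_html entry, once per chunk; this is the memory
                # blow-up described below.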
                metadata = copy.deepcopy(_metadatas[i])
                if self._add_start_index:
                    offset = index + previous_chunk_len - self._chunk_overlap
                    index = text.find(chunk, max(0, offset))
                    metadata["start_index"] = index
                    previous_chunk_len = len(chunk)
                new_doc = Document(page_content=chunk, metadata=metadata)
                documents.append(new_doc)
        return documents

Error Message and Stack Trace (if applicable)

No response

Description

Using from langchain_community.document_loaders import UnstructuredWordDocumentLoader to parse a Word file that contains a big table yields Document objects. If infer_table_structure=True, which is the default, each document's metadata contains a text_as_html entry, which is a large object. When a TextSplitter then splits the documents into chunks, each chunk's Document deep-copies the metadata once. With many chunks, memory usage grows sharply until the process hits OOM and terminates.
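
For illustration, here is a minimal sketch (not from the issue) of why a per-chunk deepcopy blows up memory. The nested text_as_html stand-in is an assumption about the real object's shape:

    import copy

    # Hypothetical stand-in: metadata whose text_as_html entry is a large
    # nested structure (an assumption about the real object's shape).
    big_metadata = {"text_as_html": [["cell"] * 50 for _ in range(10_000)]}

    # Mirrors create_documents above: every chunk deep-copies the whole
    # metadata, so the large structure is duplicated once per chunk.
    chunks = [f"chunk {i}" for i in range(500)]
    copies = [copy.deepcopy(big_metadata) for _ in chunks]  # roughly 500x the memory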

System Info

langchain==0.2.3
langchain-community==0.2.4
langchain-core==0.2.5
langchain-text-splitters==0.2.1

radheradhe01 commented 1 month ago

The issue is likely due to the text_as_html property in the metadata, which is a large object that is being deep-copied unnecessarily.

To fix this, try setting infer_table_structure=False when creating the UnstructuredWordDocumentLoader, like this:

loader = UnstructuredWordDocumentLoader(self.file_path, mode="paged", strategy="fast", infer_table_structure=False)

This will prevent the text_as_html property from being generated and reduce the memory usage.
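
For context, here is a minimal end-to-end sketch with the flag disabled; the demo.docx path and the chunk sizes are assumptions for illustration:

    from langchain_community.document_loaders import UnstructuredWordDocumentLoader
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Assumption: demo.docx is a local file containing a large table.
    loader = UnstructuredWordDocumentLoader(
        "demo.docx", mode="paged", strategy="fast", infer_table_structure=False
    )
    docs = loader.load()

    # Split without the heavy text_as_html metadata in play.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(docs)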

If the issue persists, comment on this issue again and I will look into it further.

jawar-cn commented 1 month ago

Thanks. Yes, that's what I tried, and it fixed the problem. I also think that when deep-copying metadata, some cases may need to selectively ignore certain attributes, for situations where infer_table_structure=True and text splitting are needed at the same time. A sketch of that idea follows.
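
As a hedged illustration of that suggestion (not an existing langchain API), one could subclass a splitter and drop the heavy key before copying. The LeanMetadataSplitter class and EXCLUDED_KEYS set are hypothetical names, and the sketch omits the start_index bookkeeping from the original method for brevity:

    import copy
    from typing import List, Optional

    from langchain_core.documents import Document
    from langchain_text_splitters import RecursiveCharacterTextSplitter

    # Hypothetical: metadata keys to leave out of each chunk's copy.
    EXCLUDED_KEYS = {"text_as_html"}

    class LeanMetadataSplitter(RecursiveCharacterTextSplitter):
        def create_documents(
            self, texts: List[str], metadatas: Optional[List[dict]] = None
        ) -> List[Document]:
            _metadatas = metadatas or [{}] * len(texts)
            documents = []
            for i, text in enumerate(texts):
                # Strip the heavy keys once per source text, so the per-chunk
                # deepcopy only touches the small remaining dict.
                slim = {k: v for k, v in _metadatas[i].items() if k not in EXCLUDED_KEYS}
                for chunk in self.split_text(text):
                    documents.append(
                        Document(page_content=chunk, metadata=copy.deepcopy(slim))
                    )
            return documents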

radheradhe01 commented 1 month ago

Glad to hear that 😊