run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Adding page number from word document in metadata #16118

Open adityamity opened 1 month ago

adityamity commented 1 month ago

Question

I am reading a Word file (.docx) using the code below:

from llama_index.core import Document, SimpleDirectoryReader
filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(data_path,  
                                  file_metadata=filename_fn, 
                                  #recursive=True, 
                                   #num_files_limit= num_files_limit, 
                                  required_exts=[".docx",".pdf"], 
                                 filename_as_id=True).load_data()

How can I add the page number to the document metadata?

dosubot[bot] commented 1 month ago

To add page numbers to the metadata of a Word document (.docx) using the SimpleDirectoryReader, you need to define a custom file_metadata function that extracts the page numbers from the document and includes them in the metadata dictionary. Here is an example of how you can achieve this:

  1. Install the python-docx library if you haven't already:

    pip install python-docx
  2. Define a custom file_metadata function that extracts the page numbers from the Word document.

  3. Pass this custom function to the SimpleDirectoryReader.

Here is the code to accomplish this:

from docx import Document
from llama_index.core.readers.file.base import SimpleDirectoryReader
from typing import Dict

def get_page_count(docx_path: str) -> int:
    doc = Document(docx_path)
    # Assuming each section starts on a new page
    return len(doc.element.xpath('//w:sectPr'))

def custom_file_metadata(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        metadata['page_count'] = get_page_count(file_path)
    return metadata

# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=custom_file_metadata,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)

documents = reader.load_data()

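In this code, get_page_count counts the w:sectPr (section properties) elements in the document, and custom_file_metadata stores that count under page_count for each .docx file processed by the SimpleDirectoryReader. Because the count reflects sections rather than rendered pages, it matches the real page count only when every section occupies exactly one page [1].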

To continue talking to Dosu, mention @dosu.

adityamity commented 1 month ago

@dosu I am facing an issue again. I have the same file in two formats (.docx and .pdf) in the same path. The file contains 4 pages. When I read only the .pdf file using the code below:

from llama_index.core import Document, SimpleDirectoryReader
filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(data_path,  
                                  file_metadata=filename_fn, 
                                  #recursive=True, 
                                   #num_files_limit= num_files_limit, 
                                  required_exts=[".pdf"], 
                                 filename_as_id=True).load_data()

print(f'length of documents : {len(documents)}')

The length of documents comes out as 4, and I get the metadata below:

documents[0].metadata

{'page_label': '1',
 'file_name': file_name}

But when I read the .docx file using the same code, page_label is missing from the metadata. How can I get the page_label for all the pages when reading the .docx file?

dosubot[bot] commented 1 month ago

To get the page_label for all pages when reading a .docx file, you need to customize the file_metadata function to include page numbers. However, unlike PDFs, .docx files do not inherently contain page number information. You can estimate page numbers based on the content length or other criteria.

Here's how you can modify your code to include page_label for .docx files:

  1. Install the python-docx library if you haven't already:

    pip install python-docx
  2. Define a custom file_metadata function that estimates page numbers for .docx files.

  3. Pass this custom function to the SimpleDirectoryReader.

Here is the updated code:

from docx import Document as DocxDocument
from llama_index.core.readers.file.base import SimpleDirectoryReader
from typing import Dict

def get_page_count(docx_path: str) -> int:
    doc = DocxDocument(docx_path)
    # Assuming each section starts on a new page
    return len(doc.element.xpath('//w:sectPr'))

def custom_file_metadata(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        page_count = get_page_count(file_path)
        metadata['page_label'] = [str(i + 1) for i in range(page_count)]
    return metadata

# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=custom_file_metadata,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)

documents = reader.load_data()

# Print metadata for each document
for doc in documents:
    print(doc.metadata)

In this code, get_page_count counts section elements as a stand-in for pages, and custom_file_metadata stores the labels '1' through that count as a list under page_label for each .docx file processed by the SimpleDirectoryReader [1][2].

adityamity commented 1 month ago

@dosu the above code is not working. When I run it on a .docx file, the length of documents comes out as 1, although it should be 4 because my document has 4 pages. I am getting the result below:

documents[0].metadata
{'file_name': 'file1.docx', 'page_label': ['1']}

dosubot[bot] commented 1 month ago

To correctly get the page_label for all pages when reading a .docx file, you need to iterate over each page and extract the page_label for each one. The current implementation returns only one document with a single page label because it doesn't split the document into pages.

Here's an updated version of the DocxReader class that includes page-level extraction and labeling:

from pathlib import Path
from typing import List, Optional, Dict
from llama_index.core import Document
from llama_index.core.readers.base import BaseReader

class DocxReader(BaseReader):
    """Docx parser."""

    def load_data(
        self,
        file: Path,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        """Parse file."""
        try:
            import docx
        except ImportError:
            raise ImportError(
                "python-docx is required to read Microsoft Word files: "
                "`pip install python-docx`"
            )

        doc = docx.Document(file)
        docs = []
        for i, paragraph in enumerate(doc.paragraphs):
            page_text = paragraph.text
            page_label = f"Page {i + 1}"

            metadata = {"page_label": page_label, "file_name": file.name}
            if extra_info is not None:
                metadata.update(extra_info)

            docs.append(Document(text=page_text, metadata=metadata))

        return docs

# Usage example
file_path = Path("path_to_your_docx_file.docx")
loader = DocxReader()
documents = loader.load_data(file_path)

# Print metadata for each document
for doc in documents:
    print(doc.metadata)

This code uses the python-docx library to read the .docx file and iterates over each paragraph, treating each paragraph as a separate "page" for the purpose of labeling. Each paragraph is assigned a page_label based on its order in the document. Adjust the logic as needed to fit your specific requirements for what constitutes a "page" in a .docx file [1][2].
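
As one possible way to adjust what counts as a "page", the sketch below splits the document text at page-break markers instead of at every paragraph. It reads word/document.xml directly with the standard library and treats both explicit breaks (w:br with w:type="page") and Word's w:lastRenderedPageBreak hints as page boundaries. The function name split_docx_into_pages is hypothetical, and the approach assumes the file was last saved by an application that writes those hints. A marker that arrives while the current page is still empty is ignored, so an explicit break followed by its rendered-break hint counts once; as a side effect, blank pages are collapsed, and content in text boxes is not handled specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace prefix used by WordprocessingML elements and attributes.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_page_break(node):
    """True for an explicit page break or a last-rendered page break hint."""
    if node.tag == W + "lastRenderedPageBreak":
        return True
    return node.tag == W + "br" and node.get(W + "type") == "page"


def split_docx_into_pages(docx_path):
    """Hypothetical helper: return one text string per detected page."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    pages = []
    current = []  # paragraph texts collected for the page being built
    for paragraph in root.iter(W + "p"):
        fragments = []
        for node in paragraph.iter():
            if _is_page_break(node):
                text = "".join(fragments)
                if text:
                    current.append(text)
                fragments = []
                # Close the page only if it already holds some text.
                if current:
                    pages.append("\n".join(current))
                    current = []
            elif node.tag == W + "t" and node.text:
                fragments.append(node.text)
        text = "".join(fragments)
        if text:
            current.append(text)
    if current:
        pages.append("\n".join(current))
    return pages
```

Each returned string could then be wrapped in its own Document with metadata such as {"page_label": str(i + 1)}, mirroring the labels the PDF reader produces. The page boundaries are only as reliable as the markers stored in the file.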

adityamity commented 1 month ago

@dosu - can't we use SimpleDirectoryReader and add page_label?

dosubot[bot] commented 1 month ago

To add page_label for each page when reading a .docx file using SimpleDirectoryReader, you can specify a function that extracts metadata for each file and pass it as the file_metadata parameter. Here is an example of how you can achieve this:

from llama_index.core import SimpleDirectoryReader
from docx import Document as DocxDocument

def get_meta(file_path):
    # Open the .docx file and extract page labels
    doc = DocxDocument(file_path)
    page_labels = [f"Page {i+1}" for i in range(len(doc.paragraphs))]
    return {"page_labels": page_labels, "file_path": file_path}

reader = SimpleDirectoryReader(
    input_dir="path/to/directory",
    file_metadata=get_meta,
    required_exts=[".docx"]
)

documents = reader.load_data()
for doc in documents:
    print(doc.metadata)  # This will include the page labels

In this example, the get_meta function opens the .docx file, generates one label per paragraph (not per rendered page), and returns the labels as part of the metadata. The SimpleDirectoryReader then uses this function to attach the metadata to each Document object it creates [1][2].

adityamity commented 1 month ago

@dosu instead of using len(doc.paragraphs), can we use the number of pages in the range?

page_labels = [f"Page {i+1}" for i in range(len(doc.paragraphs))]

dosubot[bot] commented 1 month ago

To use the actual number of pages instead of the number of paragraphs for generating page_labels, you need a method to accurately determine the number of pages in a .docx file. The python-docx library does not provide a direct way to count pages, but you can estimate it based on sections or other criteria.

Here's an updated example that uses sections to estimate the number of pages:

from llama_index.core import SimpleDirectoryReader
from docx import Document as DocxDocument

def get_page_count(docx_path: str) -> int:
    doc = DocxDocument(docx_path)
    # Assuming each section starts on a new page
    return len(doc.element.xpath('//w:sectPr'))

def get_meta(file_path):
    # Open the .docx file and extract page labels
    page_count = get_page_count(file_path)
    page_labels = [f"Page {i+1}" for i in range(page_count)]
    return {"page_labels": page_labels, "file_path": file_path}

reader = SimpleDirectoryReader(
    input_dir="path/to/directory",
    file_metadata=get_meta,
    required_exts=[".docx"]
)

documents = reader.load_data()
for doc in documents:
    print(doc.metadata)  # This will include the page labels

In this example, get_page_count counts the w:sectPr elements and get_meta generates one label per counted section. The result matches the number of pages only if every section occupies exactly one page [1].
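
Another option worth trying, sketched below with the standard library only, is to read the page count that the authoring application stored in the file's extended properties. Word typically writes a Pages element to docProps/app.xml when it saves a document. The function name read_stored_page_count is hypothetical, the part path is the conventional location rather than a guarantee, and the stored value may be missing or stale if the file was produced or modified by another tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the extended-properties part (docProps/app.xml).
EXT_NS = "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"


def read_stored_page_count(docx_path):
    """Hypothetical helper: return the stored <Pages> value, or None."""
    with zipfile.ZipFile(docx_path) as archive:
        # Assumption: the extended properties live at the conventional path.
        if "docProps/app.xml" not in archive.namelist():
            return None
        root = ET.fromstring(archive.read("docProps/app.xml"))

    pages = root.find("{" + EXT_NS + "}Pages")
    if pages is None or not (pages.text or "").strip().isdigit():
        return None
    return int(pages.text)
```

A file_metadata function could call this helper and fall back to another estimate when it returns None.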

adityamity commented 1 month ago

@dosu the above code is not able to count the pages. It gives page_count as 1, even though I have 4 pages in my Word file.

dosubot[bot] commented 1 month ago

The python-docx library does not provide a direct way to count the number of pages in a .docx file. However, you can use the docx2txt library to extract the text and estimate the number of pages from the length of that text.

Here's an updated approach using docx2txt to extract text and then splitting it into pages based on some heuristic (e.g., number of characters per page):

  1. Install the docx2txt library if you haven't already:

    pip install docx2txt
  2. Define a custom file_metadata function that estimates the number of pages based on the text content.

  3. Pass this custom function to the SimpleDirectoryReader.

Here is the updated code:

import docx2txt
from llama_index.core import SimpleDirectoryReader
from typing import Dict

def estimate_page_count(text: str, chars_per_page: int = 2000) -> int:
    # Estimate the number of pages based on the number of characters
    return max(1, len(text) // chars_per_page + (1 if len(text) % chars_per_page > 0 else 0))

def get_meta(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        text = docx2txt.process(file_path)
        page_count = estimate_page_count(text)
        metadata['page_labels'] = [f"Page {i+1}" for i in range(page_count)]
    return metadata

# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=get_meta,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)

documents = reader.load_data()

# Print metadata for each document
for doc in documents:
    print(doc.metadata)

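In this code, estimate_page_count divides the length of the extracted text by chars_per_page (2000 by default) and rounds up, and get_meta turns the result into a list of labels. The estimate depends on fonts, spacing, and layout, so it will not always match the page count that Word displays [1][2].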
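
If the goal is one record per estimated page rather than a single list of labels, the same heuristic can be used to cut the text itself. The sketch below is a minimal illustration under that assumption: chunk_text_into_pages is a hypothetical helper, the 2000-character default is the same arbitrary figure used above, and the cuts fall at fixed character offsets rather than at real page boundaries.

```python
import math


def chunk_text_into_pages(text, chars_per_page=2000):
    """Hypothetical helper: split text into fixed-size chunks with labels."""
    # Always produce at least one page, even for empty text.
    page_count = max(1, math.ceil(len(text) / chars_per_page))
    return [
        {
            "page_label": str(i + 1),
            "text": text[i * chars_per_page:(i + 1) * chars_per_page],
        }
        for i in range(page_count)
    ]
```

Each dictionary could be turned into a separate Document, with page_label placed in its metadata, so that the number of documents matches the estimated number of pages.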

adityamity commented 1 month ago

@dosu I want to know how llama-index handles .docx files using SimpleDirectoryReader. The same file in .pdf format gives the correct length of documents and page_label, but when the same file is used in .docx format, the document length is 1 and there is no page label either:

from llama_index.core import Document, SimpleDirectoryReader
filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(data_path,
                                  file_metadata=filename_fn,
                                  #recursive=True,
                                  #num_files_limit= num_files_limit,
                                  required_exts=[".pdf"],
                                  filename_as_id=True).load_data()

print(f'length of documents : {len(documents)}')

dosubot[bot] commented 1 month ago

llama-index handles .docx files using the DocxReader class within the SimpleDirectoryReader. The discrepancy in document length and the absence of page_label for .docx files compared to .pdf files arises from differences in how these file formats are parsed and processed by their respective readers.

The PDFReader is designed to handle pagination and extract metadata such as page_label, while the DocxReader does not support these features, leading to the observed differences. Specifically, the PDFReader can return each page of a PDF as its own Document with associated page_label metadata, whereas the DocxReader processes the entire .docx file as a single Document without page-level metadata.

Here is an example of how SimpleDirectoryReader is used with DocxReader:

from llama_index.legacy.readers.file.docs_reader import DocxReader
from llama_index.legacy.readers.file.base import SimpleDirectoryReader

# Docx Reader example
parser = DocxReader()
file_extractor = {".docx": parser}
documents = SimpleDirectoryReader(
    input_dir="./data", file_extractor=file_extractor
).load_data()

The SimpleDirectoryReader automatically selects the best file reader given file extensions, and the DocxReader may not extract the same level of detail as the PDFReader. This results in the .docx file being treated as a single document with no page_label metadata, while the .pdf file is parsed with detailed pagination and metadata [1][2][3][4].
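
To check this difference on your own files, a small helper can tally how many Document objects each file produced and how many of them carry a page_label. This is a sketch only: summarize_documents is a hypothetical name, and it assumes each object exposes a .metadata dictionary containing the file_name key set by the filename_fn shown earlier in this thread.

```python
from collections import defaultdict


def summarize_documents(documents):
    """Hypothetical helper: count documents and page labels per file."""
    summary = defaultdict(lambda: {"documents": 0, "with_page_label": 0})
    for doc in documents:
        # Assumption: file_name was added through the file_metadata function.
        entry = summary[doc.metadata.get("file_name", "<unknown>")]
        entry["documents"] += 1
        if "page_label" in doc.metadata:
            entry["with_page_label"] += 1
    return dict(summary)
```

For the files described above, you would expect the .pdf entry to show four documents, each with a page label, and the .docx entry to show a single document without one.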