Open adityamity opened 1 month ago
To add page numbers to the metadata of a Word document (.docx) using SimpleDirectoryReader, you need to define a custom file_metadata function that extracts the page numbers from the document and includes them in the metadata dictionary. Here is an example of how you can achieve this:
Install the python-docx library if you haven't already:
pip install python-docx
Define a custom file_metadata function that extracts the page numbers from the Word document.
Pass this custom function to the SimpleDirectoryReader.
Here is the code to accomplish this:
from typing import Dict

from docx import Document
from llama_index.core.readers.file.base import SimpleDirectoryReader


def get_page_count(docx_path: str) -> int:
    doc = Document(docx_path)
    # Counts section-property elements. This equals the page count only if
    # every section starts on a new page and no section spans several pages.
    return len(doc.element.xpath('//w:sectPr'))


def custom_file_metadata(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        metadata['page_count'] = get_page_count(file_path)
    return metadata


# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=custom_file_metadata,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)
documents = reader.load_data()
In this code:
The get_page_count function uses python-docx to count the number of sections in the document, assuming each section starts on a new page.
The custom_file_metadata function adds the page count to the metadata dictionary if the file is a .docx file.
SimpleDirectoryReader is initialized with the custom_file_metadata function so that this count is included in the metadata.
This adds an estimated page count to the metadata of each Word document processed by SimpleDirectoryReader. The estimate is only as good as the one-section-per-page assumption [1].
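As an alternative to counting sections, a .docx file is a zip archive that usually contains a docProps/app.xml part in which Word records the page count it computed at the last save. The sketch below reads that cached value using only the standard library. The function name cached_page_count is made up for this illustration, and the value may be missing or stale if the file was produced or edited by a tool that does not repaginate.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace of the OOXML "extended properties" part (docProps/app.xml).
APP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def cached_page_count(docx_path: str) -> Optional[int]:
    """Return the page count stored in docProps/app.xml, or None if absent.

    Hypothetical helper: the number is whatever the saving application
    wrote, so treat it as a hint rather than a guaranteed page count.
    """
    with zipfile.ZipFile(docx_path) as zf:
        if "docProps/app.xml" not in zf.namelist():
            return None
        root = ET.fromstring(zf.read("docProps/app.xml"))
    pages = root.find(f"{APP_NS}Pages")
    if pages is None or not (pages.text or "").strip().isdigit():
        return None
    return int(pages.text)
```

The returned value could be placed in the dictionary returned by a file_metadata function, for example under a page_count key, with a fallback when it is None.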
To continue talking to Dosu, mention @dosu.
@dosu I am facing an issue again. I have the same file saved in two formats (.docx and .pdf) in the same path. The file contains 4 pages. When I read only the .pdf file using the code below:
from llama_index.core import Document, SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(data_path,
                                  file_metadata=filename_fn,
                                  #recursive=True,
                                  #num_files_limit= num_files_limit,
                                  required_exts=[".pdf"],
                                  filename_as_id=True).load_data()
print(f'length of documents : {len(documents)}')
The length of documents comes out as 4, and I get the metadata below:
documents[0].metadata
{'page_label': '1',
 'file_name': file_name}
But when I read the .docx file using the same code, page_label is missing from the metadata.
How can I get the page_label for all the pages when reading the .docx file?
To get the page_label for all pages when reading a .docx file, you need to customize the file_metadata function to include page numbers. However, unlike PDFs, .docx files do not store fixed pages: the word processor computes page boundaries when it lays out the document. You can only estimate page numbers based on the content length or other criteria.
Here's how you can modify your code to include page_label for .docx files:
Install the python-docx library if you haven't already:
pip install python-docx
Define a custom file_metadata function that estimates page numbers for .docx files.
Pass this custom function to the SimpleDirectoryReader.
Here is the updated code:
from typing import Dict

from docx import Document as DocxDocument
from llama_index.core.readers.file.base import SimpleDirectoryReader


def get_page_count(docx_path: str) -> int:
    doc = DocxDocument(docx_path)
    # Assuming each section starts on a new page
    return len(doc.element.xpath('//w:sectPr'))


def custom_file_metadata(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        page_count = get_page_count(file_path)
        metadata['page_label'] = [str(i + 1) for i in range(page_count)]
    return metadata


# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=custom_file_metadata,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)
documents = reader.load_data()

# Print metadata for each document
for doc in documents:
    print(doc.metadata)
In this code:
The get_page_count function uses python-docx to count the number of sections in the document, assuming each section starts on a new page.
The custom_file_metadata function adds a page_label list to the metadata dictionary if the file is a .docx file.
SimpleDirectoryReader is initialized with the custom_file_metadata function to include these labels in the metadata.
This adds page_label to the metadata of each .docx file processed by the SimpleDirectoryReader. It does not split the file into one document per page [1][2].
@dosu the above code is not working. When I run it for the .docx file, the length of documents comes out as 1, but it should be 4 because my document has 4 pages. I get the result below:
documents[0].metadata
{'file_name': 'file1.docx', 'page_label': ['1']}
To correctly get the page_label for all pages when reading a .docx file, you need to iterate over each page and extract the page_label for each one. The current implementation returns only one document with a single page label because it doesn't split the document into pages.
Here's an updated version of the DocxReader class that includes page-level extraction and labeling:
from pathlib import Path
from typing import Dict, List, Optional

from llama_index.core import Document
from llama_index.core.readers.base import BaseReader


class DocxReader(BaseReader):
    """Docx parser that emits one Document per paragraph."""

    def load_data(
        self,
        file: Path,
        extra_info: Optional[Dict] = None,
    ) -> List[Document]:
        """Parse file."""
        try:
            import docx
        except ImportError:
            raise ImportError(
                "python-docx is required to read Microsoft Word files: "
                "`pip install python-docx`"
            )
        doc = docx.Document(str(file))
        docs = []
        # Each paragraph is treated as a "page": the labels are paragraph
        # positions, not real page numbers.
        for i, paragraph in enumerate(doc.paragraphs):
            page_text = paragraph.text
            page_label = f"Page {i + 1}"
            metadata = {"page_label": page_label, "file_name": file.name}
            if extra_info is not None:
                metadata.update(extra_info)
            docs.append(Document(text=page_text, metadata=metadata))
        return docs


# Usage example
file_path = Path("path_to_your_docx_file.docx")
loader = DocxReader()
documents = loader.load_data(file_path)

# Print metadata for each document
for doc in documents:
    print(doc.metadata)
This code uses the python-docx library to read the .docx file and iterates over each paragraph, treating each paragraph as a separate "page" for the purpose of labeling. Each paragraph is assigned a page_label based on its order in the document. Adjust the logic as needed to fit your specific requirements for what constitutes a "page" in a .docx file [1][2].
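One way to adjust the definition of a "page" is to split on explicit page breaks instead of paragraphs. The following is a minimal sketch using only the standard library. It assumes pages in the document are separated by manual page breaks (w:br elements with w:type="page" in word/document.xml). Automatic pagination is not stored as such and is not detected, and nested paragraphs (for example in text boxes) are not handled. The names split_on_page_breaks and load_docx_pages are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import List, Tuple

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def split_on_page_breaks(document_xml: str) -> List[Tuple[str, str]]:
    """Return (page_label, text) pairs, starting a new page at each manual break."""
    root = ET.fromstring(document_xml)
    pages: List[List[str]] = [[]]
    for para in root.iter(f"{W}p"):
        parts: List[str] = []
        for node in para.iter():
            if node.tag == f"{W}br" and node.get(f"{W}type") == "page":
                # Text collected so far belongs to the page that just ended.
                pages[-1].append("".join(parts))
                parts = []
                pages.append([])
            elif node.tag == f"{W}t" and node.text:
                parts.append(node.text)
        pages[-1].append("".join(parts))
    return [
        (str(i + 1), "\n".join(p for p in chunk if p))
        for i, chunk in enumerate(pages)
    ]


def load_docx_pages(docx_path: str) -> List[Tuple[str, str]]:
    """Read word/document.xml from a .docx archive and split it into pages."""
    with zipfile.ZipFile(docx_path) as zf:
        xml_text = zf.read("word/document.xml").decode("utf-8")
    return split_on_page_breaks(xml_text)
```

Each returned pair could be wrapped in a Document inside a custom reader's load_data, with the label stored under page_label as a string to match what the PDF reader emits.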
@dosu - can't we use SimpleDirectoryReader and add page_label?
To add page_label for each page when reading a .docx file using SimpleDirectoryReader, you can specify a function that extracts metadata for each file and pass it as the file_metadata parameter. Here is an example of how you can achieve this:
from docx import Document as DocxDocument
from llama_index.core import SimpleDirectoryReader


def get_meta(file_path):
    # Open the .docx file and build one label per paragraph
    # (paragraphs stand in for pages here)
    doc = DocxDocument(file_path)
    page_labels = [f"Page {i+1}" for i in range(len(doc.paragraphs))]
    return {"page_labels": page_labels, "file_path": file_path}


reader = SimpleDirectoryReader(
    input_dir="path/to/directory",
    file_metadata=get_meta,
    required_exts=[".docx"]
)
documents = reader.load_data()

for doc in documents:
    print(doc.metadata)  # This will include the page labels
In this example, the get_meta function opens the .docx file, generates one label per paragraph as a stand-in for pages, and returns them as part of the metadata. The SimpleDirectoryReader then uses this function to attach the metadata to each Document object it creates [1][2].
@dosu instead of using len(doc.paragraphs), can we use the number of pages in the range?
page_labels = [f"Page {i+1}" for i in range(len(doc.paragraphs))]
To use the actual number of pages instead of the number of paragraphs for generating page_labels, you need a method to accurately determine the number of pages in a .docx file. The python-docx library does not provide a direct way to count pages, but you can estimate it based on sections or other criteria.
Here's an updated example that uses sections to estimate the number of pages:
from docx import Document as DocxDocument
from llama_index.core import SimpleDirectoryReader


def get_page_count(docx_path: str) -> int:
    doc = DocxDocument(docx_path)
    # Assuming each section starts on a new page
    return len(doc.element.xpath('//w:sectPr'))


def get_meta(file_path):
    # Open the .docx file and extract page labels
    page_count = get_page_count(file_path)
    page_labels = [f"Page {i+1}" for i in range(page_count)]
    return {"page_labels": page_labels, "file_path": file_path}


reader = SimpleDirectoryReader(
    input_dir="path/to/directory",
    file_metadata=get_meta,
    required_exts=[".docx"]
)
documents = reader.load_data()

for doc in documents:
    print(doc.metadata)  # This will include the page labels
In this example:
The get_page_count function uses python-docx to count the number of sections in the document, assuming each section starts on a new page.
The get_meta function generates page_labels based on that count and includes them in the metadata.
This approach gives a usable estimate only when the document really has one section per page. A document with a single section yields a count of 1 regardless of its length [1].
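To see why the section count can differ from the page count, it may help to count the layout markers in word/document.xml separately. The sketch below is an illustration under assumptions: count_layout_markers is a hypothetical helper, it takes the XML as a string, and the w:lastRenderedPageBreak hints it counts are only present if the saving application wrote them.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# WordprocessingML main namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def count_layout_markers(document_xml: str) -> Dict[str, int]:
    """Count section properties and page-break markers in document.xml."""
    root = ET.fromstring(document_xml)
    return {
        # One w:sectPr per section; many documents have exactly one.
        "sections": sum(1 for _ in root.iter(f"{W}sectPr")),
        # Manual page breaks inserted by the author.
        "explicit_page_breaks": sum(
            1 for br in root.iter(f"{W}br") if br.get(f"{W}type") == "page"
        ),
        # Pagination hints cached by the application at save time, if any.
        "rendered_page_breaks": sum(
            1 for _ in root.iter(f"{W}lastRenderedPageBreak")
        ),
    }
```

A document with one section and several page breaks reports sections as 1, which matches the page_count of 1 that the section-based approach returns.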
@dosu the above code is not able to count the pages. It gives page_count as 1, even though I have 4 pages in my Word file.
The python-docx library does not provide a direct way to count the number of pages in a .docx file. However, you can use the docx2txt library to extract the text and estimate the number of pages based on the text content.
Here's an updated approach using docx2txt to extract text and then estimating the page count with a heuristic (e.g., number of characters per page):
Install the docx2txt library if you haven't already:
pip install docx2txt
Define a custom file_metadata function that estimates the number of pages based on the text content.
Pass this custom function to the SimpleDirectoryReader.
Here is the updated code:
from typing import Dict

import docx2txt
from llama_index.core import SimpleDirectoryReader


def estimate_page_count(text: str, chars_per_page: int = 2000) -> int:
    # Estimate the number of pages based on the number of characters
    return max(1, len(text) // chars_per_page + (1 if len(text) % chars_per_page > 0 else 0))


def get_meta(file_path: str) -> Dict:
    metadata = {}
    if file_path.endswith('.docx'):
        text = docx2txt.process(file_path)
        page_count = estimate_page_count(text)
        metadata['page_labels'] = [f"Page {i+1}" for i in range(page_count)]
    return metadata


# Initialize the SimpleDirectoryReader with the custom file_metadata function
reader = SimpleDirectoryReader(
    input_dir='path/to/your/directory',
    file_metadata=get_meta,
    required_exts=[".docx", ".pdf"],
    filename_as_id=True
)
documents = reader.load_data()

# Print metadata for each document
for doc in documents:
    print(doc.metadata)
In this code:
The estimate_page_count function estimates the number of pages based on the number of characters in the text, assuming an average of 2000 characters per page. You can adjust the chars_per_page parameter based on your specific needs.
The get_meta function uses docx2txt to extract the text from the .docx file and then estimates the number of pages, adding page_labels to the metadata.
This approach gives a rough estimate of the number of pages in the .docx file. It will not match Word's pagination exactly, because fonts, margins, images, and tables are ignored [1][2].
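If the goal is one document per estimated page rather than a list of labels on a single document, the same heuristic can be used to cut the text into chunks. This is a minimal sketch, assuming plain text has already been extracted (for example with docx2txt). The helper name split_into_estimated_pages and the 2000-character default are illustrative choices, not part of any library.

```python
import math
from typing import Dict, List


def estimate_page_count(text: str, chars_per_page: int = 2000) -> int:
    """Ceiling division of the text length, with a minimum of one page."""
    return max(1, math.ceil(len(text) / chars_per_page))


def split_into_estimated_pages(
    text: str, chars_per_page: int = 2000
) -> List[Dict[str, str]]:
    """Cut text into fixed-size chunks and label each chunk as a page."""
    count = estimate_page_count(text, chars_per_page)
    return [
        {
            "page_label": str(i + 1),
            "text": text[i * chars_per_page:(i + 1) * chars_per_page],
        }
        for i in range(count)
    ]
```

Fixed-size cuts can split a word or sentence in half, so a real implementation would probably snap each cut to the nearest paragraph boundary.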
@dosu I want to know how llama-index handles a .docx file using SimpleDirectoryReader. The same file in .pdf format gives the correct length of documents and page_label, but when the same file is used in .docx format, the length of documents is 1 and there is no page label either:
from llama_index.core import Document, SimpleDirectoryReader

filename_fn = lambda filename: {"file_name": filename}
documents = SimpleDirectoryReader(data_path,
                                  file_metadata=filename_fn,
                                  #num_files_limit= num_files_limit,
                                  required_exts=[".pdf"],
                                  filename_as_id=True).load_data()
print(f'length of documents : {len(documents)}')
llama-index handles .docx files using the DocxReader class within the SimpleDirectoryReader. The discrepancy in document length and the absence of page_label for .docx files compared to .pdf files arises from differences in how these file formats are parsed and processed by their respective readers.
The PDFReader is designed to handle pagination and extract metadata such as page_label, while the DocxReader does not support these features, leading to the observed differences. Specifically, the PDFReader can return each page of a PDF as its own Document with associated page_label metadata, whereas the DocxReader processes the entire .docx file as a single Document without page-level metadata.
Here is an example of how SimpleDirectoryReader is used with DocxReader:
from llama_index.legacy.readers.file.docs_reader import DocxReader
from llama_index.legacy.readers.file.base import SimpleDirectoryReader

# Docx Reader example
parser = DocxReader()
file_extractor = {".docx": parser}
documents = SimpleDirectoryReader(
    input_dir="./data", file_extractor=file_extractor
).load_data()
The SimpleDirectoryReader automatically selects the best file reader given file extensions, and the DocxReader may not extract the same level of detail as the PDFReader. This results in the .docx file being treated as a single document with no page_label metadata, while the .pdf file is parsed with detailed pagination and metadata [1][2][3][4].
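The behaviour described above can be pictured with a toy model. None of the names below are LlamaIndex APIs: toy_pdf_reader, toy_docx_reader, TOY_READERS, and toy_load are made up for illustration, and documents are plain dictionaries. The sketch only shows the shape of the dispatch: a reader is chosen by file extension, the reader decides how many documents come back, and the per-file metadata callback is merged into every one of them.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]


def toy_pdf_reader(pages: List[str]) -> List[Doc]:
    """Stand-in for a paginated reader: one document per page, with a label."""
    return [
        {"text": text, "metadata": {"page_label": str(i + 1)}}
        for i, text in enumerate(pages)
    ]


def toy_docx_reader(pages: List[str]) -> List[Doc]:
    """Stand-in for a flow-text reader: the whole file becomes one document."""
    return [{"text": "\n".join(pages), "metadata": {}}]


# Hypothetical extension-to-reader table.
TOY_READERS: Dict[str, Callable[[List[str]], List[Doc]]] = {
    ".pdf": toy_pdf_reader,
    ".docx": toy_docx_reader,
}


def toy_load(
    file_name: str,
    pages: List[str],
    file_metadata: Callable[[str], Dict[str, Any]],
) -> List[Doc]:
    """Pick a reader by extension, then merge file-level metadata into each doc."""
    ext = "." + file_name.rsplit(".", 1)[-1].lower()
    docs = TOY_READERS[ext](pages)
    for doc in docs:
        doc["metadata"].update(file_metadata(file_name))
    return docs
```

In this model the metadata callback runs once per file and cannot change how many documents are produced, which is why adding labels through file_metadata alone leaves the .docx as a single document. Getting one document per page requires a reader that splits the file, registered for the .docx extension through file_extractor.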
Question Validation
Question
I am reading a Word file (.docx) using the code below:
How can I add page numbers to this document's metadata?