[Question]: How to deal with images in doc or pdf? Is there a way to ignore these images?

cxycxm commented 2 weeks ago

Question Validation

[X] I have searched both the documentation and discord for an answer.

Question

How to deal with images in doc or pdf? Is there a way to ignore these images?

dosubot[bot] commented 2 weeks ago

To ignore images, you can use the html2text library and set the ignore_images attribute to True. Note that html2text takes HTML as input, so this applies to DOC or PDF content only once it has been converted to HTML. Here's an example:

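import html2text

text_maker = html2text.HTML2Text()
text_maker.ignore_links = True
text_maker.ignore_images = True

# Convert an HTML string; the <img> tag is dropped from the resulting text
html = '<p>Intro text</p><img src="figure.png" alt="figure">'
print(text_maker.handle(html))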

With text_maker.ignore_images = True, image tags in the HTML are left out of the converted text ^[1].

To continue talking to Dosu, mention @dosu.

cxycxm commented 2 weeks ago

How do I use this with SimpleDirectoryReader?

dosubot[bot] commented 2 weeks ago

To use SimpleDirectoryReader with DOCX or PDF files and ignore images, you can subclass the file readers and clean up the extracted text after each file is read. Here's an example of how to achieve this:

1. Create custom readers: extend the DocxReader and PDFReader classes to post-process the text and remove image references.
2. Use the custom readers with SimpleDirectoryReader.

Here's a sample implementation:

from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import DocxReader, PDFReader

class CustomDocxReader(DocxReader):
    def load_data(self, file, extra_info=None):
        documents = super().load_data(file, extra_info)
        for doc in documents:
            doc.text = self._preprocess_text(doc.text)
        return documents

    def _preprocess_text(self, text):
        # "image_placeholder" is a stand-in: replace it with whatever marker
        # your files actually leave in the extracted text
        return text.replace("image_placeholder", "")

class CustomPDFReader(PDFReader):
    def load_data(self, file, extra_info=None):
        documents = super().load_data(file, extra_info)
        for doc in documents:
            doc.text = self._preprocess_text(doc.text)
        return documents

    def _preprocess_text(self, text):
        # "image_placeholder" is a stand-in: replace it with whatever marker
        # your files actually leave in the extracted text
        return text.replace("image_placeholder", "")

# Use SimpleDirectoryReader with custom readers
file_extractor = {
    ".docx": CustomDocxReader(),
    ".pdf": CustomPDFReader()
}

documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Print the text of the first document
print(documents[0].text)

In this example, CustomDocxReader and CustomPDFReader post-process the extracted text before returning the Document objects. The "image_placeholder" string is only a stand-in, so customize the _preprocess_text method to match whatever your files actually contain ^[1]^[2].
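
As a rough illustration of what a customized _preprocess_text could do, here is a minimal sketch that strips Markdown-style and HTML-style image references using only the standard library. The helper name strip_image_markers and both marker patterns are assumptions made for illustration; the readers may leave different markers in your text, or none at all, so inspect your own output first.

```python
import re

# Hypothetical marker formats (assumptions, not something the readers are
# known to emit): Markdown image syntax and HTML <img> tags.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def strip_image_markers(text: str) -> str:
    """Remove Markdown- and HTML-style image references from extracted text."""
    text = MARKDOWN_IMAGE.sub("", text)
    text = HTML_IMAGE.sub("", text)
    # Collapse runs of spaces/tabs left behind by the removals.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


sample = 'Intro ![chart](img/chart.png) text <img src="a.png"> end'
cleaned = strip_image_markers(sample)
print(cleaned)
```

To plug this in, the _preprocess_text methods above could return strip_image_markers(text) instead of calling text.replace.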

run-llama / llama_index

[Question]: How to deal with images in doc or pdf? Is there a way to ignore these images? #15656
