Open cxycxm opened 2 weeks ago
To ignore images in DOC or PDF documents, you can use the html2text library and set the ignore_images attribute to True. Here's an example:
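```python
import html2text

text_maker = html2text.HTML2Text()
text_maker.ignore_links = True   # drop hyperlinks from the output
text_maker.ignore_images = True  # drop <img> tags from the output

# html2text converts HTML, so pass it an HTML string (sample input for illustration).
print(text_maker.handle('<p>Report <img src="chart.png" alt="chart"> summary</p>'))
```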
Setting text_maker.ignore_images = True causes the images in the documents to be ignored during processing [1].
To continue talking to Dosu, mention @dosu.
How to use with SimpleDirectoryReader
To use SimpleDirectoryReader with DOC or PDF files and ignore images, you can customize the file readers so that they clean up the extracted text before returning it. Extend the DocxReader and PDFReader classes to preprocess the text and ignore images, then pass the custom readers to SimpleDirectoryReader. Here's a sample implementation:
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import DocxReader, PDFReader


class CustomDocxReader(DocxReader):
    def load_data(self, file, extra_info=None, **kwargs):
        # Let the stock reader extract the text, then clean each document.
        documents = super().load_data(file, extra_info, **kwargs)
        for doc in documents:
            doc.text = self._preprocess_text(doc.text)
        return documents

    def _preprocess_text(self, text):
        # Implement your logic to remove images or unwanted content.
        # "image_placeholder" is a stand-in for whatever marker your files produce.
        return text.replace("image_placeholder", "")


class CustomPDFReader(PDFReader):
    def load_data(self, file, extra_info=None, **kwargs):
        # Same pattern as above: read first, then clean the extracted text.
        documents = super().load_data(file, extra_info, **kwargs)
        for doc in documents:
            doc.text = self._preprocess_text(doc.text)
        return documents

    def _preprocess_text(self, text):
        # Implement your logic to remove images or unwanted content.
        return text.replace("image_placeholder", "")


# Use SimpleDirectoryReader with custom readers, keyed by file extension
file_extractor = {
    ".docx": CustomDocxReader(),
    ".pdf": CustomPDFReader(),
}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Print the text of the first document
print(documents[0].text)
```
In this example, CustomDocxReader and CustomPDFReader preprocess the text to remove images or other unwanted content before returning the Document objects. You can customize the _preprocess_text method to fit your specific needs [1][2].
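As one possible starting point for that customization, here is a minimal sketch of a cleanup helper that uses only the standard library. The function name strip_image_markers and both patterns are hypothetical: they assume image residue shows up as Markdown-style image references or leftover HTML img tags, which may not match what your files actually produce, so inspect your extracted text first.

```python
import re

# Hypothetical patterns: what image residue looks like depends on the reader
# and on the file, so check your own extracted text before relying on these.
MARKDOWN_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")      # ![alt](path)
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)   # <img src="...">


def strip_image_markers(text: str) -> str:
    """Remove image markers, then tidy the whitespace they leave behind."""
    text = MARKDOWN_IMAGE.sub("", text)
    text = HTML_IMAGE.sub("", text)
    # Collapse runs of spaces or tabs left where a marker was removed.
    text = re.sub(r"[ \t]{2,}", " ", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

Inside either custom reader, _preprocess_text could then simply return strip_image_markers(text) instead of calling text.replace with a fixed placeholder.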
Question
How to deal with images in doc or pdf? Is there a way to ignore these images?