Image background can cause text extraction to fail

pymupdf / RAG

RAG (Retrieval-Augmented Generation) Chatbot Examples Using PyMuPDF

https://pymupdf.readthedocs.io/en/latest/pymupdf4llm

GNU Affero General Public License v3.0


Image background can cause text extraction to fail #18

Closed by maxjeblick 5 months ago

maxjeblick commented 6 months ago

Using https://corporate.goodyear.com/content/dam/goodyear-corp/documents/events-presentations/qtr4_2022_goodyear_investor_letter.pdf

import pymupdf4llm

# pages takes 0-based page numbers
md_text = pymupdf4llm.to_markdown("qtr4_2022_goodyear_investor_letter.pdf", pages=[2])
print(md_text)

This prints only -----. I checked that page.get_text() inside the get_page_output function returns the expected text for page 2. However, on that page the text_rects = column_boxes(... call returns an empty list. This is not the only affected page: text extraction fails for several pages of this document.

This is on version 0.0.3.

maxjeblick commented 6 months ago

I noticed https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/multi_column.py#L249 (ignore text written upon images) causes all blocks to be removed. I commented this functionality out, but extraction still fails.
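To illustrate why a step like this can remove every block on a page, here is a minimal, self-contained sketch. It assumes the filter drops any text box that lies fully inside an image box; the actual check in multi_column.py may differ. The `Rect` class and `drop_text_on_images` function are hypothetical names used only for this illustration and are not part of pymupdf4llm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle (hypothetical stand-in for a PDF bounding box)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "Rect") -> bool:
        # True if `other` lies completely inside this rectangle.
        return (
            self.x0 <= other.x0
            and self.y0 <= other.y0
            and self.x1 >= other.x1
            and self.y1 >= other.y1
        )


def drop_text_on_images(text_boxes, image_boxes):
    """Keep only text boxes that are not fully covered by any image box."""
    return [
        box for box in text_boxes
        if not any(img.contains(box) for img in image_boxes)
    ]


page = Rect(0, 0, 612, 792)  # US Letter page size in points
text_boxes = [Rect(72, 72, 540, 200), Rect(72, 220, 540, 400)]

# A small logo in the corner leaves both text boxes alone.
logo = Rect(500, 20, 590, 60)
print(len(drop_text_on_images(text_boxes, [logo])))  # 2

# A full-page background image covers every text box, so nothing survives.
background = page
print(len(drop_text_on_images(text_boxes, [background])))  # 0
```

With a background image that spans the whole page, every text box counts as "written upon an image", which matches the symptom of an empty list of text rectangles.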

I then removed all images (gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE qtr4_2022_goodyear_investor_letter.pdf) and extraction worked as expected.
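For anyone who wants to script this workaround, the following is a rough sketch that wraps the same Ghostscript call in Python. The function names `build_gs_command` and `strip_images` are made up for this example, and it assumes the `gs` executable is available on the PATH.

```python
import shutil
import subprocess


def build_gs_command(src, dst):
    """Build the Ghostscript call that rewrites `src` to `dst` without images."""
    return [
        "gs",
        "-o", dst,
        "-sDEVICE=pdfwrite",
        "-dFILTERIMAGE",
        src,
    ]


def strip_images(src, dst):
    """Run Ghostscript; raises if `gs` is missing or exits with an error."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript executable 'gs' not found on PATH")
    subprocess.run(build_gs_command(src, dst), check=True)
```

After calling `strip_images("qtr4_2022_goodyear_investor_letter.pdf", "noimage.pdf")`, the resulting noimage.pdf can be passed to pymupdf4llm.to_markdown in the same way as the original file.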

JorjMcKie commented 6 months ago

> I noticed https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/multi_column.py#L249 (ignore text written upon images) causes all blocks to be removed. I commented this functionality out, but extraction still fails.

You were looking at the right place. In general, it often does make sense to ignore text written on top of images, and even more so text on vector graphics, because we will never get meaningful text from labels that comment on or extend things like Gantt charts and bar diagrams.

Ah - just saw that you found a workaround. Yes, I think I need to offer options to ignore images etc.

Thanks for the report anyway!

JorjMcKie commented 5 months ago

Fixed in version 0.0.4.