Closed: maxjeblick closed this issue 5 months ago.
I noticed https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/multi_column.py#L249 (ignore text written upon images) causes all blocks to be removed. I commented this functionality out, but extraction still fails.
I then removed all images (gs -o noimage.pdf -sDEVICE=pdfwrite -dFILTERIMAGE qtr4_2022_goodyear_investor_letter.pdf) and extraction worked as expected.
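For anyone who needs to apply the same workaround to several files, here is a minimal sketch that wraps the Ghostscript call above in Python. The helper names (build_gs_command, strip_images) are made up for illustration; the flags are exactly the ones from the command line above, and it assumes the gs executable is on PATH.

```python
import shutil
import subprocess
from typing import List


def build_gs_command(src: str, dst: str) -> List[str]:
    """Return the Ghostscript argument list that rewrites src without raster images."""
    return [
        "gs",
        "-o", dst,            # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",  # re-emit the document as a PDF
        "-dFILTERIMAGE",      # drop raster images while rewriting
        src,
    ]


def strip_images(src: str, dst: str) -> None:
    """Run Ghostscript to write an image-free copy of src to dst."""
    if shutil.which("gs") is None:
        raise RuntimeError("Ghostscript (gs) was not found on PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(build_gs_command(src, dst), check=True)
```

Calling strip_images("qtr4_2022_goodyear_investor_letter.pdf", "noimage.pdf") should be equivalent to the manual command.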
> I noticed https://github.com/pymupdf/RAG/blob/main/pymupdf4llm/pymupdf4llm/helpers/multi_column.py#L249 (ignore text written upon images) causes all blocks to be removed. I commented this functionality out, but extraction still fails.
You were looking at the right place. In general, it often makes sense to ignore text written on images, and even more so text written on vector graphics, because we will never get meaningful text out of the labels that annotate or extend things like Gantt charts and bar diagrams.
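To illustrate how an "ignore text written on images" check can end up discarding every block, here is a simplified sketch. It is not the actual code in multi_column.py: the rectangle format, the helper names, and the idea that the affected pages carry a page-sized background image are all assumptions made for the example.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical format


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies completely inside outer."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def drop_text_on_images(text_blocks: List[Rect], image_rects: List[Rect]) -> List[Rect]:
    """Keep only the text blocks that are not fully covered by some image."""
    return [
        block
        for block in text_blocks
        if not any(contains(img, block) for img in image_rects)
    ]


page = (0.0, 0.0, 612.0, 792.0)
blocks = [(72.0, 72.0, 540.0, 100.0), (72.0, 120.0, 540.0, 400.0)]

# A small figure does not fully cover either block, so both are kept.
small_image = [(72.0, 110.0, 300.0, 410.0)]
print(len(drop_text_on_images(blocks, small_image)))  # 2

# A page-sized image covers every block, so nothing is left.
print(len(drop_text_on_images(blocks, [page])))  # 0
```

If something like the second case applies, an option to skip the image check would keep the text, which is consistent with extraction working once the images were stripped from the file.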
Ah - just saw that you found a workaround. Yes, I think I need to offer options to ignore images etc.
Thanks for the report anyway!
Fixed in version 0.0.4.
Using https://corporate.goodyear.com/content/dam/goodyear-corp/documents/events-presentations/qtr4_2022_goodyear_investor_letter.pdf, the output is just -----.
I checked that page.get_text() in the get_page_output function returns the expected text from page 2. On that page, the text_rects = column_boxes(... call returns an empty list. Note that there are several pages for which the text extraction fails.
This is on version 0.0.3.