Garbled extraction for Amazon Sustainability Report

pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.

https://pymupdf.readthedocs.io

GNU Affero General Public License v3.0

4.49k stars, 443 forks

Garbled extraction for Amazon Sustainability Report #3594

gtmtech closed this issue 5 days ago

gtmtech commented 1 week ago

Description of the bug

Using PyMuPDF or PyMuPDF4LLM to extract the Amazon Sustainability Report gives largely incomprehensible output: the beginnings of words are garbled and the spacing is wrong. Something about this PDF is not being parsed properly.

I have found other extractors do extract this report correctly.

How to reproduce the bug

import fitz  # legacy import name for PyMuPDF

d = fitz.open("file.pdf")
# Print the plain-text extraction of every page
for p in d.pages():
    print(p.get_text())

07ef2453.pdf

PyMuPDF version

1.24.2

Operating system

MacOS

Python version

3.11

gtmtech commented 1 week ago

Here is an example of some of the extraction output, showing the problem.

JorjMcKie commented 1 week ago

page0.pdf page0-1.pdf

JorjMcKie commented 1 week ago

The Amazon font(s) contain non-standard encodings. In such cases, the success of text extraction is subject to good luck. Some extractors may be better than others in guessing what the right Unicode for a given glyph may be.

The weird characters stem from the default flag bit TEXT_CID_FOR_UNKNOWN_UNICODE. When it is on, MuPDF interprets the extracted glyph number as the Unicode of a character. This often helps in these problem cases, but not always, as seen here. You can switch it off, either together with all other flag bits via flags=0, or selectively via e.g. flags=pymupdf.TEXTFLAGS_TEXT & ~pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE. Any non-existing Unicode will then be reported as "�".
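
As a rough illustration of the selective option described above, the following sketch extracts each page with TEXT_CID_FOR_UNKNOWN_UNICODE cleared and then measures how much of the result is the "�" replacement character. The helper names (extract_without_cid_fallback, unmapped_ratio) are made up for this example and are not part of PyMuPDF. It also assumes a PyMuPDF version that supports `import pymupdf`; on older installs, `import fitz as pymupdf` may be needed instead.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the "�" reported for glyphs without a known Unicode


def extract_without_cid_fallback(path):
    """Return a list of per-page text with TEXT_CID_FOR_UNKNOWN_UNICODE off.

    Hypothetical helper: only the flag expression comes from the comment above.
    """
    # Imported lazily so unmapped_ratio() stays usable without PyMuPDF installed.
    import pymupdf

    # Keep the default text flags, but clear the single CID-as-Unicode bit.
    flags = pymupdf.TEXTFLAGS_TEXT & ~pymupdf.TEXT_CID_FOR_UNKNOWN_UNICODE
    with pymupdf.open(path) as doc:
        return [page.get_text("text", flags=flags) for page in doc]


def unmapped_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return visible.count(REPLACEMENT) / len(visible)
```

A high ratio on a page suggests that the font encoding could not be resolved there, so the text from that page should probably be treated as unreliable rather than passed on silently. For example, `[unmapped_ratio(t) for t in extract_without_cid_fallback("file.pdf")]` would give one score per page.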

I have also submitted a question to the MuPDF team for any other advice they may have.

gtmtech commented 1 week ago

Thanks for the feedback @JorjMcKie - if it helps, I found homebrew's pdftohtml with npm's html-to-text gets a successful extraction of this doc.

Hence I believe pdftohtml's decoding of this PDF might contain some interesting alternative logic for this specific case

JorjMcKie commented 1 week ago

As @julian-smith-artifex-com wrote: there is a fix (implemented in our base library MuPDF) underway that will solve this problem in the next release.

julian-smith-artifex-com commented 5 days ago

Fixed in 1.24.6.
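
For anyone wanting to check whether an environment already has the fix, a minimal sketch follows. It assumes that "1.24.6" refers to the PyMuPDF release and that the installed version string is a plain dotted number; the function names are invented for this example.

```python
from importlib.metadata import version

FIXED_IN = (1, 24, 6)  # release named in the closing comment above


def parse_version(text):
    """Turn a plain dotted version such as '1.24.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split(".")[:3])


def has_fix(text):
    """True if the given version string is at or after the fixed release."""
    return parse_version(text) >= FIXED_IN


def installed_has_fix():
    """Check the installed PyMuPDF distribution (raises if it is not installed)."""
    return has_fix(version("PyMuPDF"))
```

The tuple comparison avoids the pitfall of comparing version strings directly, where "1.24.10" would sort before "1.24.6". Pre-release suffixes such as "rc1" are not handled by this sketch.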