jrmuizel / pdf-extract

A rust library for extracting content from pdfs

Consider supporting ActualText #41

Open badicsalex opened 1 year ago

badicsalex commented 1 year ago

I have several PDFs with some very weird ToUnicode mappings. Some characters get extracted as lowercase instead of uppercase, even though the CID corresponds to the ASCII uppercase version. Unfortunately this breaks later processing steps for these documents.

For example I have the following: https://stickman.hu/junk/actualtext_example.pdf

Here, the line

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 EU rendeletek” szövegrész helyébe

is extracted as

o) 1. mellékletében foglalt táblázat VII. címében az „1034/2011 és 1035/2011 eU rendeletek” szövegrész helyébe
                                                                             ^
                                                                             |
                                                                         Lowercase

Note that several PDF viewers (e.g. the Firefox built-in one) also copy the wrong text. Chrome, Okular, and poppler-based tools in general capitalize the E in EU. pdftotext from the poppler suite also works OK.

Now why is this? For some reason, the CIDs for both E and e are mapped to the same code point, 101 (lowercase e), in the font's ToUnicode mapping.

Why is it handled OK by some extractors? Because this is what the actual operations look like around that part:

op: Operation { operator: "BDC", operands: [/Span, <</ActualText (��^@E)>>] }
op: Operation { operator: "Td", operands: [30.888, 0] }
op: Operation { operator: "Tj", operands: [(E)] }
op: Operation { operator: "EMC", operands: [] }

The ActualText entry here is described in section 14.9.4, "Replacement Text", of the PDF standard, and it has a special code path in poppler: https://github.com/freedesktop/poppler/blob/315ab3006fb24bf47b595343e6a3e90995f2a588/poppler/Gfx.cc#L5052-L5059
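The odd-looking bytes in the ActualText operand above are most likely a UTF-16BE text string with a leading byte-order mark (FE FF 00 45, i.e. "E"), which is one of the encodings the PDF standard allows for text strings. A minimal sketch of decoding such a value follows. `decode_pdf_text_string` is a hypothetical helper, not an existing pdf-extract function, and the fallback branch treats other strings as Latin-1, which only approximates PDFDocEncoding.

```rust
/// Decode a PDF text string (hypothetical helper, not part of pdf-extract).
/// Strings that start with the bytes FE FF are decoded as UTF-16BE.
/// Everything else is treated as Latin-1 here, which is an assumption:
/// real PDFDocEncoding differs from Latin-1 for a few byte values.
fn decode_pdf_text_string(bytes: &[u8]) -> String {
    if bytes.starts_with(&[0xFE, 0xFF]) {
        // Skip the two-byte BOM and read big-endian 16-bit code units.
        // A trailing odd byte, if any, is silently dropped by chunks_exact.
        let units: Vec<u16> = bytes[2..]
            .chunks_exact(2)
            .map(|pair| u16::from_be_bytes([pair[0], pair[1]]))
            .collect();
        String::from_utf16_lossy(&units)
    } else {
        // Map each byte to the Unicode code point with the same value.
        bytes.iter().map(|&b| b as char).collect()
    }
}
```

With this, the four bytes FE FF 00 45 decode to the single uppercase letter the document author intended.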

As far as I can see, handling this case would need some refactoring around show_text, and I'm really not sure how to do it. It would probably take fully separate code paths for the "simple" and the replacement-text use cases, both of which would call output_character in the end.
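To make the two code paths a bit more concrete, here is a rough sketch of the control flow only. The `Op` enum and `extract_text` are simplified stand-ins invented for illustration; they are not the actual operation or output types used by pdf-extract, and the sketch assumes ActualText has already been decoded to a string. While a marked-content sequence with ActualText is open, the replacement text is emitted once and anything shown by Tj inside it is suppressed.

```rust
/// Simplified stand-in for content-stream operations (hypothetical; the real
/// parser types look different).
enum Op {
    /// BDC, carrying the decoded ActualText if the property list had one.
    BeginMarkedContent { actual_text: Option<String> },
    /// Tj, carrying the text as decoded through the font's ToUnicode map.
    ShowText(String),
    /// EMC.
    EndMarkedContent,
}

fn extract_text(ops: &[Op]) -> String {
    let mut out = String::new();
    // One entry per open marked-content sequence; `true` means that
    // sequence supplied replacement text.
    let mut stack: Vec<bool> = Vec::new();
    for op in ops {
        match op {
            Op::BeginMarkedContent { actual_text } => {
                let already_replaced = stack.iter().any(|&r| r);
                match actual_text {
                    // Replacement path: emit ActualText once, up front.
                    Some(text) if !already_replaced => {
                        out.push_str(text);
                        stack.push(true);
                    }
                    // No ActualText, or nested inside a replaced span.
                    _ => stack.push(false),
                }
            }
            Op::ShowText(text) => {
                // Simple path: only emit glyph text outside replaced spans.
                if !stack.iter().any(|&r| r) {
                    out.push_str(text);
                }
            }
            Op::EndMarkedContent => {
                stack.pop();
            }
        }
    }
    out
}
```

In the real library both branches would presumably end in output_character rather than pushing onto a String, and positioning for the replacement text would have to come from the suppressed glyphs, which this sketch ignores.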

P.S. 1: It seems like this guy had a related issue back in the day: https://stackoverflow.com/questions/17737776/pdf-text-extraction-issue-font-capitalization-inconsistencies

P.S. 2: In the end, I might just expose the CID on the output_character interface and do the same workaround I did in python: https://github.com/badicsalex/hun_law_py/blob/master/hun_law/extractors/pdf.py#L88-L93
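For illustration, a CID-based workaround could look roughly like the sketch below. This is not necessarily what the linked Python code does. `fix_case` is a hypothetical function, and it relies on the assumption stated at the top of this issue, namely that in these documents the CID equals the ASCII code of the intended character.

```rust
/// Hypothetical post-processing hook. `cid` is the character code taken from
/// the content stream, `mapped` is what the ToUnicode map returned for it.
/// Assumes the CID coincides with the ASCII code of the intended character,
/// which holds for the documents described here but not for PDFs in general.
fn fix_case(cid: u32, mapped: char) -> char {
    match char::from_u32(cid) {
        // Override only when the CID is an ASCII uppercase letter and the
        // mapping produced exactly its lowercase form.
        Some(c) if c.is_ascii_uppercase() && mapped == c.to_ascii_lowercase() => c,
        // Otherwise trust the ToUnicode result.
        _ => mapped,
    }
}
```

The check is deliberately narrow so that correct mappings, including non-ASCII ones such as é, pass through unchanged.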

P.S. 3: Thanks for taking the time to fix some of the bugs I reported, I really appreciate it.