kermitt2 / grobid

A machine learning software for extracting information from scholarly documents
https://grobid.readthedocs.io
Apache License 2.0

Incorrectly-decoded characters when parsing non-ascii characters from source PDF #518

Open danielrbrowne opened 4 years ago

danielrbrowne commented 4 years ago

I've found a PDF where a title containing an 'ö' gets decoded into '€ o'. This happens when I use the 'api/processHeaderDocument' endpoint with no additional headers. Am I missing some configuration (a header for character encoding, or similar), or is this simply a bug (or potentially an issue specific to this PDF; see below)?

See attached for the PDF which generates the following XML excerpt:

<titleStmt>
    <title level="a" type="main">P.-O. L € owdin and the International Journal of Quantum Chemistry: A Kaleidoscopic Agenda for Quantum Chemistry</title>
</titleStmt>

where the title as rendered in the PDF is 'P.-O. Löwdin and the International Journal of Quantum Chemistry: A Kaleidoscopic Agenda for Quantum Chemistry'.

N.B.: Interestingly, when I copy that title directly from the PDF opened in Preview on macOS, I get the string 'P.-O. Lo€wdin and the International Journal of Quantum Chemistry: A Kaleidoscopic Agenda for Quantum Chemistry'. This is obviously different from how Grobid is decoding the text (i.e. the '€' character comes after the 'o' rather than before it as with Grobid), but it might be related. NON ASCII-Manuscript.pdf

kermitt2 commented 4 years ago

Hello @danielrbrowne !

This error almost always comes from an embedded font in the PDF with a custom encoding. The Unicode code point extracted here ('€') is actually used by the embedded font to refer to a glyph which is (visually) the diaeresis '¨', so it is not the right Unicode character code. As a consequence, the extracted character is the correct Unicode character for that code, but not the character corresponding to the referenced glyph.

For instance doing:

lopez@work:~$ pdffonts  ~/Downloads/NON.ASCII-Manuscript.pdf 
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
MFAMCJ+AdvOTee640837.B               Type 1C           Custom           yes yes yes     14  0
MFAMAE+AdvPSMySB                     Type 1C           WinAnsi          yes yes no      33  0
MFAMAF+AdvP4C4E59                    Type 1C           Custom           yes yes yes     10  0
MFAMAG+AdvMyriad-RS                  Type 1C           Custom           yes yes no       8  0
MFAMBG+AdvMyriad-IS                  Type 1C           WinAnsi          yes yes no       9  0
MFAMBH+AdvP80516                     Type 1C           Custom           yes yes yes    135  0
MFAMBI+AdvMyriad-BS                  Type 1C           WinAnsi          yes yes no      13  0
MFAMHJ+AdvULTSO-S                    Type 1C           WinAnsi          yes yes no      11  0
MFAMHI+AdvULTS-S                     Type 1C           WinAnsi          yes yes no      12  0
MFAPDI+AdvPSMySBI                    Type 1C           WinAnsi          yes yes no      69  0
MFBEDI+AdvP4C4E74                    Type 1C           Custom           yes yes no      95  0

The title text uses the AdvPSMySB font, hence a WinAnsi encoding, and the Unicode code points match the glyph codes; everything is fine there, as it should always be. But the diaeresis character uses the AdvP4C4E59 font with a custom encoding, so bad luck: the code is not valid Unicode for that glyph, it is just a reference to an embedded glyph...
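To make the mismatch concrete, here is a small hypothetical sketch (the actual byte value used by this PDF's custom encoding is not shown in the thread; 0x80 is only a plausible guess, chosen because 0x80 maps to '€' in WinAnsi/CP1252). A custom-encoded font can reuse any code for any glyph, so an extractor that trusts the declared code-to-Unicode mapping emits the wrong character:

```python
# Hypothetical illustration: byte 0x80 decodes to '€' under WinAnsi
# (CP1252). A font with a *custom* encoding may reuse that same code
# for a completely different glyph, e.g. a standalone diaeresis '¨';
# an extractor trusting the code-to-Unicode table then emits '€'.
raw = b"L\x80owdin"                     # bytes as they might sit in the PDF stream
print(raw.decode("cp1252"))             # -> L€owdin  (what the extractor sees)

# What the embedded font actually draws for 0x80 (assumed here):
custom_glyphs = {0x80: "\u00a8"}        # 0x80 -> '¨' in the custom font
rendered = "".join(custom_glyphs.get(b, chr(b)) for b in raw)
print(rendered)                         # -> L¨owdin  (what the page shows)
```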

Whether a tool then puts the ¨ before or after the 'o' is somewhat arbitrary: the coordinates actually place the two characters at the same position (they are combined to render the ö). GROBID follows the actual PDF content-stream order.
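For reference, the ordering matters because Unicode's own mechanism for this is a base letter followed by a combining mark. A rough sketch (not GROBID code) of what a correct extraction would yield, versus why the '€' output is unrecoverable:

```python
import unicodedata

# A correctly extracted 'ö' can be either the precomposed U+00F6 or
# the sequence 'o' + COMBINING DIAERESIS (U+0308); NFC normalization
# folds the sequence into the single precomposed character.
decomposed = "Lo\u0308wdin"
print(unicodedata.normalize("NFC", decomposed))   # -> Löwdin

# The broken extraction carries '€' (a valid but unrelated character),
# so no normalization step can recover the intended 'ö':
broken = "L\u20acowdin"
print(unicodedata.normalize("NFC", broken))       # -> L€owdin (unchanged)
```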

How to solve this issue? It is the worst-case scenario, because 1) the problem is very hard to detect, since most of the time a font with custom encoding still follows Unicode, and 2) even with OCR, we would need in this case to check every glyph of the embedded fonts with custom encoding, which would introduce new errors due to OCR accuracy...

In the "good" unresolved-Unicode cases, we get an invalid code point or one in the Private Use Area (the right way to refer to embedded glyphs), so we know when to apply OCR to recover the unresolved encoding. macOS Preview, for instance, is usually able to solve these "good" cases, because it ships proprietary fonts and/or some fancy OCR processing in the background, but it cannot solve this worst-case scenario (apart from Preview, no PDF tool is able to solve even the "good" case, afaik).
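A minimal sketch (not pdfalto's actual logic) of how the "good" cases can be flagged: code points in the Unicode Private Use Areas signal that OCR is needed, whereas this PDF's '€' is a perfectly valid character and raises no flag:

```python
def looks_unresolved(ch: str) -> bool:
    """Heuristic: flag code points in the Unicode Private Use Areas,
    which well-behaved PDFs use to reference embedded glyphs that
    have no standard Unicode mapping."""
    cp = ord(ch)
    return (0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
            or 0xF0000 <= cp <= 0xFFFFD     # Supplementary PUA-A
            or 0x100000 <= cp <= 0x10FFFD)  # Supplementary PUA-B

print(looks_unresolved("\ue001"))  # True: a "good" case, OCR can help
print(looks_unresolved("\u20ac"))  # False: '€' looks legitimate, so
                                   # the worst case slips through
```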

Using OCR to recover from encoding issues is still a long-term goal of pdfalto (the tool used to parse the PDF in Grobid). It of course remains challenging to maintain high processing speed, but we would like to be at least as good as macOS Preview. However, your issue here is even harder than the usual glyph-encoding errors, so it might well remain unsolved after we retire :D