Open danielrbrowne opened 4 years ago
Hello @danielrbrowne !
This error almost always comes from a font embedded in the PDF with a custom encoding. The Unicode code point for € is actually used by the embedded font to refer to a glyph which (visually) is the diaeresis ¨, so it is not the right Unicode character code. As a consequence, the extracted character is the correct Unicode character for this code, but not the character corresponding to the referenced glyph.
For instance, running:
lopez@work:~$ pdffonts ~/Downloads/NON.ASCII-Manuscript.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
MFAMCJ+AdvOTee640837.B               Type 1C           Custom           yes yes yes     14  0
MFAMAE+AdvPSMySB                     Type 1C           WinAnsi          yes yes no      33  0
MFAMAF+AdvP4C4E59                    Type 1C           Custom           yes yes yes     10  0
MFAMAG+AdvMyriad-RS                  Type 1C           Custom           yes yes no       8  0
MFAMBG+AdvMyriad-IS                  Type 1C           WinAnsi          yes yes no       9  0
MFAMBH+AdvP80516                     Type 1C           Custom           yes yes yes    135  0
MFAMBI+AdvMyriad-BS                  Type 1C           WinAnsi          yes yes no      13  0
MFAMHJ+AdvULTSO-S                    Type 1C           WinAnsi          yes yes no      11  0
MFAMHI+AdvULTS-S                     Type 1C           WinAnsi          yes yes no      12  0
MFAPDI+AdvPSMySBI                    Type 1C           WinAnsi          yes yes no      69  0
MFBEDI+AdvP4C4E74                    Type 1C           Custom           yes yes no      95  0
The title text uses the AdvPSMySB font, which has a WinAnsi encoding, so the Unicode code points match the glyphs and everything works as it always should. But the diaeresis character uses the font AdvP4C4E59 with a custom encoding, so, bad luck, the code is not the right Unicode value: it is only there to reference an embedded glyph...
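To spot the fonts that deserve suspicion in a listing like the one above, the pdffonts output can be scanned for the Custom encoding. The following is only a minimal sketch: the helper names `parse_pdffonts` and `custom_encoded` are hypothetical, the sample is a three-row excerpt of the listing above, and the parser assumes that font names and encoding names contain no spaces.

```python
SAMPLE = """\
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
MFAMAE+AdvPSMySB                     Type 1C           WinAnsi          yes yes no      33  0
MFAMAF+AdvP4C4E59                    Type 1C           Custom           yes yes yes     10  0
MFAMBI+AdvMyriad-BS                  Type 1C           WinAnsi          yes yes no      13  0
"""


def parse_pdffonts(text):
    """Parse a pdffonts table into a list of dicts (hypothetical helper).

    Assumption: font names and encoding names contain no spaces, so the
    last six tokens of a row are always encoding, emb, sub, uni and the
    two numbers of the object ID. The type may span several tokens.
    """
    fonts = []
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the header row and the dashed separator row.
        if len(tokens) < 8 or tokens[0] == "name" or set(line) <= {"-", " "}:
            continue
        fonts.append({
            "name": tokens[0],
            "type": " ".join(tokens[1:-6]),
            "encoding": tokens[-6],
            "embedded": tokens[-5] == "yes",
            "subset": tokens[-4] == "yes",
            "to_unicode": tokens[-3] == "yes",
            "object_id": (int(tokens[-2]), int(tokens[-1])),
        })
    return fonts


def custom_encoded(fonts):
    """Return the names of fonts with a Custom encoding, subset prefix removed."""
    return [f["name"].split("+", 1)[-1] for f in fonts if f["encoding"] == "Custom"]


fonts = parse_pdffonts(SAMPLE)
print(custom_encoded(fonts))
```

A Custom encoding only marks a font as a candidate: as noted in this thread, most fonts with custom encoding still follow Unicode, so this check cannot tell which of them actually produce wrong characters.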
Whether a tool puts the ¨/€ before or after the o is somewhat arbitrary: the coordinates actually place the two characters at the same position (they are combined to render the ö). GROBID follows the actual PDF stream order.
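As an illustration of why the order of the two characters does not matter once the glyph is correctly identified, here is a small sketch using Python's standard `unicodedata` module. The helper `merge_overlapping` and the mapping table are hypothetical and are not part of GROBID or pdfalto. The sketch assumes the extractor already knows that the two characters are drawn at the same position.

```python
import unicodedata

# Spacing marks mapped to their combining counterparts (illustrative, not exhaustive).
SPACING_TO_COMBINING = {
    "\u00a8": "\u0308",  # DIAERESIS -> COMBINING DIAERESIS
}


def merge_overlapping(first, second):
    """Merge two characters drawn at the same position, in either stream order.

    Hypothetical helper: if one of the two is a known spacing mark, attach
    its combining form to the other character and compose with NFC.
    Otherwise return the pair unchanged.
    """
    for mark, base in ((first, second), (second, first)):
        combining = SPACING_TO_COMBINING.get(mark)
        if combining is not None:
            return unicodedata.normalize("NFC", base + combining)
    return first + second


print(merge_overlapping("\u00a8", "o"))  # mark first in the stream
print(merge_overlapping("o", "\u00a8"))  # mark second in the stream
print(merge_overlapping("\u20ac", "o"))  # the case from this issue: nothing to merge
```

The last call shows the limit of this approach: because the font maps the diaeresis glyph to the code of €, the extractor has no way of knowing that a mark is involved, and the pair is left as it is.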
How to solve this issue? It is the worst-case scenario, because 1) the problem is very hard to detect, since most of the time a font with custom encoding still follows Unicode, and 2) even with OCR, we would need to check every glyph of the embedded fonts with custom encoding, which would introduce new errors due to limited OCR accuracy...
In the good "unsolved unicode" cases, we have an invalid Unicode value or something in the free Unicode range (the right way to refer to embedded fonts), and we know when to apply OCR to recover the unsolved encoding. macOS Preview, for instance, is usually able to solve these "good" cases, because it has proprietary fonts and/or some fancy OCR processing in the background to recover that, but it cannot solve this worst-case scenario (apart from Preview, no PDF tool is able to solve even the "good" case, afaik).
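The difference between the "good" cases and this one can be sketched with Python's `unicodedata` module. This is only an illustration: the function `looks_unresolved` is hypothetical, and it assumes that the "free unicode range" corresponds to the Unicode Private Use Areas (general category `Co`), with unassigned code points, surrogates and the replacement character treated as invalid.

```python
import unicodedata

# General categories that signal an unresolved glyph mapping:
# Co = private use, Cn = unassigned, Cs = surrogate.
UNRESOLVED_CATEGORIES = {"Co", "Cn", "Cs"}
REPLACEMENT_CHARACTER = "\ufffd"


def looks_unresolved(ch):
    """Return True if a character is an obvious candidate for OCR recovery.

    Hypothetical helper: it flags only the detectable cases, i.e. code
    points that cannot be the real text of a document.
    """
    return ch == REPLACEMENT_CHARACTER or unicodedata.category(ch) in UNRESOLVED_CATEGORIES


print(looks_unresolved("\ue000"))  # Private Use Area: detectable
print(looks_unresolved("\u20ac"))  # EURO SIGN: a valid character, not detectable
```

The euro sign is a perfectly valid currency symbol, so a check of this kind lets it through, which is why the case in this issue cannot be caught this way.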
Using OCR to recover from encoding issues is still a long-term goal of pdfalto (the tool used to parse PDFs in GROBID). It remains challenging, of course, to maintain high processing speed, but we would like to be at least as good as macOS Preview. However, your issue here is even harder than the usual glyph encoding errors, so it might still be unsolved after we retire :D
I've found a PDF where a title containing an 'ö' gets decoded into '€ o'. This is when I use the 'api/processHeaderDocument' endpoint with no additional headers. Am I missing anything RE: configuration headers for character encoding (or similar) or is this simply a bug (or potentially a specific issue in this PDF, see below)?
See attached for the PDF which generates the following XML excerpt:
where the title as rendered in the PDF is 'P.-O. Löwdin and the International Journal of Quantum Chemistry: A Kaleidoscopic Agenda for Quantum Chemistry'.
N.B: Interestingly, when I copied that title directly from the PDF after opening it in Preview on macOS, I got the string 'P.-O. Lo€wdin and the International Journal of Quantum Chemistry: A Kaleidoscopic Agenda for Quantum Chemistry'. This is obviously different from how Grobid decodes the text (i.e. the '€' character is after the 'o' rather than before it as with Grobid), but it might be related.
NON ASCII-Manuscript.pdf