Open JamoCA opened 1 year ago
I'm comparing the PDFBox 2.0.27 results against third-party services to see what they are capable of.
The Minion Pro text on the PDF (using FoxIt PDF Reader) appears in italics as:
The Definitive Expert in Carmel
... but selecting and copying it returns the following when pasted into VSCode:
e De native Expert in Carmel
NOTE: It's possible that these characters are font-specific ligatures.
PDF2go correctly identified the text (using an OCR method) without munging any characters.
The Definitive Expert in Carmel
PDFCandy returned odd spacing:
Th e Defi native Expert in Carmel
PDFForge worked:
The Definitive Expert in Carmel
PDFtk has multiple options, but it also failed:
e De native Expert in Carmel
Interesting - I haven't encountered this before.
I'm planning on upgrading the jar to 2.0.28 soon - wondering if that will make any difference.
If it doesn't, maybe it makes sense to include an option in getText() to strip high ASCII characters.
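As a rough idea of what such an option might do, here is a minimal Java sketch that post-processes a string like the one getText() returns. The class and method names (TextCleanup, stripNonPrintableAscii) are hypothetical and not part of PDFBox. It assumes the goal is to keep only printable ASCII plus tab, CR and LF. Note that U+001E and U+001F sit below 0x20, so a filter aimed only at characters above 0x7F would not catch them, and a filter this broad also drops legitimate accented letters.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: cleans up already-extracted text.
final class TextCleanup {

    // Matches anything that is NOT printable ASCII (0x20-0x7E), tab, CR or LF.
    private static final Pattern NON_PRINTABLE_ASCII =
            Pattern.compile("[^\\x20-\\x7E\\t\\r\\n]");

    // Removes control characters and all non-ASCII characters from the text.
    static String stripNonPrintableAscii(String text) {
        return NON_PRINTABLE_ASCII.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        // Constructed sample: control characters standing in for ligatures.
        String extracted = "\u001fe De\u001enitive Expert in Carmel";
        System.out.println(stripNonPrintableAscii(extracted));
    }
}
```

This only hides the stray characters; the letters the ligatures stood for are still missing from the result.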
I should have asked this earlier: do you have an example PDF with the issue for testing?
I visited your blog and sent a private email with the link to the PDF that initially encountered these issues.
PDFBox wasn't correctly parsing some italic yellow-on-brown text (not my PDF) and returned INFORMATION SEPARATOR ONE u001f (for "Th") and INFORMATION SEPARATOR TWO u001e (for "fi").
I normally use a Java junidecode library to convert UTF-8 to ASCII7, but this wasn't working with these characters. I'd rather not have odd control-type characters in the text, so I used the following regex to strip high ASCII. (I figured the text was already wrong and it'd be better to omit these characters rather than retain them.)
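The regex itself is not shown in this thread, so the following is not necessarily the pattern referred to above. It is a hedged Java sketch of an alternative: instead of only stripping, it first maps the two placeholders back to the letters they stood for in this one PDF (U+001F to "Th", U+001E to "fi"), then removes any remaining C0 control characters. The mapping is an assumption based solely on the sample described here and is font-specific, so it may be wrong for other documents. LigatureRepair and repair are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: restores ligature placeholders seen in one specific PDF.
final class LigatureRepair {

    // Assumed mapping, taken from the single sample discussed in this thread.
    private static final Map<String, String> PLACEHOLDERS = new LinkedHashMap<>();
    static {
        PLACEHOLDERS.put("\u001f", "Th"); // INFORMATION SEPARATOR ONE
        PLACEHOLDERS.put("\u001e", "fi"); // INFORMATION SEPARATOR TWO
    }

    static String repair(String text) {
        String result = text;
        // Replace each known placeholder with the letters it stood for.
        for (Map.Entry<String, String> entry : PLACEHOLDERS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        // Drop remaining C0 control characters and DEL; keep tab, LF and CR.
        return result.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F\\x7F]", "");
    }

    public static void main(String[] args) {
        // Constructed sample, not the literal output from the PDF.
        System.out.println(repair("\u001fe De\u001enitive Expert in Carmel"));
    }
}
```

Doing the replacement before the strip matters: once the control characters are removed, there is nothing left to map back.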
Have you come across this issue before?