getText() may return Non-ASCII/UTF-8 characters

mjclemente / pdfbox.cfc

Utilize the PDFBox Java library to manipulate PDFs with CFML

MIT License


getText() may return Non-ASCII/UTF-8 characters #7

Open JamoCA opened 1 year ago

JamoCA commented 1 year ago

PDFBox wasn't correctly parsing some italic yellow-on-brown text (not my PDF) and returned INFORMATION SEPARATOR ONE (U+001F) in place of "Th" and INFORMATION SEPARATOR TWO (U+001E) in place of "fi".

I normally use a Java junidecode library to convert UTF-8 text to 7-bit ASCII, but it didn't work with these characters. I'd rather not have odd control characters in the text, so I used the following regex to strip everything outside the printable ASCII range. (I figured the text was already wrong and it'd be better to omit these characters than to retain them.)

text = rereplace(text, "[^\x20-\x7E]", "", "all");

Have you come across this issue before?
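For reference, a narrower cleanup than the regex above can be sketched in Java, the language PDFBox itself is written in. The class and method names here (ExtractedTextCleaner, clean) are hypothetical and not part of pdfbox.cfc or PDFBox. The sketch assumes two things: that line breaks and accented letters should survive, and that some extractors return real ligature code points such as U+FB01, which NFKC normalization folds back into plain letters. It cannot recover text that was already replaced by U+001F or U+001E; like the regex, it only drops those characters.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper, not part of pdfbox.cfc or PDFBox.
final class ExtractedTextCleaner {
    // ASCII control characters (U+0000-U+001F, U+007F), except tab, LF, and CR.
    private static final Pattern CONTROL =
        Pattern.compile("[\\p{Cntrl}&&[^\\t\\n\\r]]");

    static String clean(String text) {
        // NFKC replaces compatibility characters, e.g. the U+FB01 "fi" ligature, with plain letters.
        String normalized = Normalizer.normalize(text, Normalizer.Form.NFKC);
        // Drop the remaining control characters, including U+001E and U+001F.
        return CONTROL.matcher(normalized).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean("De\uFB01nitive Expert"));
    }
}
```

Unlike stripping everything outside \x20-\x7E, this keeps newlines, tabs, and legitimate non-ASCII letters in the extracted text.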

JamoCA commented 1 year ago

I'm comparing the PDFBox 2.0.27 results against third-party services to see what they are capable of.

The Minion Pro text in the PDF (viewed in Foxit PDF Reader) appears in italics as: The Definitive Expert in Carmel ... but selecting and copying it returns the following when pasted into VSCode: e De native Expert in Carmel

NOTE: It's possible that these characters are font-specific ligatures.
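One way to check what was actually returned is to list every code point in the extracted text that is neither printable ASCII nor ordinary whitespace. The following is a minimal Java sketch under that assumption; the class and method names (CodePointReport, suspicious) are hypothetical and do not come from PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of pdfbox.cfc or PDFBox.
final class CodePointReport {
    // Returns "U+XXXX" labels for code points outside printable ASCII, ignoring tab, LF, and CR.
    static List<String> suspicious(String text) {
        List<String> found = new ArrayList<>();
        text.codePoints().forEach(cp -> {
            boolean printableAscii = cp >= 0x20 && cp <= 0x7E;
            boolean whitespace = cp == '\t' || cp == '\n' || cp == '\r';
            if (!printableAscii && !whitespace) {
                found.add(String.format("U+%04X", cp));
            }
        });
        return found;
    }

    public static void main(String[] args) {
        System.out.println(suspicious("\u001fe De\u001enitive Expert in Carmel"));
    }
}
```

Control characters in the output would suggest the font's ligature glyphs lack a usable Unicode mapping, while code points such as U+FB01 would point to ordinary ligature characters that can be normalized.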

PDF2go correctly identified the text (using an OCR method) without munging any characters: The Definitive Expert in Carmel

PDFCandy returned odd spacing: Th e Defi native Expert in Carmel

PDFForge worked: The Definitive Expert in Carmel

PDFtk has multiple options, but it also failed: e De native Expert in Carmel

mjclemente commented 1 year ago

Interesting - I haven't encountered this before.

I'm planning on upgrading the jar to 2.0.28 soon - wondering if that will make any difference.

If it doesn't, maybe it makes sense to include an option in getText() to strip high ASCII characters.

mjclemente commented 9 months ago

I should have asked this earlier: do you have an example PDF with the issue for testing?

JamoCA commented 9 months ago

I visited your blog and sent a private email with the link to the PDF that initially encountered these issues.