smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

'ti' ligature not parsed and/or displayed correctly #646

Open GreyWyvern opened 1 year ago

GreyWyvern commented 1 year ago

In fonts such as Calibri, the letter pair 't' and 'i' is encoded as a single 'ti' ligature glyph when the document is converted to PDF. However, I don't believe Unicode actually has a code point for a 'ti' ligature, and since PdfParser tries to convert all extracted text to UTF-8, the ligature shows up as a missing code point.

Example PDF: What in tarnation.pdf

Copying and pasting the text straight from the PDF also produces an unknown 'ti' ligature glyph, so I'm not sure this issue can be fixed within PdfParser. On the other hand, the fact that the PDF displays the glyph properly suggests that it might be. It's possible this is another Identity-H encoding issue.

The bytes encoding the ligature are (I believe): f480869f
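
For what it's worth, here is a quick way to check what those bytes mean, assuming they really are the UTF-8 output for the ligature. This is a Python sketch rather than PdfParser code. Decoded as UTF-8 they give U+10019F, which falls in Supplementary Private Use Area-B, so no standard character is assigned to it. The `FALLBACKS` table and `apply_fallbacks` helper are hypothetical names for a post-processing workaround; the mapping to "ti" is my assumption for this one document and font, not something Unicode or PdfParser defines.

```python
import unicodedata

# Bytes reported above, written as hex.
LIGATURE_HEX = "f480869f"


def describe(hex_bytes):
    """Decode UTF-8 hex bytes holding a single character.

    Returns the code point label and its Unicode general category.
    """
    ch = bytes.fromhex(hex_bytes).decode("utf-8")
    return f"U+{ord(ch):04X}", unicodedata.category(ch)


# Hypothetical per-document fallback table (an assumption, not a standard
# mapping): private-use code points mean whatever the font says they mean.
FALLBACKS = {"\U0010019F": "ti"}


def apply_fallbacks(text, table=FALLBACKS):
    """Replace known private-use code points in extracted text."""
    return text.translate(str.maketrans(table))


print(describe(LIGATURE_HEX))  # category "Co" means private use
print(apply_fallbacks("What in tarna\U0010019Fon"))
```

A table like this only helps when you already know which glyph a private-use code point stands for in a given font, so it is a workaround for extracted text rather than a general fix.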