In fonts such as Calibri, the glyph pair 't' and 'i' is encoded as a single 'ti' ligature when the document is converted to PDF. However, I don't believe Unicode actually has a code point for a 'ti' ligature, and since PdfParser tries to convert all extracted text to UTF-8, the ligature shows up as a missing code point.
Example PDF: What in tarnation.pdf
Considering that copy-pasting the text straight from the PDF also produces an unknown 'ti' ligature glyph, I'm not sure this issue can be fixed within PdfParser. But the fact that the PDF displays the glyph properly suggests that it may be possible? This might be another Identity-H encoding issue.
The bytes encoding the ligature are (I believe): f480869f
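For anyone who wants to check the bytes quoted above, here is a minimal Python sketch (not PdfParser code) that decodes them and reports where the resulting code point sits in Unicode. The fallback table at the end is purely hypothetical: private-use code points have no standard meaning, so mapping this one to "ti" is an assumption that would only hold for this particular font and PDF.

```python
import unicodedata

# Bytes reported in this issue for the extracted 'ti' ligature.
raw = bytes.fromhex("f480869f")

# The sequence is well-formed UTF-8 and decodes to a single character.
char = raw.decode("utf-8")
code_point = ord(char)

print(f"U+{code_point:04X}")       # U+10019F
print(unicodedata.category(char))  # Co (private use)

# Supplementary Private Use Area-B spans U+100000..U+10FFFD.
in_pua_b = 0x100000 <= code_point <= 0x10FFFD
print(in_pua_b)                    # True

# Hypothetical post-processing step: replace the private-use character
# with the plain letters it stands for. The mapping is an assumption
# based on this report, not something defined by Unicode or PdfParser.
LIGATURE_FALLBACKS = {0x10019F: "ti"}

extracted = "addi\U0010019fon"
fixed = extracted.translate(LIGATURE_FALLBACKS)
print(fixed)                       # addition
```

If this is right, the bytes are valid UTF-8 but land in a private-use area, which would explain why the text renders correctly in the PDF (the font supplies the glyph) yet pastes as an unknown character everywhere else.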