smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

incorrect parsing tt and ti #663

Open mitchgthb opened 9 months ago

mitchgthb commented 9 months ago

It seems like the parser has trouble reading "tt" and "ti" when they appear in the middle of words. I get a question-mark symbol in their place. What can I do?

k00ni commented 8 months ago

Which PHP version do you use?

Also, try again with PDFParser v2.8.0-RC2. If you could provide the PDF that is causing the problem, or example code (with the faulty parameters) instead, that would be helpful.

mitchgthb commented 8 months ago

I'm using PHP 8.1. I tried the parser version you mentioned as well, but it's not working. I will provide the code and the PDF.


Code:

use Smalot\PdfParser\Parser;

$file = './taken_sprint5.pdf';

$parser = new Parser();
$pdf = $parser->parseFile($file);

$text = $pdf->getText();
echo $text;


taken_sprint5.pdf

GreyWyvern commented 8 months ago

Related to (or duplicate of) #646. There is no Unicode code point for a 'ti' ligature (and maybe not for 'tt' either?), so Adobe is using some custom encoding to provide them. Probably an Identity-H issue.
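
To narrow down which words are affected, a small diagnostic like the one below may help. It is only a sketch: `findUnmappedWords` is a hypothetical helper, not part of PdfParser, and it assumes the "question mark" symbol in the output is U+FFFD (the Unicode replacement character), which has not been confirmed in this thread. It requires PHP 8.0 or later for `str_contains`.

```php
<?php
// Hypothetical helper (not part of PdfParser): returns the distinct words in
// $text that contain U+FFFD, the Unicode replacement character.
// Assumption: unmapped ligature glyphs show up as U+FFFD in the extracted text.
function findUnmappedWords(string $text): array
{
    // Split on whitespace; the /u flag treats the input as UTF-8.
    $words = preg_split('/\s+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // preg_split fails on invalid UTF-8; report nothing in that case.
        return [];
    }

    $hits = [];
    foreach ($words as $word) {
        if (str_contains($word, "\u{FFFD}")) {
            $hits[] = $word;
        }
    }

    // Remove duplicates and reindex so the result is a plain list.
    return array_values(array_unique($hits));
}
```

Passing the result of `$pdf->getText()` from the snippet earlier in the thread to this function would list the words where a glyph could not be mapped, which makes it easier to check whether only the 'ti' and 'tt' ligatures are involved.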