smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Problems with getText() on PDF documents with UTF16BE encoding #734

Open SeedDMS opened 2 months ago

SeedDMS commented 2 months ago

Description:

PDF input

The file is attached to a pdftotext bug report: https://gitlab.freedesktop.org/poppler/poppler/-/issues/332

2004.pdf

Expected output & actual output

getText() returns mostly UTF-16 encoded text, but the parser seems to have added some non-UTF-16 characters. Besides that, is there any way to determine which encoding is used? Or can the parser convert the text to UTF-8?

Code

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($infile);
$t = $pdf->getText();
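As a rough illustration of the conversion being asked about, here is a minimal sketch of a post-processing step in user code. It assumes the returned string is UTF-16BE and starts with the byte order mark FE FF, which may not hold for this file, and it would not remove the stray non-UTF-16 characters mentioned above. The helper name `utf16beToUtf8IfMarked` is hypothetical and not part of smalot/pdfparser; only PHP's standard `strncmp`, `substr`, and `mb_convert_encoding` (mbstring extension) are used.

```php
<?php
// Hypothetical helper for illustration; not part of smalot/pdfparser.
function utf16beToUtf8IfMarked(string $text): string
{
    // Assumption: UTF-16BE text is marked with the byte order mark FE FF.
    if (strncmp($text, "\xFE\xFF", 2) === 0) {
        // Drop the BOM, then convert (requires the mbstring extension).
        return mb_convert_encoding(substr($text, 2), 'UTF-8', 'UTF-16BE');
    }

    // No BOM found: return the text unchanged rather than guess.
    return $text;
}
```

With the code above, this would be called as `$t = utf16beToUtf8IfMarked($pdf->getText());`. Text without the marker is passed through untouched, so the helper only covers the case where the whole string is marked.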