Problems with getText() on PDF documents with UTF16BE encoding

PHP Version: 8.2
PDFParser Version: 2.11.0

Description:

PDF input

The file is attached to a pdftotext (poppler) bug report: https://gitlab.freedesktop.org/poppler/poppler/-/issues/332

2004.pdf

Expected output & actual output

getText() returns mostly UTF-16 encoded text, but the parser seems to have added some non-UTF-16 characters. Besides that, is there any way to determine which encoding is used? Or could the parser convert the output to UTF-8?

Code

$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($infile);
$t = $pdf->getText();
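As a possible stopgap, the string returned by getText() could be post-processed along these lines. This is only a sketch. The helper names looksLikeUtf16be() and utf16beToUtf8() are hypothetical and not part of smalot/pdfparser. It assumes the mbstring extension is loaded, and it guesses the encoding with a simple heuristic: a byte order mark, or mostly NUL high bytes, which only holds for largely Latin text. It would not repair output where stray non-UTF-16 bytes are mixed in, which is the actual problem reported here.

```php
<?php
// Hypothetical helpers, not part of smalot/pdfparser.

// Heuristic guess: does this byte string look like UTF-16BE?
function looksLikeUtf16be(string $bytes): bool
{
    // A UTF-16BE byte order mark (0xFE 0xFF) is a strong hint.
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return true;
    }
    $pairs = intdiv(strlen($bytes), 2);
    if ($pairs === 0) {
        return false;
    }
    // For mostly Latin text, the high byte of each 16-bit unit is NUL.
    $nulHighBytes = 0;
    for ($i = 0; $i < $pairs * 2; $i += 2) {
        if ($bytes[$i] === "\x00") {
            $nulHighBytes++;
        }
    }
    // Assumption: more than half NUL high bytes means UTF-16BE.
    return $nulHighBytes * 2 > $pairs;
}

// Convert UTF-16BE bytes to UTF-8, dropping a leading byte order mark.
function utf16beToUtf8(string $bytes): string
{
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        $bytes = substr($bytes, 2);
    }
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16BE');
}
```

With the snippet above, this would be used as `$t = looksLikeUtf16be($t) ? utf16beToUtf8($t) : $t;`.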

smalot / pdfparser

Problems with getText() on PDF documents with UTF16BE encoding #734
