The getText() output is mostly UTF-16 encoded text, but the parser seems to have added some non-UTF-16 characters.
Besides that, is there any way to determine which encoding is used? Or can the parser convert the output to UTF-8?
Code
$parser = new \Smalot\PdfParser\Parser();
$pdf = $parser->parseFile($infile);
$t = $pdf->getText();
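For reference, a minimal sketch of how the returned string could be inspected and converted on the caller's side, assuming PHP's mbstring extension is available. The helpers `guessEncoding()` and `toUtf8()` are hypothetical names, not part of smalot/pdfparser, and the check relies on a byte order mark being present: UTF-16 text without a BOM can still pass the UTF-8 validity check, so treat the result as a guess.

```php
<?php
// Hypothetical helper (not part of smalot/pdfparser): guess the encoding of
// a string from its byte order mark, falling back to a UTF-8 validity check.
function guessEncoding(string $s): string
{
    if (strncmp($s, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($s, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE';
    }
    if (mb_check_encoding($s, 'UTF-8')) {
        // Caveat: BOM-less UTF-16 made of ASCII characters also lands here.
        return 'UTF-8';
    }
    return 'unknown';
}

// Hypothetical helper: convert to UTF-8 when the guess is UTF-16,
// otherwise return the input unchanged.
function toUtf8(string $s): string
{
    $enc = guessEncoding($s);
    if ($enc === 'UTF-16BE' || $enc === 'UTF-16LE') {
        // Drop the 2-byte BOM before converting.
        return mb_convert_encoding(substr($s, 2), 'UTF-8', $enc);
    }
    return $s;
}
```

With these helpers the call above would become `$t = toUtf8($pdf->getText());`. If the output really mixes UTF-16 with bytes in another encoding, as it appears to here, a single whole-string conversion like this will likely still produce garbled characters.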
Description:
PDF input
The file is attached to a pdftotext (Poppler) bug report: https://gitlab.freedesktop.org/poppler/poppler/-/issues/332
2004.pdf