Closed terrafrost closed 5 months ago
Please try again with v2.8.0-RC2 and get back to us. Thanks.
Still an issue.
I can't seem to reproduce this error with the test.pdf given, neither in 2.7.0 nor in 2.8.0-RC2. The test.pdf had no "dots" in the field, so I also added two periods "..", saved the file and tried it again in both versions. The dots were not extracted by `getText()` but there was no error.
Maybe you could quickly sanity-check this, @k00ni? Or could the OP post a different test file?
Never mind, I see the error now. I wasn't using `getObjects()` properly. Looking at it now.
OK. This should be a quick fix. The test.pdf document doesn't contain an encoding value, so a default must be assumed.
According to the PDF Reference 1.7, the default encoding should be 'StandardEncoding' (strictly, that is the implicit base encoding for a non-embedded, nonsymbolic font). PdfParser currently does not supply any default, so when it queries for the BaseEncoding, it gets back an empty string.
Should be as simple as inserting a check in `Encoding->getEncodingClass()`:
```php
/**
 * @throws EncodingNotFoundException
 */
protected function getEncodingClass(): string
{
    // Load reference table charset.
    $baseEncoding = preg_replace('/[^A-Z0-9]/is', '', $this->get('BaseEncoding')->getContent());

    // Check for empty BaseEncoding and fall back to the default.
    if ('' == $baseEncoding) {
        $baseEncoding = 'StandardEncoding';
    }

    $className = '\\Smalot\\PdfParser\\Encoding\\'.$baseEncoding;

    if (!class_exists($className)) {
        throw new EncodingNotFoundException('Missing encoding data for: "'.$baseEncoding.'".');
    }

    return $className;
}
```
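For anyone who wants to check the fallback in isolation, here is a minimal standalone sketch. The function name `resolveBaseEncoding` and its `$default` parameter are hypothetical and exist only for this example; they are not part of PdfParser. It reuses the normalization regex from the snippet above and substitutes 'StandardEncoding' when nothing is left.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of PdfParser): normalize a raw BaseEncoding
 * value and fall back to a default when the result is empty.
 */
function resolveBaseEncoding(string $raw, string $default = 'StandardEncoding'): string
{
    // Strip everything except ASCII letters and digits, as the library snippet does.
    $name = (string) preg_replace('/[^A-Z0-9]/is', '', $raw);

    // A missing BaseEncoding arrives here as an empty string.
    return '' === $name ? $default : $name;
}

echo resolveBaseEncoding(''), PHP_EOL;                 // StandardEncoding
echo resolveBaseEncoding('/WinAnsiEncoding'), PHP_EOL; // WinAnsiEncoding
```

Without the fallback, the empty name would be appended to the namespace prefix, `class_exists()` would fail on it, and the exception from this report would be thrown.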
Are we able to use your document in the PdfParser test suite, @terrafrost?
> Are we able to use your document in the PdfParser test suite, @terrafrost?
Feel free!
Description:
Attempting to parse a PDF with a form field with two dots in it causes pdfparser to throw an exception. Adobe Acrobat reads the PDF just fine, as does `qpdf test.pdf --qdf test.qdf`.
PDF input
test.pdf
Expected output & actual output
The expected output is for it not to throw an exception. The actual output is an exception:
Code