Attempting to parse a PDF with a form field with two dots in it causes pdfparser to throw an exception

smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.

GNU Lesser General Public License v3.0

Issue #665

Status: Closed (closed by terrafrost 5 months ago)

terrafrost commented 5 months ago

PHP Version: 8.3.1
PDFParser Version: 2.7.0

Description:

Attempting to parse a PDF with a form field with two dots in it causes pdfparser to throw an exception. Adobe Acrobat reads the PDF just fine, as does `qpdf test.pdf --qdf test.qdf`.

PDF input

test.pdf

Expected output & actual output

The expected output is for it not to throw an exception. The actual output is an exception:

Fatal error: Uncaught Smalot\PdfParser\Exception\EncodingNotFoundException: Missing encoding data for: ""

Code

    <?php
    require __DIR__ . '/vendor/autoload.php';

    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile('test.pdf');

    $objects = $pdf->getObjects();
    foreach ($objects as $obj) {
        print_r($obj->getDetails());
    }

k00ni commented 5 months ago

Please try again with v2.8.0-RC2 and get back to us. Thanks.

terrafrost commented 5 months ago

Still an issue.

GreyWyvern commented 5 months ago

I can't seem to reproduce this error with the test.pdf given, in either 2.7.0 or 2.8.0-RC2. The test.pdf had no "dots" in the field, so I also added two periods "..", saved the file and tried it again in both versions. The dots were not extracted by getText(), but there was no error.

~~Maybe you could quickly sanity-check this @k00ni ? Or the OP can post a different test file?~~

Never mind, I see the error now. I wasn't using getObjects() properly. Looking at it now.

GreyWyvern commented 5 months ago

OK. This should be a quick fix. The test.pdf document doesn't contain an encoding value, so a default must be assumed.

According to the PDF Reference 1.7, the default encoding should be 'StandardEncoding'. PdfParser currently does not supply any default, so when it queries for the BaseEncoding, it returns an empty string.

Should be as simple as inserting a check in Encoding->getEncodingClass():

    /**
     * @throws EncodingNotFoundException
     */
    protected function getEncodingClass(): string
    {
        // Load reference table charset.
        $baseEncoding = preg_replace('/[^A-Z0-9]/is', '', $this->get('BaseEncoding')->getContent());

        // Check for empty BaseEncoding
        if ('' == $baseEncoding) {
            $baseEncoding = 'StandardEncoding';
        }

        $className = '\\Smalot\\PdfParser\\Encoding\\'.$baseEncoding;

        if (!class_exists($className)) {
            throw new EncodingNotFoundException('Missing encoding data for: "'.$baseEncoding.'".');
        }

        return $className;
    }
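
For anyone who wants to try the fallback logic in isolation, without PdfParser installed, here is a minimal standalone sketch. Note that `resolveBaseEncoding()` is a hypothetical helper made up for illustration; it is not part of the library and only mirrors the normalise-then-default step from the snippet above. It assumes the raw value may be empty or missing entirely.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (NOT part of PdfParser): normalise a raw BaseEncoding
 * value and fall back to 'StandardEncoding' when nothing usable remains.
 */
function resolveBaseEncoding(?string $raw): string
{
    // Same idea as the filter above: keep ASCII letters and digits only.
    $name = preg_replace('/[^A-Z0-9]/i', '', $raw ?? '');

    // preg_replace() returns null on a regex error; treat that as empty too.
    if (null === $name || '' === $name) {
        return 'StandardEncoding';
    }

    return $name;
}

// Quick demonstration with an empty value, a PDF-style name, and a missing value.
foreach (['', '/WinAnsiEncoding', null] as $raw) {
    echo var_export($raw, true), ' => ', resolveBaseEncoding($raw), PHP_EOL;
}
```

With this shape, an empty or missing value resolves to `StandardEncoding`, so the later `class_exists()` check would only fail for a genuinely unknown encoding name.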

Are we able to use your document in the PdfParser test suite, @terrafrost ?

terrafrost commented 5 months ago

> Are we able to use your document in the PdfParser test suite, @terrafrost ?

Feel free!