smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Can't parse PDF files generated by FPDF 1.86; works fine with FPDF 1.81 #703

Open Saulight73 opened 7 months ago

Saulight73 commented 7 months ago

The error in our logs occurs when we parse the page data. We are using a PDF generated by the latest version of FPDF, 1.86; the last version for which this error did not occur is 1.81. If possible, we would like to get an idea of what could be causing this error:

Undefined array key 0 in /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.5.0/src/Smalot/PdfParser/Page.php on line 284.

The error persists even with version 2.9.0 of your parser. Our PHP parsing code is below:

    private static function getXandYofPDFText(string $stringtosearch, string $pdfLink, int $documentID){
        if (!is_string($stringtosearch) || empty($stringtosearch)) {
            throw new Exception(ErrorCodesHelper::get("INVALID_PARAMETERS",["stringtosearch"]));
        }

        if (!is_string($pdfLink) || empty($pdfLink)) {
            throw new Exception(ErrorCodesHelper::get("INVALID_PARAMETERS",["pdfLink"]));
        }

        if (!is_int($documentID)) {
            throw new Exception(ErrorCodesHelper::get("INVALID_PARAMETERS",["documentID"]));
        }

        if ($documentID <= 0) {
            throw new Exception(ErrorCodesHelper::get("INVALID_PARAMETERS",["documentID"]));
        }

        $parser = new \Smalot\PdfParser\Parser();

        $globalArray = array();
        $pdf = $parser->parseContent( @file_get_contents( $pdfLink ) );

        if( $pdf === null )
        {
            throw new Exception(ErrorCodesHelper::get("DOCUSIGN_API_CALL_ERROR",["Impossible de parser le document suivant : ".$pdfLink]));
        }

        $compteurpage = 1;
        $pages = $pdf->getPages();

        if( $pages === null )
        {
            throw new Exception(ErrorCodesHelper::get("DOCUSIGN_API_CALL_ERROR",["Impossible de parser les pages du document suivant : ".$pdfLink]));
        }

        foreach( $pages as $pagenumber )
        {
            // print_r($pagenumber);

            /**
             * Retrieve the text and its associated information (anchors, texts, coordinates of the start of the line measured from the bottom-left corner, etc.)
             */
            $dataTm = $pagenumber->getDataTm(); 

            if( $dataTm == null )
            {
                throw new Exception(ErrorCodesHelper::get("DOCUSIGN_API_CALL_ERROR",["Impossible de parser la data des pages pour le document suivant : ".$pdfLink]));
            }

            $compteurindex = 0;
            foreach( $dataTm as $a )
            {
                if ( str_contains( $a[ 1 ], $stringtosearch ) ) 
                {
                    /**
                     * Retrieve the X and Y coordinates, the page number, the signer's order number and the document's order number.
                     */
                    $line = $dataTm[ (string)$compteurindex ];
                    $x = (int)$line[ 0 ][ 4 ];
                    $y = 859 - (int)$line[ 0 ][ 5 ];
                    $array = [ $x, $y, $compteurpage, $documentID ];

                    @array_push( $globalArray, $array );
                }
                $compteurindex++;
            }
            $compteurpage++;

        }

        return $globalArray;

    }
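
For readers following along, the matching loop above can be isolated from the parser as a small pure function. This is only a sketch under assumptions: `findTextAnchors` is a hypothetical helper, not part of pdfparser, and it assumes each `getDataTm()` row has the shape `[[a, b, c, d, x, y], text]` that the code above relies on. The page height of 859 is taken from the code above.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of pdfparser): scans one page's
 * getDataTm()-style rows for a search string and returns
 * [x, y, pageNumber, documentId] tuples.
 * Assumes each row looks like [[a, b, c, d, x, y], text].
 */
function findTextAnchors(
    array $dataTm,
    string $needle,
    int $pageNumber,
    int $documentId,
    int $pageHeight = 859
): array {
    $matches = [];
    foreach ($dataTm as $row) {
        // Skip rows that do not have the expected shape instead of failing on them.
        if (!isset($row[0][4], $row[0][5], $row[1]) || !is_string($row[1])) {
            continue;
        }
        if (str_contains($row[1], $needle)) {
            $x = (int) $row[0][4];
            // Flip the Y axis: the coordinates are measured from the bottom-left corner.
            $y = $pageHeight - (int) $row[0][5];
            $matches[] = [$x, $y, $pageNumber, $documentId];
        }
    }

    return $matches;
}
```

Because the function only takes plain arrays, it can be exercised with hand-written rows while the parsing problem itself is still open. It requires PHP 8 for `str_contains`.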

Thank you for providing us with prompt assistance for our production solution.

Best regards,

GLENAT Group

Saulight73 commented 5 months ago

@k00ni any news about this issue? We still can't parse these PDFs with version 2.10. It is the same error:

[Tue May 28 10:46:34.576806 2024] [proxy_fcgi:error] [pid 3767717] [client 10.1.21.27:53967] AH01071: Got error 'PHP message: PHP Warning: Undefined array key 0 in /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php on line 279; PHP message: PHP Fatal error: Uncaught TypeError: Smalot\PdfParser\Page::getPDFObjectForFpdf(): Return value must be of type Smalot\PdfParser\PDFObject, null returned in /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php:279\nStack trace:\n#0 /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php(399): Smalot\PdfParser\Page->getPDFObjectForFpdf()\n#1 /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php(424): Smalot\PdfParser\Page->extractRawData()\n#2 /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php(504): Smalot\PdfParser\Page->extractDecodedRawData()\n#3 /var/www/clients/client1/web10/web/application/library/php/pdfparser-2.10.0/src/Smalot/PdfParser/Page.php(655): Smalot\PdfParser\Page->getDataCommands()\n#4 /var/www/clients/client1/web10/web/application/library/php/Glenat/App/DocuSignApp....', referer: http://core-test.glenat.com/
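
The log above shows an uncaught `TypeError`. In PHP, `TypeError` extends `\Error`, not `\Exception`, so a `catch (\Exception $e)` block would not intercept it; catching `\Throwable` does. The sketch below illustrates that with stand-ins only: `extractOrNull` and `brokenExtraction` are hypothetical names, and this is a possible stop-gap on the caller's side, not a fix for the parser.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper (not part of pdfparser): runs an extraction callback
 * and returns null when it throws anything, including TypeError.
 */
function extractOrNull(callable $extract): ?array
{
    try {
        return $extract();
    } catch (\Throwable $e) {
        // TypeError extends \Error, so "catch (\Exception $e)" would miss it.
        // A real caller would probably log $e->getMessage() here.
        return null;
    }
}

/**
 * Stand-in for a method that declares a non-nullable return type
 * but ends up with null, as in the log above.
 */
function brokenExtraction(): array
{
    $objects = [];

    return $objects[0] ?? null; // raises TypeError at run time
}
```

In the code from the first comment, the call could then look like `$dataTm = extractOrNull(fn () => $pagenumber->getDataTm());`, so that a page which cannot be parsed is reported instead of ending the request with a fatal error.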

k00ni commented 5 months ago

Please upload a PDF here which causes this problem.

Saulight73 commented 5 months ago

PDF Problem.pdf

Here it is.

k00ni commented 5 months ago

I tried your PDF, but PDFParser reported a different error:

    PHPUnitTests\Integration\ParserTest::testIssue703
    Exception: Invalid object reference for $obj.
    /var/www/html/src/Smalot/PdfParser/RawData/RawDataParser.php:536
    /var/www/html/src/Smalot/PdfParser/RawData/RawDataParser.php:242
    /var/www/html/src/Smalot/PdfParser/RawData/RawDataParser.php:918
    /var/www/html/src/Smalot/PdfParser/RawData/RawDataParser.php:952
    /var/www/html/src/Smalot/PdfParser/Parser.php:103
    /var/www/html/tests/PHPUnit/Integration/ParserTest.php:446
    phpvfscomposer:///var/www/html/dev-tools/vendor/phpunit/phpunit/phpunit:106

Here is my test code (separate branch issue/703):

https://github.com/smalot/pdfparser/blob/issue/703/tests/PHPUnit/Integration/ParserTest.php#L442-L464

I may have made a mistake somewhere, can you have a look please?

Saulight73 commented 5 months ago

    $localPdfPath = './tpm.pdf';
    file_put_contents($localPdfPath, file_get_contents($pdfLink));
    $pdf = $parser->parseFile($localPdfPath);

Here is our code. It is the same as yours, so I don't know why you get a different error. Can you try saving the PDF locally to a temporary file, as we do? Maybe that solves the problem.

k00ni commented 5 months ago

I tried it locally and got the error I mentioned. After #719 got merged, we can run CI for side-branches too and will see if the same error occurs.

k00ni commented 5 months ago

#719 got merged.

The error I reported last also shows up in CI for the PDF you provided, for instance here for PHP 7.2: https://github.com/smalot/pdfparser/actions/runs/9386160259/job/25846037622#step:6:37 (or the same error here for PHP 8.3)

    Exception: Invalid object reference for $obj.
    /home/runner/work/pdfparser/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:536
    /home/runner/work/pdfparser/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:242
    /home/runner/work/pdfparser/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:918
    /home/runner/work/pdfparser/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:952
    /home/runner/work/pdfparser/pdfparser/src/Smalot/PdfParser/Parser.php:103
    /home/runner/work/pdfparser/pdfparser/tests/PHPUnit/Integration/ParserTest.php:453

Also, which PHP version do you use?

Btw. might be the same error as in #714

Saulight73 commented 5 months ago

We use PHP 8.2. Before we try PHP 8.3, I can give you some other PDFs, if you want, generated with other parameters and versions:

test3_fpdfSeul.pdf test4_iso.pdf test5_serveurapache.pdf test6_serveurapacheIso.pdf test7_serveurapacheIsoV1.pdf test9_V2-1.pdf test9_V2-2.pdf test9_V2-3.pdf test10-V2-bloc-fin.pdf test11-v2-blocFin.pdf