smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Fatal Error when parsing some PDFs #655

Open soupmagnet opened 7 months ago

soupmagnet commented 7 months ago

Description:

I very recently started getting the following fatal error when trying to parse some PDF files:

PHP Fatal error: Uncaught Exception: Invalid object reference for $obj. in ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php:529
Stack trace:
#0 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(240): Smalot\PdfParser\RawData\RawDataParser->getIndirectObject('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', Array, '4', 203, true)
#1 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(905): Smalot\PdfParser\RawData\RawDataParser->decodeXrefStream('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', 203, Array)
#2 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(216): Smalot\PdfParser\RawData\RawDataParser->getXrefData('%PDF-1.4\r\n%\xF9\xFA\x9A\xE7...', 203, Array)
#3 ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php(902): Smalot\PdfParser in ../vendor/smalot/pdfparser/src/Smalot/PdfParser/RawData/RawDataParser.php on line 529

PDF input

I would be willing to provide a copy of the PDF if I can do so privately.

Expected output & actual output

The expected output of my code is the contents of the PDF parsed into a string of text and saved to a variable. Instead, certain PDF files trigger a fatal error, and I can't tell why.

Code

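// Initialize so the return value is defined for non-PDF files.
$text = '';
// Lowercase the extension so "pdf", "PDF" and mixed case all match.
$ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
if ($ext === 'pdf') {
    $parser = new \Smalot\PdfParser\Parser();
    $pdf = $parser->parseFile($path);
    $text = $pdf->getText();
}
return $text;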
k00ni commented 7 months ago

Please try again with our latest version, 2.8.0-RC2.

Chandlr commented 6 months ago

I've run into an issue with the latest version (2.8.0-RC2). I was using this code:

$config = new \Smalot\PdfParser\Config();
$config->setFontSpaceLimit(-60);
$config->setRetainImageContent(false);
$config->setIgnoreEncryption(true);

// Memory limit to use when de-compressing files, in bytes
$config->setDecodeMemoryLimit(10240);
$parser = new \Smalot\PdfParser\Parser([], $config);

$PDF = $parser->parseFile($PDFfile);
$metaData = $PDF->getDetails();

die(json_encode($metaData, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES));

The expected result would be similar to this:

Code: 200 - {
    "CreationDate": "2019-10-31T08:27:44+01:00",
    "ModDate": "2019-12-10T07:07:05+01:00",
    "Producer": "iText® 5.5.10 ©2000-2015 iText Group NV (****)",
    "Pages": 3364, <--- notice this works
    "xmp:createdate": "2019-10-31T08:27:44+01:00",
    "xmp:modifydate": "2019-12-10T07:07:05+01:00",
    "xmp:metadatadate": "2019-12-10T07:07:05+01:00",
    "pdf:producer": "iText® 5.5.10 ©2000-2015 iText Group NV (***)",
    "xmpmm:documentid": "uuid:5c870642-b206-4312-8c05-2646e3c946a0",
    "xmpmm:instanceid": "uuid:729bb9a6-a048-4bcc-996d-d44ca9a5555c",
    "dc:format": "application/pdf"
}

With a bigger PDF (4546 pages), the same PHP code above gives this result instead:

Code: 200 - {
    "Pages": 189
}

The PDF is 387 MB (406,340,557 bytes).
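
For reference, the value passed to setDecodeMemoryLimit() in the snippet above is 10240 bytes, which is only 10 KiB. Whether that limit has anything to do with the reduced page count is not established in this thread. The following is a minimal sketch of a helper that makes such limits easier to read; toBytes() is a hypothetical name, not part of PdfParser.

```php
<?php
// Hypothetical helper (not part of PdfParser): converts a shorthand size
// such as "10K" or "64M" into a byte count.
function toBytes(string $size): int
{
    $units = ['K' => 1024, 'M' => 1024 ** 2, 'G' => 1024 ** 3];
    $size = strtoupper(trim($size));
    $unit = substr($size, -1);

    if (isset($units[$unit])) {
        return (int) substr($size, 0, -1) * $units[$unit];
    }

    return (int) $size; // no suffix: already in bytes
}

echo toBytes('10K'), "\n"; // 10240, the limit used in the snippet above
echo toBytes('64M'), "\n"; // 67108864
```

A call such as $config->setDecodeMemoryLimit(toBytes('64M')) would then state the intended size directly; the 64M figure is an arbitrary illustration, not a recommendation from the maintainers.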

k00ni commented 6 months ago

Thank you for confirming.

Chandlr commented 3 months ago

I've got a sample PDF file showing a similar issue. It might be an "index" issue, since this example code only works up to the 9th page:

for ($x = 0; $x <= 16; $x++) {
    $pgcontent = $PDF->getPages()[$x]->getText();
    echo("PageNr:".$x."\r\n".$pgcontent);
}
die("Done");
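
A minimal sketch of how the loop above could be bounded by the actual number of pages and made to report which index fails. It assumes only that each page object exposes getText(), as in the snippet above; dumpPageTexts() and the stand-in page objects are hypothetical, used here so the sketch runs without a PDF.

```php
<?php
// Hypothetical helper: reads text page by page, never past the real page
// count, and records the index of any page whose getText() throws.
function dumpPageTexts(array $pages, int $maxPages): array
{
    // Normalize keys in case the array is not zero-indexed.
    $pages = array_values($pages);
    $result = ['texts' => [], 'failed' => []];
    $limit = min($maxPages, count($pages));

    for ($x = 0; $x < $limit; $x++) {
        try {
            $result['texts'][$x] = $pages[$x]->getText();
        } catch (\Throwable $e) {
            // \Throwable covers both \Exception and \Error.
            $result['failed'][$x] = $e->getMessage();
        }
    }

    return $result;
}

// Stand-in pages for illustration; with PdfParser one would pass
// $PDF->getPages() instead.
$pages = [
    new class { public function getText(): string { return 'page one'; } },
    new class { public function getText(): string { throw new \RuntimeException('broken page'); } },
];

$result = dumpPageTexts($pages, 17);
print_r($result['failed']);
```

Calling getPages() once and passing the result in also avoids rebuilding the page list on every iteration. Errors that PHP cannot catch, such as running out of memory, would still abort the script.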

This gives a 500 server error even with try and catch:

try {
    $PDFContent = $PDF->getText(16);
} catch (\Exception $e) {
    die("PDF Problem: " . $e->getMessage());
}
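
One possible reason the catch block above never runs: since PHP 7, engine errors such as TypeError are thrown as \Error, which does not extend \Exception; both implement \Throwable. Whether that is what happens with this PDF is an assumption. Some fatal errors, such as exhausting memory_limit, cannot be caught at all. A minimal sketch, with tryExtract() as a hypothetical wrapper:

```php
<?php
// Hypothetical wrapper: runs $extract and converts anything throwable
// into an error message instead of letting it escape.
function tryExtract(callable $extract): array
{
    try {
        return ['ok' => true, 'value' => $extract()];
    } catch (\Throwable $e) {
        return ['ok' => false, 'value' => get_class($e) . ': ' . $e->getMessage()];
    }
}

// A \TypeError is an \Error, so catch (\Exception $e) would miss it.
$result = tryExtract(function () {
    throw new \TypeError('simulated engine error');
});

echo $result['value'], "\n"; // TypeError: simulated engine error
```

With PdfParser, the callable would wrap the call from the snippet above, for example function () use ($PDF) { return $PDF->getText(16); }. If the 500 error persists even then, the server's PHP error log is the next place to look.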

When I open the PDF file in Foxit Reader, it behaves as if there is an index issue around pages 8-9. Is it possible to send the PDF file and keep it private? :) (Feel free to PM me and ask for the file.)