smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Undefined array key 1, crash on parsing #673

Closed micos7 closed 3 months ago

micos7 commented 5 months ago

Description:

PDF input

Cracking-the-Coding-Interview-6th-Edition-189-Programming-Questions-and-Solutions.pdf

Expected output & actual output

It crashes in RawDataParser, line 890:

        } elseif ($startxrefPreg) {
            // startxref found
            $startxref = $matches[1][0];
        }

$matches is an empty array.

Code

Just the usual stuff.

GreyWyvern commented 5 months ago

Looks like this is happening because the test in the elseif chain before this is a preg_match that also sets a $matches variable, overwriting the one from the first preg_match.

...
        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {
            // Already pointing at the xref table
            $startxref = $offset;
        } elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset)) {
                                                // $matches gets set here ^ by this test, even if it fails

            // Cross-Reference Stream object
            $startxref = $offset;
        } elseif ($startxrefPreg) {
            // startxref found
            $startxref = $matches[1][0];
// This is the wrong $matches ^ now

        } else {
            throw new \Exception('Unable to find startxref');
        }
...
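
The overwriting behaviour described above can be reproduced in isolation. The following is a minimal sketch, not PdfParser's code: the sample `$pdfData` string, the simplified `startxref` pattern, and the variable name `$startxrefMatches` are all assumptions made for illustration. It shows that a failing `preg_match` still resets the array passed as its third argument, and that keeping the first result under its own name avoids the problem.

```php
<?php
// Made-up sample data: contains a startxref entry but no "N N obj" header.
$pdfData = "trailer\nstartxref\n1234\n%%EOF";

// First match succeeds and fills $matches with the captured offset.
$startxrefPreg = preg_match('/startxref\s+([0-9]+)/', $pdfData, $matches, PREG_OFFSET_CAPTURE);
// Keep a copy under a separate (hypothetical) name.
$startxrefMatches = $matches;

// Second match fails, yet it still overwrites $matches with an empty array.
$objPreg = preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, PREG_OFFSET_CAPTURE);

var_dump($objPreg);                  // int(0)
var_dump($matches);                  // empty array
var_dump($startxrefMatches[1][0]);   // string(4) "1234"
```

Reading `$matches[1][0]` after the second call is what raises "Undefined array key 1"; reading from the separately named copy does not.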

Also, the example file from the OP is a gigantic 712-page PDF that so far has not finished parsing since I started writing this post! :D I would not recommend using PdfParser to extract text from this file. You should probably use an online tool that separates all the pages into individual PDF files and run PdfParser on those.

Edit: After churning for 10 minutes on this file, PHP ran out of memory, lol.

micos7 commented 5 months ago

I'm using it to count the pages for a middleware in Laravel. For extracting text I use Python, which can digest weird formats. I have PDFs with 2000+ pages...

GreyWyvern commented 5 months ago

On further investigation, this is actually happening because either the example PDF isn't giving the correct, to-the-byte offset for the start of an xref object, or PdfParser isn't being lenient enough when checking the current offset against the content of the file.

One of the tests in RawDataParser->getXrefData() is as follows:

        } elseif (strpos($pdfData, 'xref', $offset) == $offset) {

This checks to see if, for the given $offset in the PDF, there is an xref statement, and if so we should start parsing content here. However, this check is to-the-byte strict. In the case of this PDF, the $offset value given actually points to a whitespace character (carriage return followed by a newline) two bytes before the xref. So when PdfParser fails to find the xref at the exact $offset value, it actually falls into a loop trying (and failing) to find it over and over and over, which is where PHP was running out of memory.

When I add the following code to "bump the caret" past any whitespace at the current offset, the xref command is found and this huge PDF is actually parsed and displayed in a remarkably short time:

        while (preg_match('/\s/', substr($pdfData, $offset, 1))) {
            $offset++;
        }

        if (0 == $offset) {
            ...

I'm not sure this is the best solution yet, and I haven't run it through the unit tests either. However, adding this code with no other changes allows parsing of the OP's example file.

Edit: All unit tests pass with addition of this code. I'm studying the PDF Reference to see if there are considerations for offset values to be lenient with whitespace like this.
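
As a self-contained illustration of the "bump the caret" idea, here is a sketch that can be run without PdfParser. The function name `skipWhitespace`, the sample data, and the choice of a plain byte loop instead of `preg_match('/\s/', ...)` are assumptions for this example, not the library's implementation. The whitespace set used is the six bytes the PDF format treats as whitespace.

```php
<?php
/**
 * Hypothetical helper (not part of PdfParser): advance $offset past any
 * PDF whitespace bytes (NUL, tab, LF, form feed, CR, space).
 */
function skipWhitespace(string $data, int $offset): int
{
    $whitespace = "\0\t\n\f\r ";
    while (isset($data[$offset]) && strpos($whitespace, $data[$offset]) !== false) {
        $offset++;
    }
    return $offset;
}

// Made-up data: the stated offset (4) points at "\r\n", two bytes before "xref".
$data = "%PDF\r\nxref\n0 1\n";
$offset = skipWhitespace($data, 4);

var_dump($offset);                                    // int(6)
var_dump(strpos($data, 'xref', $offset) === $offset); // bool(true)
```

The `isset($data[$offset])` guard stops the loop at the end of the string, so an offset pointing at trailing whitespace cannot run past the data.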

This might be because in the PDF header, when loaded as ISO-8859-1, we see the following:

%PDF-1.3
%âãÏÓ
3390 0 obj
...

But when loaded as UTF-8, the four special characters are merged into two unknown characters, perhaps lopping off two bytes from every offset value:

%PDF-1.3
%??
3390 0 obj
...

It's a plausible cause, but I'm not sure it's the actual one. Other PDFs have headers just like these and their offset values aren't off by two bytes.
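
One way to probe the encoding theory is to check how PHP itself counts those header bytes. The sketch below assumes the file is held in an ordinary PHP string, that the four special characters are the bytes E2 E3 CF D3, and that lines end in a single `\n`; none of this is taken from the OP's file. PHP's `strlen` and `strpos` count bytes, so how a viewer renders the characters does not change the offsets they report.

```php
<?php
// Header bytes as shown in the thread, written with explicit byte escapes.
$header = "%PDF-1.3\n%\xE2\xE3\xCF\xD3\n3390 0 obj";

var_dump(strlen("%\xE2\xE3\xCF\xD3")); // int(5): "%" plus four single bytes
var_dump(strpos($header, '3390'));      // int(15): 9 bytes + 6 bytes precede it
```

This only shows that PHP's string functions work on bytes; it says nothing about how the program that produced the PDF computed its offsets.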

GreyWyvern commented 3 months ago

So, this file contains a Prev 7123863 command which references the character position of the previous XRef block. Loading the file as a string and doing a var_dump(substr($pdfdata, 7123863, 200)); results in:

string(200) "
xref
0 3390 
0000000000 65535 f
0000667726 00000 n
0000667861 00000 n
0000668830 00000 n
0000668970 00000 n
0000669939 00000 n
0000670272 00000 n
0000670294 00000 n
0000670434 00000 n
00006"

You can see that the string begins with a newline character (in fact a carriage-return plus newline \r\n) and the xref starts on the next line. PdfParser expects the xref text to be at exactly character position 7123863, instead of 7123865. When it does not find the xref text, it stops looking for xref and instead scans the document from this offset for the next startxref command. The one it finds is one it's seen before though, the one that contains the Prev 7123863 command, so PdfParser falls into an endless loop at this point.
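
The loop described above is a cycle in the chain of Prev references. Independently of the whitespace fix, one generic guard against such cycles is to remember which offsets have already been visited. The sketch below is a hypothetical illustration only: `followPrevChain`, the `$prevOf` map, and the offset 9000000 are made up for the example and do not reflect PdfParser's internals.

```php
<?php
/**
 * Hypothetical helper: follow a chain of Prev offsets and stop as soon as
 * an offset repeats. $prevOf maps an xref offset to the offset its Prev
 * entry names; a missing key means the chain ends there.
 */
function followPrevChain(array $prevOf, int $start): array
{
    $visited = [];
    $offset = $start;
    while ($offset !== null && !isset($visited[$offset])) {
        $visited[$offset] = true;
        $offset = $prevOf[$offset] ?? null;
    }
    return array_keys($visited);
}

// Made-up cycle: the second offset leads back to the first.
$prevOf = [9000000 => 7123863, 7123863 => 9000000];
var_dump(followPrevChain($prevOf, 9000000)); // two entries: 9000000, 7123863
```

A guard like this would turn an endless loop into a finite walk, but it would not locate the xref table by itself; the offset still has to be corrected for that.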

The PDF Reference is not exactly clear on this, but in theory, an incorrect XRef offset value should cause an error and the PDF should fail to display. However, in practice, Adobe Acrobat is loading the OP's sample file and displaying it without error. Obviously Acrobat accounts for this and deals with it internally.

Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni ?

k00ni commented 3 months ago

Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni?

:+1: Sounds reasonable. Can you provide a PR?

@micos7 Can we use your PDF for our test environment (it must be free of charge and without any obligations)? If so, please reupload.