smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Bug in RawDataParser.php when a row has no data #679

Closed KeanuTang closed 3 months ago

KeanuTang commented 4 months ago

Description:

PDF input

The PDF has personal identifying information that I cannot share through a post like this. I could maybe email it directly if necessary.

Expected output & actual output

I'm just trying to call

Code

                            case 'pdf':
                                try {
                                    $parser = new Parser([], $config);
                                    $PDF = $parser->parseFile($filepath);
                                    $pdfText = mb_convert_encoding($PDF->getText(), 'UTF-8');
                                    $result .= $pdfText;
                                } catch (\Exception $e) {
                                    \Log::error('Error parsing recommendation PDF text: '.$e->getMessage());
                                    break;
                                }

I'm getting this error:

image

In \smalot\pdfpaser\src\Smalot\PdfParser\RawData\RawDataParser.php, the issue seems to be coming from this code:

            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                // convert the stream into an array of integers
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // split the rows
                $ddata = array_chunk($sdata, $rowlen);
            }

I wonder if it should be something like this?

            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                if ($rowlen > 0) {
                    // convert the stream into an array of integers
                    $sdata = unpack('C*', $xrefcrs[1][3][0]);
                    // split the rows
                    $ddata = array_chunk($sdata, $rowlen);
                } else {
                    $ddata = [];
                }
            }
k00ni commented 4 months ago

Thanks for report.

I wonder if it should be something like this?

Might be. Can you elaborate a bit on this code?

KeanuTang commented 4 months ago

I'm very new to this code and library, but the code snippet comes from: \smalot\pdfpaser\src\Smalot\PdfParser\RawData\RawDataParser.php

It seems like the issue with the code is it assumes that every row of data from the PDF file is actually going to have data in it. Here's a larger snippet from RawDataParser.php:

        // decode data
        if ($valid_crs && isset($xrefcrs[1][3][0])) {
            if ($predictor !== null) {
                // number of bytes in a row
                $rowlen = ($columns + 1);
                // convert the stream into an array of integers
                ...
                } // end for each row
                // complete decoding
            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                // convert the stream into an array of integers
                $sdata = unpack('C*', $xrefcrs[1][3][0]);
                // split the rows
                $ddata = array_chunk($sdata, $rowlen);
            }

            $sdata = [];

            // for every row
            foreach ($ddata as $k => $row) {
                // initialize new row
                $sdata[$k] = [0, 0, 0];
                if ($wb[0] == 0) {
                    // default type field
                    $sdata[$k][0] = 1;
                }
                $i = 0; // count bytes in the row
                // for every column
                for ($c = 0; $c < 3; $c++) {
                    // for every byte on the column
                    for ($b = 0; $b < $wb[$c]; $b++) {
                        if (isset($row[$i])) {
                            $sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
                        }
                        $i++;
                    }
                }
            }

Somehow my end user generated a PDF that has a row where a row of data ends up empty, and the row length is zero. This line of code returns a zero. $rowlen = array_sum($wb);

So my proposal is, instead of assuming a row always results in $rowlen>0, we handle it accordingly:

            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                if ($rowlen > 0) {
                    // convert the stream into an array of integers
                    $sdata = unpack('C*', $xrefcrs[1][3][0]);
                    // split the rows
                    $ddata = array_chunk($sdata, $rowlen);
                } else {
                    // if the row length is zero, $ddata should be an empty array as well
                    $ddata = [];
                }
            }
k00ni commented 4 months ago

Can you upload the PDF here?

CC @GreyWyvern Do you agree with the suggested adaption of the code?

KeanuTang commented 4 months ago

I'll upload it for now but will delete it when we're done investigating. It has information that I don't want to persist here.

GreyWyvern commented 4 months ago

I haven't really worked with RawDataParser.php much, and a lot of what it does is still voodoo to me. The issue is with the /W [0 0 0] commands near the bottom of the file @KeanuTang just posted. Even after reading the PDF Reference (v1.7 pp 108-109) concerning this command, I'm still not completely wrapping my head around it.

In any case, the PDF Reference says the following:

A value of zero for an element in the W array indicates that the corresponding field is not present in the stream, and the default value is used, if there is one. If the first element is zero, the type field is not present, and it defaults to type 1.

So it seems to me that a /W [0 0 0] command is indicating that none of the fields are present, and that this stream is essentially "empty". There are eight bytes in the associated streams, but with "field lengths" of all zeros, there is no way to know how to chunk or decode it. They're probably just placeholder bytes IMHO.

I think we're fine to return an empty array in that case, as in @KeanuTang's suggested code.

KeanuTang commented 4 months ago

Sounds good to me. I haven't spent much time looking at RawDataParser.php myself. I just investigated where the error was coming from and saw the code above. I'll trust you all to do what's best for the pdfparser code.

I've removed the PDF file. Just let me know if you need it again.

GreyWyvern commented 4 months ago

I'm finding it very difficult to create a test-case for this since the function in question decodeXrefStream() requires a fully valid PDF with correct reference offset numbers as input. I've tried setting the /W values for an existing XRef object (in Issue322.pdf from the samples/bugs folder) to 0 0 0 but some logic in PDFParser catches it before it gets to the point that triggers the OP's original error.

So unless @KeanuTang can create a new document we can use that exhibits the same behaviour, we're stuck.

Do you think we could just add this code in a PR without a test-case, @k00ni?

            // complete decoding
            } else {
                // number of bytes in a row
                $rowlen = array_sum($wb);
                if (0 < $rowlen) {
                    // convert the stream into an array of integers
                    $sdata = unpack('C*', $xrefcrs[1][3][0]);
                    // split the rows
                    $ddata = array_chunk($sdata, $rowlen);
                } else {
                    // if the row length is zero, $ddata should be an empty array as well
                    $ddata = [];
                }
            }
KeanuTang commented 4 months ago

@GreyWyvern Did you get the file I uploaded from before? I think this was the problematic file. I didn't create the PDF, so I'm not sure how it was first generated.

GreyWyvern commented 4 months ago

@GreyWyvern Did you get the file I uploaded from before? I think this was the problematic file. I didn't create the PDF, so I'm not sure how it was first generated.

Yes, I have it, but you've mentioned we can't use it because it contains some sensitive information, which is fine. I've also tried editing your file by removing all the text, but any client I use to save it deletes the /w [0 0 0] XRef's that are causing this. Acrobat actually hangs on me when trying to load your file, even though you can still read it, which is weird.

If possible, you might want to contact the individual who created this file and sent it to you, ask them to create a new blank document in the same way, save it and send it back to you.

It is just one single test for a zero condition, so it might be okay to just commit it without a unit test.