Closed KeanuTang closed 3 months ago
Thanks for report.
I wonder if it should be something like this?
Might be. Can you elaborate a bit on this code?
I'm very new to this code and library, but the code snippet comes from: \smalot\pdfpaser\src\Smalot\PdfParser\RawData\RawDataParser.php
It seems like the issue with the code is it assumes that every row of data from the PDF file is actually going to have data in it. Here's a larger snippet from RawDataParser.php:
// decode data
if ($valid_crs && isset($xrefcrs[1][3][0])) {
if ($predictor !== null) {
// number of bytes in a row
$rowlen = ($columns + 1);
// convert the stream into an array of integers
...
} // end for each row
// complete decoding
} else {
// number of bytes in a row
$rowlen = array_sum($wb);
// convert the stream into an array of integers
$sdata = unpack('C*', $xrefcrs[1][3][0]);
// split the rows
$ddata = array_chunk($sdata, $rowlen);
}
$sdata = [];
// for every row
foreach ($ddata as $k => $row) {
// initialize new row
$sdata[$k] = [0, 0, 0];
if ($wb[0] == 0) {
// default type field
$sdata[$k][0] = 1;
}
$i = 0; // count bytes in the row
// for every column
for ($c = 0; $c < 3; $c++) {
// for every byte on the column
for ($b = 0; $b < $wb[$c]; $b++) {
if (isset($row[$i])) {
$sdata[$k][$c] += ($row[$i] << (($wb[$c] - 1 - $b) * 8));
}
$i++;
}
}
}
Somehow my end user generated a PDF that has a row where a row of data ends up empty, and the row length is zero. This line of code returns a zero.
$rowlen = array_sum($wb);
So my proposal is, instead of assuming a row always results in $rowlen>0, we handle it accordingly:
} else {
// number of bytes in a row
$rowlen = array_sum($wb);
if ($rowlen > 0) {
// convert the stream into an array of integers
$sdata = unpack('C*', $xrefcrs[1][3][0]);
// split the rows
$ddata = array_chunk($sdata, $rowlen);
} else {
// if the row length is zero, $ddata should be an empty array as well
$ddata = [];
}
}
Can you upload the PDF here?
CC @GreyWyvern Do you agree with the suggested adaption of the code?
I'll upload it for now but will delete it when we're done investigating. It has information that I don't want to persist here.
I haven't really worked with RawDataParser.php much, and a lot of what it does is still voodoo to me. The issue is with the /W [0 0 0]
commands near the bottom of the file @KeanuTang just posted. Even after reading the PDF Reference (v1.7 pp 108-109) concerning this command, I'm still not completely wrapping my head around it.
In any case, the PDF Reference says the following:
A value of zero for an element in the W array indicates that the corresponding field is not present in the stream, and the default value is used, if there is one. If the first element is zero, the type field is not present, and it defaults to type 1.
So it seems to me that a /W [0 0 0]
command is indicating that none of the fields are present, and that this stream is essentially "empty". There are eight bytes in the associated streams, but with "field lengths" of all zeros, there is no way to know how to chunk or decode it. They're probably just placeholder bytes IMHO.
I think we're fine to return an empty array in that case, as in @KeanuTang's suggested code.
Sounds good to me. I haven't spent much time looking at RawDataParser.php myself. I just investigated where the error was coming from and saw the code above. I'll trust you all to do what's best for the pdfparser code.
I've removed the PDF file. Just let me know if you need it again.
I'm finding it very difficult to create a test-case for this since the function in question decodeXrefStream()
requires a fully valid PDF with correct reference offset numbers as input. I've tried setting the /W
values for an existing XRef object (in Issue322.pdf from the samples/bugs folder) to 0 0 0
but some logic in PDFParser catches it before it gets to the point that triggers the OP's original error.
So unless @KeanuTang can create a new document we can use that exhibits the same behaviour, we're stuck.
Do you think we could just add this code in a PR without a test-case, @k00ni?
// complete decoding
} else {
// number of bytes in a row
$rowlen = array_sum($wb);
if (0 < $rowlen) {
// convert the stream into an array of integers
$sdata = unpack('C*', $xrefcrs[1][3][0]);
// split the rows
$ddata = array_chunk($sdata, $rowlen);
} else {
// if the row length is zero, $ddata should be an empty array as well
$ddata = [];
}
}
@GreyWyvern Did you get the file I uploaded from before? I think this was the problematic file. I didn't create the PDF, so I'm not sure how it was first generated.
@GreyWyvern Did you get the file I uploaded from before? I think this was the problematic file. I didn't create the PDF, so I'm not sure how it was first generated.
Yes, I have it, but you've mentioned we can't use it because it contains some sensitive information, which is fine. I've also tried editing your file by removing all the text, but any client I use to save it deletes the /w [0 0 0]
XRef's that are causing this. Acrobat actually hangs on me when trying to load your file, even though you can still read it, which is weird.
If possible, you might want to contact the individual who created this file and sent it to you, ask them to create a new blank document in the same way, save it and send it back to you.
It is just one single test for a zero condition, so it might be okay to just commit it without a unit test.
Description:
PDF input
The PDF has personal identifying information that I cannot share through a post like this. I could maybe email it directly if necessary.
Expected output & actual output
I'm just trying to call
Code
I'm getting this error:
In \smalot\pdfpaser\src\Smalot\PdfParser\RawData\RawDataParser.php, the issue seems to be coming from this code:
I wonder if it should be something like this?