Closed micos7 closed 3 months ago
Looks like this is happening because the test in the elseif chain before this is a preg_match that also sets a $matches variable, overwriting the one from the first preg_match.
...
} elseif (strpos($pdfData, 'xref', $offset) == $offset) {
    // Already pointing at the xref table
    $startxref = $offset;
} elseif (preg_match('/([0-9]+[\s][0-9]+[\s]obj)/i', $pdfData, $matches, \PREG_OFFSET_CAPTURE, $offset)) {
    // $matches gets set here ^ by this test, even if it fails
    // Cross-Reference Stream object
    $startxref = $offset;
} elseif ($startxrefPreg) {
    // startxref found
    $startxref = $matches[1][0];
    // This is the wrong $matches ^ now
} else {
    throw new \Exception('Unable to find startxref');
}
...
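To make the overwrite concrete, here is a minimal standalone sketch. It is not PdfParser's code: the sample string, the startxref pattern, and the variable names $startxrefMatches and $objMatches are assumptions made for illustration. It shows that a preg_match call that finds nothing still resets the array passed to it, and one possible guard, which is to give each test its own array.

```php
<?php
// Hypothetical sample data: contains a startxref value but no "N N obj" text.
$pdfData = "trailer\nstartxref\n1234\n%%EOF";
$startxrefPattern = '/startxref\s+([0-9]+)/';   // assumed pattern, for illustration only
$objPattern = '/([0-9]+\s[0-9]+\sobj)/i';

// Problem: both tests write into the same $matches array.
$startxrefPreg = preg_match($startxrefPattern, $pdfData, $matches, PREG_OFFSET_CAPTURE);
$objPreg = preg_match($objPattern, $pdfData, $matches, PREG_OFFSET_CAPTURE);

// The second call found nothing, so it returned 0 and emptied $matches.
$sharedIsEmpty = ($matches === []);

// One possible guard: a separate array per test.
$startxrefPreg = preg_match($startxrefPattern, $pdfData, $startxrefMatches, PREG_OFFSET_CAPTURE);
$objPreg = preg_match($objPattern, $pdfData, $objMatches, PREG_OFFSET_CAPTURE);

// With PREG_OFFSET_CAPTURE each group is a [text, offset] pair.
$startxref = $startxrefPreg ? (int) $startxrefMatches[1][0] : null;

var_dump($sharedIsEmpty, $objPreg, $startxref);
```

With separate arrays, the order of the elseif tests no longer affects which capture is read later.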
Also, the example file from the OP is a gigantic 712-page PDF that so far has not finished parsing since I started writing this post! :D I would not recommend using PdfParser to extract text from this file. You should probably use an online tool that separates all the pages into individual PDF files and run PdfParser on those.
Edit: After churning for 10 minutes on this file, PHP ran out of memory, lol.
I'm using it to count the pages for a middleware in Laravel. For extracting text I use Python, which can digest weird formats. I have PDFs with 2000+ pages...
On further investigation, this is actually happening because either the example PDF isn't giving the correct, to-the-byte offset for the start of an xref object, or PdfParser isn't being lenient enough when checking the current offset against the content of the file.
One of the tests in RawDataParser->getXrefData() is as follows:
} elseif (strpos($pdfData, 'xref', $offset) == $offset) {
This checks to see if, for the given $offset in the PDF, there is an xref statement, and if so we should start parsing content here. However, this check is to-the-byte strict. In the case of this PDF, the $offset value given actually points to a whitespace character (carriage return followed by a newline) two bytes before the xref. So when PdfParser fails to find the xref at the exact $offset value, it actually falls into a loop trying (and failing) to find it over and over and over, which is where PHP was running out of memory.
When I add the following code to "bump the caret" past any whitespace at the current offset, the xref command is found and this huge PDF is actually parsed and displayed in a remarkably short time:
while (preg_match('/\s/', substr($pdfData, $offset, 1))) {
    $offset++;
}
if (0 == $offset) {
...
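For anyone who wants to try the idea outside the library, here is a self-contained sketch of the same "bump the caret" step. It is a variation, not the patch itself: the helper name skipWhitespace is hypothetical, and it uses strspn with the six whitespace bytes the PDF format defines (NUL, tab, line feed, form feed, carriage return, space) instead of a per-character preg_match.

```php
<?php
/**
 * Hypothetical helper: return the first offset at or after $offset
 * that does not hold a PDF whitespace byte.
 */
function skipWhitespace(string $pdfData, int $offset): int
{
    // Leave out-of-range offsets untouched rather than guessing.
    if ($offset < 0 || $offset >= strlen($pdfData)) {
        return $offset;
    }

    // strspn() counts the leading run of bytes, starting at $offset,
    // that all belong to the given set.
    return $offset + strspn($pdfData, "\0\t\n\f\r ", $offset);
}

// Made-up data that mimics the situation in this issue:
// the stored offset points at "\r\n", two bytes before "xref".
$pdfData = "%PDF-1.3\r\nxref\n0 1\n";
$storedOffset = 8;

$fixedOffset = skipWhitespace($pdfData, $storedOffset);
$strictCheckPasses = (strpos($pdfData, 'xref', $fixedOffset) === $fixedOffset);

var_dump($fixedOffset, $strictCheckPasses);
```

The strict `===` comparison in the usage lines also avoids treating a false return from strpos as offset 0.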
I'm not sure this is the best solution yet, and I haven't run it through the unit tests either. However, adding this code with no other changes allows parsing of the OP's example file.
Edit: All unit tests pass with addition of this code. I'm studying the PDF Reference to see if there are considerations for offset values to be lenient with whitespace like this.
This might be because in the PDF header, when loaded as ISO-8859-1, we see the following:
%PDF-1.3
%âãÏÓ
3390 0 obj
...
But when loaded as UTF-8, the four special characters are merged into two unknown characters, perhaps lopping off two bytes from every offset value:
%PDF-1.3
%??
3390 0 obj
...
It's a plausible cause, but I'm not sure it's the actual one. Other PDFs have headers just like these and their offset values aren't off by two bytes.
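One way to probe this hypothesis is a quick check with a made-up header. The sketch below assumes plain "\n" line endings and writes the four marker bytes as hex escapes. PHP's strlen, strpos, and substr operate on raw bytes, so the offset they report does not depend on how an editor or terminal decodes those bytes for display.

```php
<?php
// Made-up header: "%PDF-1.3", then "%" followed by the bytes E2 E3 CF D3
// (shown as "âãÏÓ" when decoded as ISO-8859-1), then the first object.
$header = "%PDF-1.3\n%\xE2\xE3\xCF\xD3\n3390 0 obj";

// Both functions count bytes, not displayed characters.
$totalBytes = strlen($header);
$objOffset = strpos($header, '3390 0 obj');

var_dump($totalBytes, $objOffset);
```

If the offsets in a real file are still off by two when measured this way, the display encoding is unlikely to be the explanation.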
So, this file contains a Prev 7123863 command which references the character position of the previous XRef block. Loading the file as a string and doing a var_dump(substr($pdfdata, 7123863, 200)); results in:
string(200) "
xref
0 3390
0000000000 65535 f
0000667726 00000 n
0000667861 00000 n
0000668830 00000 n
0000668970 00000 n
0000669939 00000 n
0000670272 00000 n
0000670294 00000 n
0000670434 00000 n
00006"
You can see that the string begins with a newline character (in fact a carriage-return plus newline \r\n) and the xref starts on the next line. PdfParser expects the xref text to be at exactly character position 7123863, instead of 7123865. When it does not find the xref text, it stops looking for xref and instead scans the document from this offset for the next startxref command. The one it finds is one it's seen before though, the one that contains the Prev 7123863 command, so PdfParser falls into an endless loop at this point.
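Independently of the whitespace fix, a loop like this can be made to fail fast by remembering which offsets have already been visited. The sketch below is a simplified model, not PdfParser's code: the function walkPrevChain and the $prevByOffset map (section offset to its Prev offset, or null at the end of the chain) are hypothetical stand-ins for the real xref traversal.

```php
<?php
/**
 * Hypothetical model: follow Prev pointers from $start and
 * throw as soon as an offset is seen a second time.
 *
 * @param array<int, int|null> $prevByOffset section offset => Prev offset (null = end)
 * @return int[] offsets in the order they were visited
 */
function walkPrevChain(array $prevByOffset, int $start): array
{
    $visited = [];
    $offset = $start;

    while ($offset !== null) {
        if (isset($visited[$offset])) {
            throw new \RuntimeException("Circular xref chain at offset $offset");
        }
        $visited[$offset] = true;
        $offset = $prevByOffset[$offset] ?? null;
    }

    return array_keys($visited);
}

// A well-formed chain ends normally.
$order = walkPrevChain([7200000 => 7123863, 7123863 => null], 7200000);

// A chain that points back at itself is reported instead of looping forever.
$loopDetected = false;
try {
    walkPrevChain([7200000 => 7123863, 7123863 => 7200000], 7200000);
} catch (\RuntimeException $e) {
    $loopDetected = true;
}

var_dump($order, $loopDetected);
```

A guard of this kind would turn the out-of-memory failure into an immediate, descriptive exception for files whose offsets cannot be repaired.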
The PDF Reference is not exactly clear on this, but in theory, an incorrect XRef offset value should cause an error and the PDF should fail to display. However, in practice, Adobe Acrobat is loading the OP's sample file and displaying it without error. Obviously Acrobat accounts for this and deals with it internally.
Therefore I believe that my "bump the caret" code above is probably an acceptable solution to this. What do you think, @k00ni?
:+1: Sounds reasonable. Can you provide a PR?
@micos7 Can we use your PDF for our test environment (it must be free of charge and without any obligations)? If so, please reupload.
Description:
PDF input
Cracking-the-Coding-Interview-6th-Edition-189-Programming-Questions-and-Solutions.pdf
Expected output & actual output
It crashes in the RawDataParser, line 890:
elseif ($startxrefPreg) {
    // startxref found
    $startxref = $matches[1][0];
}
$matches is an empty array.
Code
Just the usual stuff.