pauln / tcpdi_parser

Parser for use with TCPDI, based on TCPDF_PARSER
GNU Lesser General Public License v3.0

Stream length wrapped in an object in PDF-1.3 and PDF-1.4 causes infinite loop #2

Closed puxan closed 10 years ago

puxan commented 10 years ago

I've been trying to import PDF files using the following code:

$pdf = new TCPDI(PDF_PAGE_ORIENTATION, PDF_UNIT, PDF_PAGE_FORMAT, true, 'UTF-8', false);

// Import the first page of the source PDF as a template
$pdf->AddPage();
$pdf->setSourceFile($path); // $path: path to the PDF being imported
$idx = $pdf->importPage(1);
$pdf->useTemplate($idx);

echo $pdf->Output();

I had no problems with 1.5 and 1.7 PDF versions, but when I try it with 1.3 or 1.4 versions, the loop in getIndirectObject() never ends.

An example of a PDF not working: https://www.dropbox.com/s/9ax2lc5fed4erit/1.pdf

I've been trying to understand what is wrong, but I don't know enough about PDF formats.

Thanks

richplane commented 10 years ago

I'm experiencing the same problem. The last function I saw called in my stack before I ran out of memory was getRawObject(), so I logged every call to this function and traced it back to the calls from getIndirectObject() as jpuxan reported.

It's not finding the "endobj" element because a call to getRawObject() is returning the same offset it was given. I put a test into getIndirectObject() to var_dump the $element returned from getRawObject() whenever $offset hasn't changed. The result was a 2-element empty array.

So the problem is that if getRawObject() encounters anything that doesn't match any of its switch() cases, the offset isn't advanced and the parser just loops indefinitely.
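The failure mode described above can be illustrated with a small sketch. This is not the tcpdi_parser code: readToken() and readUntilEndobj() are hypothetical stand-ins for getRawObject() and the loop in getIndirectObject(), and the tokenizer only recognises integers and "endobj". It shows one possible guard, assuming the caller can compare the offset before and after each read.

```php
<?php
// Hypothetical stand-in for a tokenizer: returns [token, nextOffset].
// Like the behaviour reported above, it leaves the offset unchanged
// when it meets bytes it does not recognise.
function readToken(string $data, int $offset): array
{
    if (preg_match('/\G\s*(endobj|\d+)/', $data, $m, 0, $offset)) {
        return [$m[1], $offset + strlen($m[0])];
    }
    return [null, $offset]; // unrecognised input: offset not advanced
}

// Hypothetical reading loop: collects tokens until "endobj" and refuses
// to continue if a read makes no progress.
function readUntilEndobj(string $data, int $offset): array
{
    $tokens = [];
    $length = strlen($data);
    while ($offset < $length) {
        [$token, $next] = readToken($data, $offset);
        if ($next <= $offset) {
            // Without this check the loop would spin on the same offset.
            throw new RuntimeException("Parser made no progress at offset $offset");
        }
        $offset = $next;
        if ($token === 'endobj') {
            return $tokens;
        }
        $tokens[] = $token;
    }
    throw new RuntimeException('Reached end of data without finding endobj');
}
```

With a check like this, unexpected bytes produce an exception that names the offending offset rather than a loop that runs until memory is exhausted. It only reports the symptom; it does not explain why the offset ended up in the wrong place.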

I don't know much about the anatomy of PDFs but it seems that getRawObject tries a load of marker characters, and if these are inconclusive it makes a fragment and tests this; if this is inconclusive it tries a regular expression and finally looks for a numeric value.

I dumped the $frag fragment that it's testing against; got this: “ÜÄ

It turns out this is the first element in an array block (enclosed in square brackets).

I did the same with the PDF jpuxan uploaded. The fragment breaking his was Â@ E, which only occurs mid-stream.

So it could just be that $offset is being thrown out completely by something else entirely, so getRawObject() encounters this character sequence where it doesn't expect it.

Hope this is useful.

pauln commented 10 years ago

Thanks for your report, and for the example file, @jpuxan. This should now be fixed, at least for the file you provided and any others with the same issue. The stream length was wrapped in an object and referenced from the stream info dict, rather than being put directly in it as a number. That object's identifier was being misinterpreted as the stream length, so the parser looked for the next token in the wrong place. If you are still experiencing similar issues with other files, please raise another issue with relevant example file(s) and I'll investigate further.
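To make the two forms of the length entry concrete: a stream dictionary can carry the length directly ("/Length 1234") or as an indirect reference to another object ("/Length 12 0 R"). The sketch below is only an illustration of telling them apart, not the actual fix committed to the parser. resolveStreamLength() and the $objects lookup table are hypothetical names, and the regular expressions assume a simple, uncompressed dictionary string.

```php
<?php
// Hypothetical helper, not part of tcpdi_parser: find a stream's byte length
// from the text of its dictionary, following an indirect reference if needed.
// $objects maps "objNum genNum" keys to the raw body of that object.
function resolveStreamLength(string $dict, array $objects): int
{
    // The indirect form must be tested first: "/Length 12 0 R".
    if (preg_match('/\/Length\s+(\d+)\s+(\d+)\s+R\b/', $dict, $m)) {
        $key = $m[1] . ' ' . $m[2];
        if (!isset($objects[$key])) {
            throw new RuntimeException("Length object $key not found");
        }
        // The referenced object holds the real length as a number.
        return (int) trim($objects[$key]);
    }
    // Direct form: "/Length 1234".
    if (preg_match('/\/Length\s+(\d+)/', $dict, $m)) {
        return (int) $m[1];
    }
    throw new RuntimeException('Stream dictionary has no /Length entry');
}
```

For a dictionary containing "/Length 12 0 R" where object 12 0 holds 4096, this returns 4096. Treating the leading 12 as the length is the kind of misreading described above, and it would leave the parser looking for the next token partway through the stream data.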

richplane commented 10 years ago

Confirm that this fixes it for my PDF too. Many thanks, Paul.

puxan commented 10 years ago

Thanks Paul!