pauln / tcpdi_parser

Parser for use with TCPDI, based on TCPDF_PARSER
GNU Lesser General Public License v3.0

Unable to parse v1.3 PDFs created by DOMPDF #3

Closed richplane closed 10 years ago

richplane commented 10 years ago

I've not read anything that's led me to believe that TCPDI shouldn't work with v1.3 PDFs, but each time I call setSourceFile on a PDF generated by DOMPDF I get "TCPDF_PARSER ERROR: Invalid object reference: Array". The PDFs open in Acrobat Reader without complaint.

Example PDF here: http://test.semlyen.net/problematic-PDF.pdf (not from a real rent agreement!)

Any ideas?

pauln commented 10 years ago

It looks like DOMPDF is miscalculating the "startxref" value, so it points at the line break before the xref table instead of at the xref table itself. To work around this, I've just pushed a commit which skips past any CR or LF characters at the startxref position. This fixes it for your example PDF, and hopefully for any others generated by DOMPDF.
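For anyone following along, the workaround described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual commit (tcpdi_parser is PHP); the function name `find_xref_offset` is hypothetical, and it assumes the last "startxref" entry in the file is the one that matters.

```python
import re


def find_xref_offset(pdf_bytes: bytes) -> int:
    """Return the offset of the xref table, tolerating a startxref value
    that points at line-break bytes just before the table."""
    # Take the last startxref entry in the file (assumption for this sketch).
    matches = re.findall(rb"startxref\s+(\d+)", pdf_bytes)
    if not matches:
        raise ValueError("no startxref entry found")
    offset = int(matches[-1])
    # Skip any CR or LF bytes sitting at the recorded position.
    while offset < len(pdf_bytes) and pdf_bytes[offset] in b"\r\n":
        offset += 1
    return offset
```

If the recorded offset already points at the "xref" keyword, the loop does nothing, so the same code path should cope with both correct and off-by-one files.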

pauln commented 10 years ago

I've submitted a pull request to DOMPDF to fix it (they just weren't adding the \n to the calculated offset). With this latest patch, tcpdi_parser should handle PDFs generated by DOMPDF whether or not the startxref is fixed.

richplane commented 10 years ago

Wow - many thanks and much respect to you. I'd like to donate something for your time. How do I do this?

I've had some trouble making this work, which I think is worth telling you about. I'm seeing two warnings.

Firstly:

Severity: Warning
Message: preg_match_all() [function.preg-match-all]: Compilation failed: nothing to repeat at offset 1
Filename: PDF/tcpdi_parser.php
Line Number: 1056

Those warnings don't appear to be showstoppers. The one that I get when a DOMPDF-generated PDF fails to import is now this:

Severity: Warning
Message: Invalid argument supplied for foreach()
Filename: PDF/tcpdi_parser.php
Line Number: 259

getObjectVal is returning an array of integers: { [0]=> int(8) [1]=> int(1) [2]=> int(0) }.

Since that function calls findObjectOffsets() a number of times (wherein lives our dubious regex) it seems plausible that the failure of that regex is at fault here.

I did some more research: according to http://stackoverflow.com/questions/6814250/how-to-change-what-pcre-regexp-thinks-are-newlines-in-multi-line-mode the (*ANYCRLF) modifier is only available from PCRE 7.3, which corresponds to PHP 5.2.5, so you might want to list this somewhere as a dependency! We're running PHP 5.3.3 but only PCRE 6.6, and don't seem to be able to update (probably something to do with Unicode support or some such).

If I remove (*ANYCRLF) from the regex then it seems to work for all the PDFs on which I've tried it so far. I'd be interested to hear your thoughts on doing this. It's a hell of a lot easier than moving server.
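To illustrate what dropping the modifier changes, here is a rough Python sketch. The real regex in findObjectOffsets() isn't quoted in this thread, so the pattern below (an object header such as "12 0 obj" at the start of a line) is purely hypothetical. Python's "^" in multi-line mode only recognises \n, which is used here as a stand-in for a regex compiled without (*ANYCRLF); the second pattern spells the line breaks out by hand, assuming the regex engine supports lookbehind.

```python
import re

# Hypothetical pattern: an object header at the start of a line.
# In multi-line mode "^" only treats "\n" as a line break here.
LF_ONLY = re.compile(rb"^(\d+)[ \t]+(\d+)[ \t]+obj", re.MULTILINE)

# Same header, but "start of line" is written out explicitly:
# start of data, or immediately after a CR or LF byte.
ANY_NEWLINE = re.compile(rb"(?:\A|(?<=[\r\n]))(\d+)[ \t]+(\d+)[ \t]+obj")


def object_offsets(data: bytes, pattern) -> dict:
    """Map (object number, generation) to the byte offset of its header."""
    return {
        (int(m.group(1)), int(m.group(2))): m.start(1)
        for m in pattern.finditer(data)
    }
```

On data that uses only \r line breaks, the first pattern finds no object headers at all, while the second finds the same offsets it would find in an \n-delimited file.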

pauln commented 10 years ago

As far as I can tell, it was actually added in PCRE 7.1. That is still three years older than PCRE 8.02, the version bundled with PHP 5.3.3, and PCRE 6.6 is more than a year older than 7.1!

If you're absolutely stuck on PCRE 6.6, removing the (*ANYCRLF) should be fine as long as all of the PDFs you're reading in use unix-style newlines (\n). In a small sample of PDFs I have on hand, there is a good mix of \n and \r newlines, so only somewhere in the region of half will work without (*ANYCRLF). If all of the PDFs you'll be reading in with tcpdi_parser are generated by DOMPDF, though, you'll be fine, as it always uses \n. If you need to support PDFs with both styles of newline, you might be able to "sanitise" the data (replace \r with \n) before feeding it into tcpdi_parser, but I'll leave that to you.
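A rough idea of what such a "sanitise" step could look like, sketched in Python rather than PHP. The function name `normalise_newlines` is hypothetical. Two assumptions are baked in: each \r is replaced byte-for-byte so the offsets in the xref table stay valid, and stream bodies are left alone because they can hold binary data in which a 0x0D byte is not a newline. The stream detection is naive (a real parser would use the /Length entry), so treat this as a starting point only.

```python
import re

# Naive stream detection: the "stream" keyword, its line break, the data,
# and the closing "endstream". A real parser would honour /Length instead.
_STREAM = re.compile(rb"stream\r?\n.*?endstream", re.DOTALL)


def normalise_newlines(pdf_bytes: bytes) -> bytes:
    """Replace CR with LF outside stream bodies, keeping the length unchanged."""
    parts = []
    pos = 0
    for m in _STREAM.finditer(pdf_bytes):
        # Text before the stream: safe to rewrite one byte for one byte.
        parts.append(pdf_bytes[pos:m.start()].replace(b"\r", b"\n"))
        # The stream itself is copied through untouched.
        parts.append(m.group(0))
        pos = m.end()
    parts.append(pdf_bytes[pos:].replace(b"\r", b"\n"))
    return b"".join(parts)
```

Because the replacement never changes the number of bytes, object offsets and the startxref value should still point at the same places afterwards.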