smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0
2.3k stars 534 forks source link

Fix hexadecimal decoding (fixes #715) #716

Open krzyc opened 1 month ago

krzyc commented 1 month ago

Type of pull request

About

Fix for #715 as requested by @k00ni. Tests are passing but I am not confident that this will not break pdfparser because I lack knowledge of internal work of library and PDF format.

k00ni commented 1 month ago

There are syntax errors, please check: https://github.com/smalot/pdfparser/actions/runs/9259258917/job/25483604622?pr=716#step:6:13

krzyc commented 1 month ago

@k00ni I have fixed PHP 7.x incompatibility. Regarding testSpecialCharsEncodedAsHex test failed - I have still not found a method to fix it - I have added this test because during additional inspection I have discovered that there is more related problems with round brackets. But I think that PR is already in state that allows us to verify whether my approach is correct.

k00ni commented 1 month ago

Is there a reference in the PDF specification you referring to with this changes in general?

krzyc commented 1 month ago

Probably, but it was not intentional. Let it be "PDF Reference sixth edition" by Adobe Systems Incorporated (https://opensource.adobe.com/dc-acrobat-sdk-docs/pdfstandards/pdfreference1.7old.pdf) page 53: "Any characters may appear in a string except unbalanced parentheses and the backslash, which must be treated specially" e.g.

( Strings may contain balanced parentheses ( ) and
special characters ( * ! & } ^ % and so on ) . )

(Maybe tests should include this string instead of mine.)

I was mainly interested in reading Sig objects (https://www.adobe.com/devnet-docs/acrobatetk/tools/DigSig/Acrobat_DigitalSignatures_in_PDF.pdf), but pdfparser was truncating binary data so I was forced to dive into basics and found more problems as in attached tests.

My concern is I do not understand why pdfparser is parsing strings like this: https://github.com/smalot/pdfparser/blob/4b86c6636d086ca7ea4780c07c2d7390321982b5/src/Smalot/PdfParser/Element/ElementString.php#L62-L74 so I am not comfortable at modifying it.

I am looking for advice or help in fixing problem as I written earlier in #715.

krzyc commented 1 month ago

I have rewritten element string parser and now tests pass. I have two observations:

I am awaiting your feedback.

k00ni commented 1 month ago

Thanks for your effort! I try to give you feedback as soon as possible.

GreyWyvern commented 4 weeks ago

After an initial review, this appears to be an issue solely with ElementString::parse. There are zero changes required to the ElementHexa.php file IMHO. ElementHexa::parse sends ElementString::parse a perfectly valid string, and it's ElementString that proceeds to mess it up.

The only reason ElementHexa is involved at all is because it's a vector that allows this sort of tricky (but still valid!) string to reach ElementString, whereas normally ElementString would rarely see such strings.

ElementString::parse expects all internal parentheses to be escaped, and I would wager that in >99% of PDFs, they are. But according to the PDF Reference, balanced, unescaped parentheses are allowed. So this is a matter of just updating ElementString::parse to handle balanced, unescaped parentheses. I haven't fully examined the updated code in ElementString::parse in this PR, but as long as it does that, there should be no other edits necessary to ElementHexa.php.

Edit: In fact, when I update the test hexadecimal string to '\(\(\\\\' the output is '()\', not truncated at all, because all the parentheses were escaped as ElementString::parse expected.

k00ni commented 1 week ago

What are your plans here @krzyc?

As mentioned, the change in ElementHexa does not seem to be required.

GreyWyvern commented 1 week ago

I've been thinking about this some more and it's a little more complicated than it seems at first glance.

ElementString accepts (and currently expects) strings with escaped slashes and parentheses. 99% of the time strings from PDF document streams will be escaped, BUT unescaped, balanced parentheses are allowed. Unescaped backslashes are NOT allowed; according to the PDF reference if they aren't followed by a valid escape character, they should be ignored.

And that is the problem with strings from ElementHexa. It passes a completely unescaped (but valid) string, but if it contains literal backslashes, then further parsing it will remove them or result in an incorrect special character.

Strings that originate from ElementHexa should be passed to ElementString to create the appropriate object, but ElementString::parse() should probably not be called since the string is already unescaped. Or if necessary, a new function to process already unescaped strings can be added.