smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Closing round bracket encoded in hexadecimal format breaks parsing #715

Open krzyc opened 1 month ago

krzyc commented 1 month ago

Description:

A closing round bracket encoded in hexadecimal format breaks parsing: the string is truncated. The truncation happens here: https://github.com/smalot/pdfparser/blob/4b86c6636d086ca7ea4780c07c2d7390321982b5/src/Smalot/PdfParser/Element/ElementString.php#L62-L74 Because this is my first contact with pdfparser, I am probably not competent to provide a safe patch.

Test

public function testHexadecimalEncodedBracket(): void
{
    $document = new Document();

    $testString = '()';
    $content = '<< /Contents <'.bin2hex($testString).'> >>';
    $header = Header::parse($content, $document);
    $this->assertEquals($testString, (string) $header->get('Contents'));
}

Expected output & actual output

The test should pass, but it fails with:

Failed asserting that two strings are equal.
--- Expected
+++ Actual
@@ @@
-'()'
+'('
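The `'('` in the actual output is what one would get if the decoded hex value were handled like a literal string and cut at the first unescaped `)`. The sketch below reproduces that outcome in isolation. It is an assumption about the mechanism, not a copy of the linked pdfparser code: `scanToFirstUnescapedClose()` is a hypothetical helper written for illustration, and wrapping the decoded value in parentheses is assumed rather than confirmed from the library source.

```php
<?php
// Hypothetical helper, NOT part of pdfparser: returns the text before the
// first ')' that is not escaped (preceded by an even number of backslashes).
function scanToFirstUnescapedClose(string $body): string
{
    $offset = 0;
    while (false !== ($pos = strpos($body, ')', $offset))) {
        // Count the backslashes immediately before this ')'.
        $slashes = 0;
        for ($i = $pos - 1; $i >= 0 && '\\' === $body[$i]; --$i) {
            ++$slashes;
        }
        if (0 === $slashes % 2) {
            // Unescaped ')': treated as the end of the string.
            return substr($body, 0, $pos);
        }
        $offset = $pos + 1;
    }

    // No unescaped ')' found: return the input unchanged.
    return $body;
}

$testString = '()';
$hex = bin2hex($testString);       // hex digits as they appear between < and >
$decoded = hex2bin($hex);          // decoding round-trips to the original bytes

// Assumption: the decoded value is wrapped in parentheses and re-scanned
// as if it were a literal string.
$literal = '('.$decoded.')';
$truncated = scanToFirstUnescapedClose(substr($literal, 1));

var_dump($decoded === $testString); // the hex decoding itself is lossless
var_dump($truncated);               // the re-scan drops the closing bracket
```

If this is indeed the cause, the decoding step is not at fault: a hex string is delimited by `<` and `>`, so its decoded bytes need no bracket scanning at all.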

k00ni commented 1 month ago

@krzyc this looks decent enough to me to have a deeper look. Can you create a pull request with your changes + the test and we will discuss there how to proceed?

krzyc commented 1 month ago

@k00ni I have opened a PR and am awaiting suggestions. It works for my case (extracting binary Contents from a Sig object). All tests are passing. Edit: there are more possible problems with bracket parsing; I have added further test cases, which I think should pass.