smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Adjust hex/octal string decoding #627

Closed GreyWyvern closed 1 year ago

GreyWyvern commented 1 year ago

Add a second check to confirm that a string is hexadecimal before passing it to the pack() function. This avoids PHP's "illegal hex digit" warning and resolves https://github.com/smalot/pdfparser/issues/499
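The guard described above can be sketched as follows. This is a Python illustration of the idea only, not the library's PHP code; the name `decode_hex_string` and the choice to return non-hex input unchanged are assumptions made for the example.

```python
import re

# Matches a non-empty run consisting only of hexadecimal digits.
HEX_ONLY = re.compile(r"[0-9A-Fa-f]+")


def decode_hex_string(text: str) -> str:
    """Decode the body of a PDF hex string, e.g. '48656C6C6F'.

    Hypothetical helper: input that is not purely hexadecimal is
    returned unchanged instead of being handed to the decoder.
    """
    # PDF allows whitespace inside hex strings; drop it before checking.
    compact = "".join(text.split())
    if not compact or not HEX_ONLY.fullmatch(compact):
        return text
    # Per the PDF Reference, a missing final digit is assumed to be 0.
    if len(compact) % 2:
        compact += "0"
    return bytes.fromhex(compact).decode("latin-1")


print(decode_hex_string("48656C6C6F"))  # prints Hello
print(decode_hex_string("not hex"))     # prints not hex
```

Validating first means the decoder never sees a character it cannot interpret, which is the situation that triggers the warning in PHP.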

PdfParser currently decodes only three-digit escaped octal codes, although one-, two- and three-digit codes are all allowed. See PDF Reference 1.7 Section 3.2 Objects (page 55): https://ia801001.us.archive.org/1/items/pdf1.7/pdf_reference_1-7.pdf

Modify the regexp to search for escaped octal codes from one to three digits, and exclude escaped backslashes. In sections of text that aren't escaped octal codes, un-escape backslashes and parentheses as described in PDF Reference 1.7 Section 3.2 Table 3.2. This resolves https://github.com/smalot/pdfparser/issues/470
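A minimal sketch of this decoding rule, written in Python for illustration rather than taken from the library's PHP. It uses a single left-to-right pass with one pattern, which may differ from how the change itself is structured; `decode_literal_escapes` is a hypothetical name, and other escapes from Table 3.2 (such as `\n`) are deliberately left untouched here.

```python
import re

# A backslash followed by either one to three octal digits (group 1)
# or an escaped backslash / parenthesis (group 2). Because matching
# proceeds left to right, "\\" is consumed as a pair before the digits
# after it can be mistaken for an octal code.
ESCAPE = re.compile(r"\\(?:([0-7]{1,3})|([\\()]))")


def decode_literal_escapes(text: str) -> str:
    """Decode octal codes and un-escape backslashes and parentheses."""

    def replace(match: re.Match) -> str:
        if match.group(1) is not None:
            # High-order overflow is ignored, so keep only the low byte.
            return chr(int(match.group(1), 8) & 0xFF)
        return match.group(2)

    return ESCAPE.sub(replace, text)


print(decode_literal_escapes(r"\101\102"))  # prints AB
print(decode_literal_escapes(r"\(x\)"))     # prints (x)
```

Under this rule `\199` decodes to the character with code 1 followed by the literal digits `99`, whereas `\\199` yields the text `\199`, since the escaped backslash is consumed first.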

Adjust the unit test testDecodeOctal() to escape the valid octal code \\1 so that the output matches the existing expected value AB \199.