smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Filter ElementHexa::decode() of non-hex chars #687

Closed: GreyWyvern closed this 3 months ago

GreyWyvern commented 3 months ago

Type of pull request

About

Add a preg_replace() call to ElementHexa::decode() so that all non-hexadecimal characters are filtered out of incoming strings. Resolves #683.

Also add a check for the BOM (feff), if it exists. The existing function checks for the characters '00' at the beginning of the string to decide whether to decode the string in 4-byte or 2-byte fashion. It does not account for the 4-byte BOM, so it decodes such a string in 2-byte fashion and depends on later functions (in this case Parser::parseHeaderElement()) to repair the incorrectly decoded contents. Checking for the BOM allows ElementHexa::decode() to return the correctly decoded contents the first time.
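
For illustration, here is a minimal standalone sketch of the behaviour described above. It is not the library's actual ElementHexa::decode(): the function name decodeHexSketch() is hypothetical, the regular expression is one plausible way to do the filtering, and dropping the BOM from the output is an assumption of this sketch. It handles only single 4-digit code units (no surrogate pairs).

```php
<?php
// Hypothetical sketch, not the library's code.
function decodeHexSketch(string $value): string
{
    // Filter out everything that is not a hexadecimal digit
    // (whitespace, line breaks, angle brackets, stray bytes).
    $value = (string) preg_replace('/[^0-9a-f]/i', '', $value);

    // Decode in 4-digit groups if the string starts with '00' or with the BOM.
    $hasBom = 'feff' === strtolower(substr($value, 0, 4));
    $wide = $hasBom || '00' === substr($value, 0, 2);

    // Assumption of this sketch: the BOM itself is not part of the output.
    if ($hasBom) {
        $value = (string) substr($value, 4);
    }

    $step = $wide ? 4 : 2;
    $text = '';
    $length = strlen($value);
    for ($i = 0; $i < $length; $i += $step) {
        $code = hexdec(substr($value, $i, $step));
        // 4-digit groups become UTF-8 via a numeric entity; 2-digit groups are raw bytes.
        $text .= $wide
            ? html_entity_decode('&#'.$code.';', ENT_QUOTES | ENT_HTML5, 'UTF-8')
            : chr($code);
    }

    return $text;
}
```

With this sketch, a BOM-prefixed string such as 'FEFF00480069' takes the same 4-digit path as '00480069' instead of falling through to the 2-digit branch, and embedded whitespace no longer shifts the digit grouping.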

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information: