smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Strengthen check for UTF-8 conformity in formatContent() #704

Closed GreyWyvern closed 2 months ago

GreyWyvern commented 2 months ago

Type of pull request

About

In some cases, a binary string can pass as valid UTF-8 when checked with mb_check_encoding(..., 'UTF-8'). Use the comprehensive regular expression published by the W3C instead, so that formatContent() does not try to parse binary content. In addition to (strings), also check for the start of ID inline image data sections, which may contain binary data as well. Resolves #668.

Reference: https://www.w3.org/International/questions/qa-forms-utf-8.en
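
For illustration, here is a minimal sketch of the kind of check the referenced W3C page describes. It is not the code from this pull request: the function name looksLikeUtf8Text() is hypothetical, and the pattern is an adaptation of the W3C expression to PHP's preg_match(). The relevant difference from a pure well-formedness check is that control bytes such as 0x00 are well-formed UTF-8, whereas this pattern only admits tab, line feed, and carriage return below 0x20.

```php
<?php
// Hypothetical helper for illustration; not the code from this pull request.
// Returns true only when $bytes is well-formed UTF-8 AND contains no
// control bytes other than tab, line feed, and carriage return.
function looksLikeUtf8Text(string $bytes): bool
{
    // Adapted from the W3C page referenced above. There is no "u" modifier,
    // so PCRE matches raw bytes rather than decoded code points.
    $pattern = '/\A(?:
          [\x09\x0A\x0D\x20-\x7E]            # printable ASCII plus tab, LF, CR
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte sequences
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte sequences
        | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1 to 3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4 to 15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*\z/x';

    // preg_match() returns 1 on a match, 0 on no match, and false on error
    // (for example when a PCRE limit is hit); only 1 is treated as text.
    return preg_match($pattern, $bytes) === 1;
}
```

A caller could use such a helper to skip text parsing for any chunk where it returns false. Treating a preg_match() error as "not text" is a design choice of this sketch, made so that very long binary runs fail closed.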

Checklist for code / configuration changes

In case you changed the code/configuration, please read each of the following checkboxes as they contain valuable information: