smalot / pdfparser

PdfParser, a standalone PHP library, provides various tools to extract data from a PDF file.
GNU Lesser General Public License v3.0

Fix for adjacent escaped slashes and escaped parentheses in strings #711

Closed GreyWyvern closed 5 months ago

GreyWyvern commented 6 months ago

About

The current (string) replacement regexp in formatContent() looks back only two characters when checking for escaped slashes. If an escaped slash immediately precedes an escaped parenthesis, as in \\\( , the script misreads the sequence as an escaped slash followed by an unescaped parenthesis. As a result, the loop either never finds the "end" of the string (for an open parenthesis) or finds it prematurely (for a close parenthesis).
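
To make the failure mode concrete, here is a minimal Python sketch. It is not the library's PHP code and does not reproduce the actual regexp in formatContent(); both function names are hypothetical. It contrasts a check that looks back only two characters with one that counts every preceding backslash.

```python
def is_escaped_two_char_lookback(s: str, i: int) -> bool:
    """Hypothetical short-sighted check: s[i] counts as escaped only if it is
    preceded by a backslash that is not itself preceded by a backslash."""
    if i < 1 or s[i - 1] != "\\":
        return False
    return not (i >= 2 and s[i - 2] == "\\")


def is_escaped_by_parity(s: str, i: int) -> bool:
    """Reference check: s[i] is escaped if and only if an odd number of
    consecutive backslashes immediately precedes it."""
    count = 0
    j = i - 1
    while j >= 0 and s[j] == "\\":
        count += 1
        j -= 1
    return count % 2 == 1


# Escaped backslash immediately followed by an escaped open parenthesis.
sample = r"\\\("
paren = sample.index("(")

two_char_result = is_escaped_two_char_lookback(sample, paren)
parity_result = is_escaped_by_parity(sample, paren)
```

With this input the two-character check reports the parenthesis as unescaped, while the parity check reports it as escaped, which is the mismatch described above.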

This PR performs a string replace that removes all escaped slashes first and then all escaped parentheses; neither is needed when only checking for balanced, unescaped parentheses. It also adds the same slash removal to the inline images section above, for the same reason.
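
The strip-then-balance idea can be sketched as follows. This is a Python illustration under assumptions, not the code merged into the library; strip_escapes and find_string_end are hypothetical names, and the sample content is made up.

```python
def strip_escapes(text: str) -> str:
    """Remove escaped backslashes, then escaped parentheses.

    Order matters: removing the two-backslash pairs first ensures that a
    leftover backslash in front of a parenthesis really is an escape.
    """
    text = text.replace("\\\\", "")
    return text.replace("\\(", "").replace("\\)", "")


def find_string_end(text: str, start: int) -> int:
    """Return the index of the ')' that balances the '(' at `start`,
    or -1 if the parentheses never balance. Expects stripped text."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i
    return -1


# A string operand holding: a, escaped backslash, escaped '(', b
content = r"(a\\\(b) Tj"
stripped = strip_escapes(content)
end = find_string_end(stripped, 0)
```

Note that the returned index refers to the stripped copy, not the original text, so in this sketch the stripped copy is only suitable for the balance check itself.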

Resolves #709.

k00ni commented 6 months ago

@huihuangjiuai Does this fix #709 for you?