paquettg / php-html-parser

An HTML DOM parser. It allows you to manipulate HTML. Find tags on an HTML page with selectors just like jQuery.
MIT License

DOM Cleaner: mb_eregi_replace errors out with retry-limit-in-match #283

Open half0wl opened 3 years ago

half0wl commented 3 years ago

Reproduction:

>>> use PHPHtmlParser\Dom;
>>> $dom = new Dom;
>>> $dom->loadFromUrl("https://casper.com/gifts/?clickid=T02U6OVQYxyLUbdwUx0Mo36dUkB1HNWwiSMnwQ0");

Throws:

PHP Warning:  mb_eregi_replace(): mbregex search failure in php_mbereg_replace_exec(): retry-limit-in-match over in <stripped>/paquettg/php-html-parser/src/PHPHtmlParser/Dom/Cleaner.php on line 81
PHPHtmlParser\Exceptions\LogicalException with message 'mb_eregi_replace returned false instead of a string. Error when attempting to remove scripts 2.'

I've tried ini_set("pcre.backtrack_limit", "10000000000") after some Googling on the error, but it has no effect.

I can reproduce this on pages with huge <script></script> tags, typically when there's a giant JSON blob inside.

Deewde commented 2 years ago

I have the exact same problem, but with a different URL. I quick-fixed it by disabling script removal with $dom->setOptions((new Options())->setRemoveScripts(false)); but I would rather have a real fix, especially because there's a warning that keeping script tags could have unforeseen consequences.

Any help on this issue, please, @paquettg?

Deewde commented 2 years ago

Ok, I've fixed it without disabling tag removal by increasing the mb retry limit to 10 million. The self-documented php.ini describes this:

; This directive specifies maximum retry count for mbstring regular expressions. It is similar
; to the pcre.backtrack_limit for PCRE.
; Default: 1000000
;mbstring.regex_retry_limit=1000000

so I've used

ini_set("mbstring.regex_retry_limit", "10000000");

and all works fine on this front now
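
For anyone who would rather not raise the limit for the whole process, here is a minimal sketch of scoping the change to a single call and restoring the previous value afterwards. The helper name withMbRegexRetryLimit is hypothetical and not part of php-html-parser; it assumes PHP 7.4+ with mbstring built with mbregex support, since the directive does not exist otherwise.

```php
<?php
declare(strict_types=1);

/**
 * Run $fn with mbstring.regex_retry_limit temporarily set to $limit,
 * then restore the previous value. Hypothetical helper for illustration.
 */
function withMbRegexRetryLimit(int $limit, callable $fn)
{
    // ini_set() returns the old value as a string, or false on failure
    // (for example when the directive is not available in this build).
    $previous = ini_set('mbstring.regex_retry_limit', (string) $limit);
    try {
        return $fn();
    } finally {
        // Restore only if the directive was actually changed.
        if ($previous !== false) {
            ini_set('mbstring.regex_retry_limit', $previous);
        }
    }
}

// Usage with the library from this issue (requires paquettg/php-html-parser):
// $dom = new PHPHtmlParser\Dom();
// withMbRegexRetryLimit(10000000, function () use ($dom, $url) {
//     return $dom->loadFromUrl($url);
// });
```

The try/finally means the old limit comes back even if the load throws, so the higher limit only applies to the parse that needs it.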