Open mvorisek opened 1 year ago
https://github.com/php/php-src/commit/2b9acd37f0a13572684dde80e3e56d5c1b2ec045
We currently have a large performance problem when implementing lexers working on UTF-8 strings in PHP. This kind of code tends to perform a large number of matches at different offsets on a single string. This is generally fast. However, if /u mode is used, the full string will be UTF-8 validated on each match. This results in quadratic runtime.
This patch fixes the issue by adding a IS_STR_VALID_UTF8 flag, which is set when we have determined that the string is valid UTF8 and further validation is skipped.
A limitation of this approach is that we can't set the flag for interned strings. I think this is not a problem for this use-case which will generally work on dynamic data. If we want to use this flag for other purposes as well (mbstring?) then it might be worthwhile to UTF-8 validate strings during interning. But right now this doesn't seem useful.
As we do now use this flag in mbstring I suppose this makes sense. Although I still suspect that mbstring primarily operates on dynamic and thus non-interned strings.
Is also that related #10409 ? That thread talk about mb_check_encoding
adding flag IS_STR_VALID_UTF8
when UTF-8 string is valid, but it isn't still worked.
@youkidearitai Yes, string that appear in your source code would be marked as valid UTF-8 after implementing this. Thanks for pointing out that thread, updating the flag atomically is something I didn't consider.
Description
Currently the UTF-8 string validity is checked on demand and cached the string if not interned.
However:
This is a feature request to: a) check UTF-8 validity on compile time on every const string b) store a flag if UTF-8 validity was checked or not
Thanks to #10436 the UTF-8 validity check is very fast and the compile time impact should be minimal.