php / php-src

The PHP Interpreter
https://www.php.net
Other
38.17k stars 7.75k forks source link

Check UTF-8 validity for all constant strings on compile time #10853

Open mvorisek opened 1 year ago

mvorisek commented 1 year ago

Description

Currently the UTF-8 string validity is checked on demand and cached the string if not interned.

However:

This is a feature request to: a) check UTF-8 validity on compile time on every const string b) store a flag if UTF-8 validity was checked or not

Thanks to #10436 the UTF-8 validity check is very fast and the compile time impact should be minimal.

iluuu1994 commented 1 year ago

https://github.com/php/php-src/commit/2b9acd37f0a13572684dde80e3e56d5c1b2ec045

We currently have a large performance problem when implementing lexers working on UTF-8 strings in PHP. This kind of code tends to perform a large number of matches at different offsets on a single string. This is generally fast. However, if /u mode is used, the full string will be UTF-8 validated on each match. This results in quadratic runtime.

This patch fixes the issue by adding a IS_STR_VALID_UTF8 flag, which is set when we have determined that the string is valid UTF8 and further validation is skipped.

A limitation of this approach is that we can't set the flag for interned strings. I think this is not a problem for this use-case which will generally work on dynamic data. If we want to use this flag for other purposes as well (mbstring?) then it might be worthwhile to UTF-8 validate strings during interning. But right now this doesn't seem useful.

As we do now use this flag in mbstring I suppose this makes sense. Although I still suspect that mbstring primarily operates on dynamic and thus non-interned strings.

youkidearitai commented 1 year ago

Is also that related #10409 ? That thread talk about mb_check_encoding adding flag IS_STR_VALID_UTF8 when UTF-8 string is valid, but it isn't still worked.

iluuu1994 commented 1 year ago

@youkidearitai Yes, string that appear in your source code would be marked as valid UTF-8 after implementing this. Thanks for pointing out that thread, updating the flag atomically is something I didn't consider.