tc39 / ecma262

Status, process, and documents for ECMA-262
https://tc39.es/ecma262/

RegExp `\p`: `Unknown` value for `Script` / `Script_Extensions` #3190

Open mathiasbynens opened 1 year ago

mathiasbynens commented 1 year ago

The following was reported to me by Nozomu Katō via email. I’m reposting it here with permission:

Because the ECMAScript 2023 specification stopped listing which values are valid for the Script and Script_Extensions properties of Unicode property escapes, ambiguity seems to have occurred about the handling of the "Unknown (Zzzz)" value.

Up to the previous version of the spec, this value was excluded from the table of script names that must be supported. However, I recently noticed that V8 does not throw an error when I type javascript:alert(/\p{sc=Unknown}/v.test('a')) in the address bar. It occurred to me that this value might not have been excluded deliberately but omitted from the table by accident, perhaps because its ranges are not explicitly listed in Scripts.txt. But the online demo of your regexpu does not seem to accept /\p{sc=Unknown}/. I am confused...

Although I personally do not think /\p{sc=Unknown}/ is useful, because it looks like it does the same thing as /\p{Unassigned}/, this ambiguity might hurt interoperability. So, I would like this issue to be addressed in the spec one way or another.
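A quick way to check what a given engine does with a particular script value is to compile the pattern inside a try/catch. The following is only a minimal sketch; the helper name `acceptsScriptValue` is hypothetical and not part of any spec or library, and the result for `Unknown` depends on the engine being tested.

```javascript
// Hypothetical helper: returns true if this engine compiles \p{sc=<value>}
// without throwing, false if it reports a SyntaxError.
function acceptsScriptValue(value, flags = 'u') {
  try {
    // Build the pattern from a string so an unsupported value surfaces as a
    // catchable SyntaxError at run time rather than a parse-time failure.
    new RegExp(`\\p{sc=${value}}`, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) return false;
    throw error; // anything else is unexpected
  }
}

// The outcome for 'Unknown' is exactly what differs between implementations.
console.log('Latin:', acceptsScriptValue('Latin'));
console.log('Unknown:', acceptsScriptValue('Unknown'));
```

Running the same probe in several engines or transpilers would show where they disagree.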

[…]

Incidentally, using only PropertyValueAliases.txt for enumerating script names is probably a dangerous option. This file lists Katakana_Or_Hiragana (Hrkt), which is likely to mean /[\p{sc=Katakana}\p{sc=Hiragana}]/, but if so, using this as a Script value seems to violate the "every Unicode code point is assigned a single Script property value" rule: https://www.unicode.org/reports/tr24/#Script_Values

According to my own check, PropertyValueAliases.txt enumerates 165 script names, while Scripts.txt lists code point ranges for only 163 of them, as of Unicode 15.0.0. The two missing names are Unknown and Katakana_Or_Hiragana.

So, supporting the script names listed in both PropertyValueAliases.txt and Scripts.txt, with or without Unknown, may be a safer option.
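The comparison described above can be sketched as a set difference over the two data files. This is an illustrative sketch, not the script used for the original check: the function names are hypothetical, and it assumes the usual UCD layout, where PropertyValueAliases.txt has lines like `sc ; Zzzz ; Unknown` and Scripts.txt has lines like `0041..005A ; Latin # ...`, with `#` starting a comment.

```javascript
// Split a UCD-style line into trimmed fields, ignoring comments.
function fieldsOf(rawLine) {
  const line = rawLine.split('#')[0].trim();
  return line ? line.split(';').map((field) => field.trim()) : [];
}

// Long script names declared in PropertyValueAliases.txt ("sc" rows only).
function scriptNamesFromAliases(text) {
  const names = new Set();
  for (const rawLine of text.split('\n')) {
    const fields = fieldsOf(rawLine);
    if (fields[0] === 'sc' && fields.length >= 3) names.add(fields[2]);
  }
  return names;
}

// Script names that have at least one code point range in Scripts.txt.
function scriptNamesFromScripts(text) {
  const names = new Set();
  for (const rawLine of text.split('\n')) {
    const fields = fieldsOf(rawLine);
    if (fields.length >= 2) names.add(fields[1]);
  }
  return names;
}

// Names that are aliased but have no explicit ranges.
function namesWithoutRanges(aliasesText, scriptsText) {
  const withRanges = scriptNamesFromScripts(scriptsText);
  return [...scriptNamesFromAliases(aliasesText)]
    .filter((name) => !withRanges.has(name))
    .sort();
}
```

Given the contents of both files as strings, `namesWithoutRanges(aliasesText, scriptsText)` returns the names that appear only in PropertyValueAliases.txt.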

michaelficarra commented 1 year ago

> ambiguity seems to have occurred about the handling of the "Unknown (Zzzz)" value

I don't think there's ambiguity. This value is listed in PropertyValueAliases.txt, therefore it is a valid value for Script.

> Incidentally, using only PropertyValueAliases.txt for enumerating script names is probably a dangerous option. This file lists Katakana_Or_Hiragana (Hrkt), which is likely to mean /[\p{sc=Katakana}\p{sc=Hiragana}]/, but if so, using this as a Script value seems to violate the "every Unicode code point is assigned a single Script property value" rule:

This issue should be taken up with the Unicode Consortium, not us. But given their alias stability policy (which I personally advocated for on behalf of TC39), this alias will never be removed, so I don't see anything that could be done about this, even if they wanted to.

> According to my own check, PropertyValueAliases.txt enumerates 165 script names, while Scripts.txt lists code point ranges for only 163 of them, as of Unicode 15.0.0. The two missing names are Unknown and Katakana_Or_Hiragana.

Theoretically, we could get away with switching to only support aliases which have code points assigned in Scripts.txt (in effect dropping Unknown and Katakana_Or_Hiragana), but someone would have to do that web compatibility research and convince implementations that it's worth the risk just to prohibit an unwanted feature. Is that what's being proposed here?