Closed markusicu closed 10 months ago
Maybe the tool is missing the special logic for scx to use "contains" not "equals".
Possible proof: This shows the Bengali digits: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DBengali%2CChakma%2CSyloti_Nagri%3A%5D&g=&i= This shows that the value with a different order of scripts is not recognized, and prints the known values: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DChakma%2CBengali%2CSyloti_Nagri%3A%5D&g=&i=
Note that multi-script sets are printed with commas but no spaces between scripts.
Co-debug with @macchiati
Other useful links:
Compare sets: https://util.unicode.org/UnicodeJsps/unicodeset.jsp?a=[:sc=Beng:]&b=[:scx=Beng:]
"Vedic" characters with scx info: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AName%3D%2FVEDIC%2F%3A%5D&g=&i=scx
Another indication that scx=Beng does not work right: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Asc%3Dbeng%3A%5D%5B%3Ascx%3Dbeng%3A%5D%5D&g=&i=sc+scx
Related: Mark's code refactoring idea in issue #195
Note: JSP UnicodeSet lookups for gc=__ when the value is a multi-category value (ex: L, C, ...) currently work, so there is already some special handling somewhere on a per-property basis in the JSPs code. Script_Extensions is another case where the = operator isn't a strict equality but rather has some special meaning that is specific to the property. What is done here should be a model for (or extensible to) a bug for supporting the Age property (#54).
It isn't specific to the property; rather this is the case for any multivalued property.
\p{prop=abc} is equivalent to "the set of all characters X such that prop(X) ∋ abc".
For single-valued properties, the interpretation is identical (treating the single value as a singleton set).
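A minimal Python sketch of the difference between "equals" and "contains" matching may help here. The table below is a tiny hand-written sample, not real tool data, and the names `SCX`, `match_equals`, and `match_contains` are hypothetical; none of this is the JSP or ICU implementation.

```python
# Toy Script_Extensions sample, hand-copied for illustration only.
# This is NOT a complete Unicode table.
SCX = {
    0x3031: {"Hira", "Kana"},          # VERTICAL KANA REPEAT MARK
    0x3042: {"Hira"},                  # HIRAGANA LETTER A
    0x30A2: {"Kana"},                  # KATAKANA LETTER A
    0x09E6: {"Beng", "Cakm", "Sylo"},  # BENGALI DIGIT ZERO
}


def match_equals(table, value):
    """Strict equality: the whole value set must equal the queried set."""
    wanted = set(value.split(","))
    return {cp for cp, values in table.items() if values == wanted}


def match_contains(table, value):
    """Contains semantics: prop(X) must include the queried value."""
    return {cp for cp, values in table.items() if value in values}


if __name__ == "__main__":
    # Under "equals", a character with several scripts is missed.
    print(sorted(hex(cp) for cp in match_equals(SCX, "Kana")))
    # Under "contains", it is found.
    print(sorted(hex(cp) for cp in match_contains(SCX, "Kana")))
```

With this sample, the strict-equality lookup for `Beng` returns nothing because the digit's value is a three-script set, which resembles the behavior reported above.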
+1. I ran into this problem with the Katakana and Hiragana scripts versus the "Hiragana,Katakana" script extension. It was very confusing and misleading until I figured out the bug. This should be given more priority: a lot of people tend to trust official tools as a source of truth, so this can spread misinformation about how Unicode Script_Extensions works and how it should be implemented.
The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up in \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana}, yet they are not listed. Instead, the lookups behave like the regular Script property, completely nullifying the point of the Script_Extensions property.
The https://util.unicode.org/UnicodeJsps/unicodeset.jsp tool has the same problem and displays it more clearly: compare \p{Script=Katakana} to \p{Script_Extensions=Katakana}. They should NOT be equal, yet the tool shows them as identical.
Interestingly enough the Regex tool understands Script_Extensions correctly as seen here:
The U+3031 character 〱 is a "Hiragana,Katakana" Script_Extensions character.
For reference, UTS #18 correctly describes the expected behavior here: #Script_Property, including a very similar example to my own.
> +1 ran into this problem with the Katakana and Hiragana scripts vs "Hiragana,Katakana" script extension, ...
> The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.
Actually, that particular syntax is neither documented nor intentionally supported. If you want the union of two scx values, then you need to use union syntax to do so, as in \p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}.
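As a rough illustration of the union approach, the sketch below builds each single-script set with "contains" semantics and then unions them. The sample table and the helper name `scx_set` are assumptions made for this example; this is plain Python set arithmetic, not UnicodeSet syntax or ICU code.

```python
# Toy Script_Extensions sample, hand-written for illustration only.
SCX = {
    0x3031: {"Hira", "Kana"},  # VERTICAL KANA REPEAT MARK
    0x3042: {"Hira"},          # HIRAGANA LETTER A
    0x30A2: {"Kana"},          # KATAKANA LETTER A
    0x0041: {"Latn"},          # LATIN CAPITAL LETTER A
}


def scx_set(table, script):
    """Characters whose Script_Extensions value includes `script`."""
    return {cp for cp, values in table.items() if script in values}


# Union of two single-value queries, one per script.
hira_or_kana = scx_set(SCX, "Hira") | scx_set(SCX, "Kana")

if __name__ == "__main__":
    print(sorted(hex(cp) for cp in hira_or_kana))
```

In this sample, U+3031 appears in both single-script sets, so it is in the union without any special multi-value query.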
Reported as https://unicode-org.atlassian.net/browse/ICU-21892 but ICU UnicodeSet implements scx as intended (see the ticket comments).
In the JSPs, [:scx=Deva:] does not contain Danda and Double Danda, and [:scx=Beng:] does not contain Bangla digits. Example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7Bsc%3DBengali%7D%5D+-+%5B%5Cp%7Bscx%3DBengali%7D%5D&c=on&g=gc&i=
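The set difference in that example can be sketched in Python as a sanity check. The data below is a small hand-written sample (the Danda entry lists only part of its real Script_Extensions value), and `sc_set` and `scx_set` are hypothetical helpers, not functions from the JSPs or ICU.

```python
# Toy sample: code point -> (Script, Script_Extensions).
# Hand-written for illustration; the Danda scx set is deliberately partial.
DATA = {
    0x0915: ("Deva", {"Deva"}),                  # DEVANAGARI LETTER KA
    0x0964: ("Zyyy", {"Deva", "Beng"}),          # DEVANAGARI DANDA (partial scx)
    0x09E6: ("Beng", {"Beng", "Cakm", "Sylo"}),  # BENGALI DIGIT ZERO
}


def sc_set(data, script):
    """Characters whose Script value is exactly `script`."""
    return {cp for cp, (sc, _scx) in data.items() if sc == script}


def scx_set(data, script):
    """Characters whose Script_Extensions value includes `script`."""
    return {cp for cp, (_sc, scx) in data.items() if script in scx}


# With "contains" semantics, sc=Bengali minus scx=Bengali should be empty.
missing = sc_set(DATA, "Beng") - scx_set(DATA, "Beng")

if __name__ == "__main__":
    print(missing)
    print(0x0964 in scx_set(DATA, "Deva"))
```

A non-empty difference, as the JSP link above shows, suggests that the lookup is comparing whole values rather than testing membership.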