unicode-org / unicodetools

home of unicodetools and https://util.unicode.org JSPs
https://util.unicode.org

UnicodeSet/property tools: Script_Extensions missing characters #192

Closed markusicu closed 10 months ago

markusicu commented 2 years ago

Reported as https://unicode-org.atlassian.net/browse/ICU-21892 but ICU UnicodeSet implements scx as intended (see the ticket comments).

In the JSPs, [:scx=Deva:] does not contain Danda and Double Danda, and [:scx=Beng:] does not contain Bangla digits.

Example: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5Cp%7Bsc%3DBengali%7D%5D+-+%5B%5Cp%7Bscx%3DBengali%7D%5D&c=on&g=gc&i=

markusicu commented 2 years ago

Maybe the tool is missing the special logic for scx to use "contains" not "equals".

Possible proof: This shows the Bengali digits: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DBengali%2CChakma%2CSyloti_Nagri%3A%5D&g=&i=

This shows that the value with a different order of scripts is not recognized, and prints the known values: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ascx%3DChakma%2CBengali%2CSyloti_Nagri%3A%5D&g=&i=

Note that multi-script sets are printed with commas but no spaces between scripts.
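To make the "contains" versus "equals" hypothesis concrete, here is a minimal sketch in Python. It is not the JSP code: the table, the function names, and the comma-joined string representation are assumptions made for illustration, with values taken from the behavior described above.

```python
# Toy table: code point -> Script_Extensions value as a comma-joined string,
# in the order the tool prints it. Hand-picked for illustration only.
SCX = {
    0x0995: "Bengali",                      # BENGALI LETTER KA
    0x09E6: "Bengali,Chakma,Syloti_Nagri",  # BENGALI DIGIT ZERO
}

def match_equals(query):
    """Hypothesized buggy behavior: compare the whole value string."""
    return {cp for cp, value in SCX.items() if value == query}

def match_contains(query):
    """Intended behavior: the queried script is one of the listed scripts."""
    return {cp for cp, value in SCX.items() if query in value.split(",")}

# Equality only finds the digit when the full list is given in the stored order.
print(match_equals("Bengali"))                      # {2453}
print(match_equals("Chakma,Bengali,Syloti_Nagri"))  # set()
# Containment finds both the letter and the digit.
print(sorted(match_contains("Bengali")))            # [2453, 2534]
```

If the tool really compares whole values, that would explain both observations: the digits appear only for the exact three-script value, and reordering the scripts matches nothing.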

Co-debug with @macchiati

Other useful links:

Compare sets: https://util.unicode.org/UnicodeJsps/unicodeset.jsp?a=[:sc=Beng:]&b=[:scx=Beng:]

"Vedic" characters with scx info: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3AName%3D%2FVEDIC%2F%3A%5D&g=&i=scx

Another indication that scx=Beng does not work right: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%5B%3Asc%3Dbeng%3A%5D%5B%3Ascx%3Dbeng%3A%5D%5D&g=&i=sc+scx

markusicu commented 2 years ago

Related: Mark's code refactoring idea in issue #195

echeran commented 2 years ago

Note: JSP UnicodeSet lookups for gc=__ when the value is a multi-category value (e.g. L, C, ...) currently work, so there is already some special handling somewhere on a per-property basis in the JSPs code. Script_Extensions is another case where the = operator isn't a strict equality but rather has some special meaning that is specific to the property. What is done here should be a model for (or extensible to) a bug for supporting the Age property (#54).
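As a rough illustration of what "special handling" for gc could look like, the sketch below expands a grouping value such as L into its member categories before comparing. The `GC_GROUPS` table and `has_gc` helper are hypothetical names, not taken from the JSPs code; only `unicodedata.category` is a real standard-library call.

```python
import unicodedata

# Grouping values and the two-letter categories they stand for.
# Only two groups are listed here; the real property has more.
GC_GROUPS = {
    "L": {"Lu", "Ll", "Lt", "Lm", "Lo"},
    "C": {"Cc", "Cf", "Cs", "Co", "Cn"},
}

def has_gc(ch, value):
    """True if ch's General_Category is value, or belongs to the group value."""
    actual = unicodedata.category(ch)
    # A non-group value falls back to a singleton set, i.e. plain equality.
    return actual in GC_GROUPS.get(value, {value})

print(has_gc("A", "L"))   # True  (category Lu is in group L)
print(has_gc("A", "Lu"))  # True
print(has_gc("1", "L"))   # False (category Nd)
```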

macchiati commented 2 years ago

It isn't specific to the property; rather this is the case for any multivalued property.

\p{prop=abc} is equivalent to "the set of all characters X such that prop(X) ∋ abc".

For single-valued properties, the interpretation is identical (treating the single value as a singleton set).
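A small Python sketch of that definition, assuming toy property tables (the names `SCRIPT`, `SCRIPT_EXTENSIONS`, `as_set`, and `chars_with` are made up for this example and the tables cover only three characters):

```python
# Toy property tables, hand-picked for illustration only.
SCRIPT = {  # single-valued property
    0x3031: "Common",    # VERTICAL KANA REPEAT MARK
    0x3042: "Hiragana",  # HIRAGANA LETTER A
    0x30A2: "Katakana",  # KATAKANA LETTER A
}
SCRIPT_EXTENSIONS = {  # multivalued property
    0x3031: {"Hiragana", "Katakana"},
    0x3042: {"Hiragana"},
    0x30A2: {"Katakana"},
}

def as_set(value):
    """Treat a single value as a singleton set; pass real sets through."""
    return frozenset([value]) if isinstance(value, str) else frozenset(value)

def chars_with(prop, wanted):
    """All code points X such that prop(X) contains wanted."""
    return {cp for cp, value in prop.items() if wanted in as_set(value)}

print(sorted(hex(cp) for cp in chars_with(SCRIPT, "Katakana")))
# ['0x30a2']
print(sorted(hex(cp) for cp in chars_with(SCRIPT_EXTENSIONS, "Katakana")))
# ['0x3031', '0x30a2']
```

The same `chars_with` function serves both properties, which is the point: no per-property rule is needed once single values are viewed as singleton sets.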

en0ent1ty commented 1 year ago

+1. I ran into this problem with the Katakana and Hiragana scripts versus the "Hiragana,Katakana" script extension. It was very confusing and misleading until I figured out the bug. This should be given more priority: a lot of people trust official tools as a source of truth, so this can spread misinformation about how Unicode Script_Extensions works and how it should be implemented.

The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.

The https://util.unicode.org/UnicodeJsps/unicodeset.jsp tool has the same problem and shows it more clearly: compare \p{Script=Katakana} to \p{Script_Extensions=Katakana}. They should NOT be equal, yet the tool shows them as identical.

Interestingly enough the Regex tool understands Script_Extensions correctly as seen here:

The U+3031 character is a "Hiragana,Katakana" Script_Extensions character.

For reference, UTS18 correctly describes the expected behavior here: #Script_Property, including an example very similar to my own.

markusicu commented 1 year ago

+1 ran into this problem with the Katakana and Hiragana scripts vs "Hiragana,Katakana" script extension, ...

The characters listed by \p{Script_Extensions=Hiragana,Katakana} are expected to show up on \p{Script_Extensions=Katakana} and \p{Script_Extensions=Hiragana} but yet they are not listed and instead behave like the regular Script property, completely nullifying the point of the Script_Extensions property.

Actually, that particular syntax is neither documented nor intentionally supported. If you want the union of two scx values, then you need to use union syntax to do so, as in \p{Script_Extensions=Hiragana}\p{Script_Extensions=Katakana}.
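To illustrate the difference between the union and a match on the exact two-script value, here is a minimal Python sketch over a toy table. The table and the helper names `scx_contains` and `scx_exactly` are assumptions for this example, not part of any Unicode tool.

```python
# Toy Script_Extensions table, hand-picked for illustration only.
SCX = {
    0x3031: frozenset({"Hiragana", "Katakana"}),  # VERTICAL KANA REPEAT MARK
    0x3042: frozenset({"Hiragana"}),              # HIRAGANA LETTER A
    0x30A2: frozenset({"Katakana"}),              # KATAKANA LETTER A
}

def scx_contains(script):
    """Characters whose Script_Extensions include the given script."""
    return {cp for cp, scripts in SCX.items() if script in scripts}

def scx_exactly(scripts):
    """Characters whose Script_Extensions equal the given set of scripts."""
    wanted = frozenset(scripts)
    return {cp for cp, value in SCX.items() if value == wanted}

# Union of the two scx sets: everything used with Hiragana or Katakana.
union = scx_contains("Hiragana") | scx_contains("Katakana")
# Intersection: only the characters shared by both scripts.
both = scx_contains("Hiragana") & scx_contains("Katakana")

print(sorted(hex(cp) for cp in union))  # ['0x3031', '0x3042', '0x30a2']
print(sorted(hex(cp) for cp in both))   # ['0x3031']
print(sorted(hex(cp) for cp in scx_exactly(["Hiragana", "Katakana"])))
# ['0x3031']
```

In this toy table the exact-value match coincides with the intersection, but in general it need not: a character with a third script in its Script_Extensions would be in the intersection without matching the exact two-script value.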