unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Provide the Numeric_Value character property #3014

Open hsivonen opened 1 year ago

hsivonen commented 1 year ago

ICU4X is missing an API for querying the Numeric_Value property of a character.

Use case: Gecko uses this property to decide whether a label mixes numbering systems, as part of its IDNA confusability checks. When a character whose General_Category is Decimal_Number (Nd) is encountered, its Numeric_Value is used to compute the zero character of that numbering system (the character's code point minus its numeric value). The first zero character found is remembered. If a later decimal digit yields a different zero character, the label is considered to contain multiple numbering systems and is flagged as a confusability risk.
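
For illustration, the check described above could be sketched roughly as follows. This is not Gecko's code: it is a Python approximation that uses the standard `unicodedata` module in place of ICU, and the names `zero_of` and `mixes_numbering_systems` are made up for this example.

```python
import unicodedata


def zero_of(ch):
    """Return the zero digit of ch's numbering system, or None if ch is not a decimal digit."""
    # Only General_Category=Nd characters are considered, as in the description above.
    if unicodedata.category(ch) != "Nd":
        return None
    value = unicodedata.decimal(ch, None)
    if value is None:
        return None
    # Subtracting the digit's value from its code point gives the zero of that system.
    return chr(ord(ch) - value)


def mixes_numbering_systems(label):
    """Return True if the label contains decimal digits from more than one numbering system."""
    first_zero = None
    for ch in label:
        zero = zero_of(ch)
        if zero is None:
            continue  # not a decimal digit
        if first_zero is None:
            first_zero = zero  # remember the first numbering system seen
        elif zero != first_zero:
            return True
    return False
```

With this sketch, a label that combines ASCII digits with Arabic-Indic digits is flagged, while a label that sticks to one set of digits is not.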

sffc commented 1 year ago

These need to be added to icuexport before they can be added to icu4x.

https://unicode-org.atlassian.net/browse/ICU-22284

markusicu commented 1 year ago

Note that Numeric_Value is easy when Numeric_Type=Decimal or Numeric_Type=Digit. And maybe you need/want it only if Numeric_Type=Decimal.

When Numeric_Type=Numeric, then the Numeric_Value can be negative, huge, or a fraction. These are rarely useful. https://www.unicode.org/reports/tr44/#Numeric_Value
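
To make the three cases concrete, here is a small Python sketch that approximates Numeric_Type with the standard `unicodedata` module. The function name `numeric_type` is hypothetical, and the mapping assumes that `decimal()`, `digit()` and `numeric()` reflect the corresponding UnicodeData.txt fields; it is not an ICU or ICU4X API.

```python
import unicodedata


def numeric_type(ch):
    """Approximate Numeric_Type of ch as "Decimal", "Digit", "Numeric" or "None"."""
    if unicodedata.decimal(ch, None) is not None:
        return "Decimal"
    if unicodedata.digit(ch, None) is not None:
        return "Digit"
    if unicodedata.numeric(ch, None) is not None:
        return "Numeric"
    return "None"


# A few sample characters: ASCII digit, superscript two, vulgar fraction one half.
for ch in ("5", "\u00b2", "\u00bd"):
    print(f"U+{ord(ch):04X}", numeric_type(ch), unicodedata.numeric(ch))
```

The fraction shows why a general Numeric_Value getter cannot simply return a small integer: its value is only representable as a rational or floating-point number.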

I would start with an API that returns the value of a decimal digit.

hsivonen commented 5 months ago

Gecko indeed only wants the numeric value if type == U_NT_DECIMAL || type == U_NT_DIGIT.

Indeed, it would make sense to have an API that returns the decimal value of a character if the character has such a value in the 0 to 9 range and is part of a range of ten consecutive characters whose values range from 0 to 9. (Is the second condition already a prerequisite for a character to have a Numeric_Value in the 0 to 9 range?)
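
A rough Python sketch of the API shape proposed above, with both conditions checked explicitly. The name `decimal_digit_value` is hypothetical, and `unicodedata.digit` is used here as a stand-in for the ICU data, so the sketch does not by itself distinguish nt=decimal from nt=digit.

```python
import unicodedata


def decimal_digit_value(ch):
    """Return 0..9 if ch sits in a run of ten consecutive characters valued 0..9, else None."""
    value = unicodedata.digit(ch, None)
    if value is None:
        return None
    # Candidate zero of the run: the code point minus the digit's value.
    zero = ord(ch) - value
    if zero < 0 or zero + 9 > 0x10FFFF:
        return None
    # Second condition: the ten code points starting at the zero have values 0..9 in order.
    for i in range(10):
        if unicodedata.digit(chr(zero + i), None) != i:
            return None
    return value
```

Characters such as superscript two (U+00B2) or circled digit one (U+2460) fail the second condition in this sketch because the code point where their zero would be is not a digit with value 0.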

It seems to me that icuexportdata dumps Numeric_Type but does not dump any form of Numeric_Value currently. Dumping the data probably needs a special case similar to how script extensions are dumped.

hsivonen commented 5 months ago

In Gecko's type == U_NT_DECIMAL || type == U_NT_DIGIT, it's unclear to me if the second half ever happens considering that the character is a character from the output of UTS 46 mapping. (That is, I'm not sure if any U_NT_DIGIT characters remain in UTS 46 mapping output.)

markusicu commented 5 months ago

Most of the nt=digit characters are not part of a contiguous 0..9 range of characters. In particular, there is often no zero. Some of them are simply nt=digit because their nv is 0..9 although they are part of a larger set of "numbered list bullets" where the nv>9 numbers have nt=numeric.

In UTS46, they are variously disallowed/mapped/valid.

See https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ant%3Ddigit%3A%5D&g=uts46&i=

It makes sense to me to have an API that returns the nv of nt=decimal but the nv of other characters is rarely useful to programmers. (I see that I am repeating myself.)

hsivonen commented 5 months ago

Thanks. From https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=%5B%3Ant%3Ddigit%3A%5D&g=uts46&i= it looks like what Gecko does actually works even for cases that don't have all 10 consecutive digits. Whether that means that Gecko really needs the numeric value also for nt=digit is still unclear to me. I'll need to check if the nt=digit cases are caught by another part of the IDNA display policy.

In particular, it seems problematic (in principle, not that circled digits occur in domains that much in practice) that UTS 46 allows different kinds of circled digits that are confusable with each other. (E.g. double-circled with single-circled in small sizes or serif with sans serif.)

markusicu commented 5 months ago

> In particular, it seems problematic [...] that UTS 46 allows different kinds of circled digits that are confusable with each other.

UTS46 operates on principles for identifiers. Security and confusables are a separate topic, covered in UTR36/UTS39.

sffc commented 4 months ago

This is a good second issue. It is a bit larger since it requires work on the ICU4C side. We need to first add the data to icuexportdata.zip:

https://unicode-org.atlassian.net/browse/ICU-22284

Once this is done, integrating it into ICU4X should be relatively simple if the data fits into existing data struct shapes. If we need to add new data structs, the ICU4X side could be a bit more challenging but still feasible.