Open skius opened 1 year ago
Does Any-Hex exist as a rule file as well? I.e. is implementing it in code merely a performance optimisation?
Does Any-Hex exist as a rule file as well?
Not in the usual place, so if it did, I wouldn't know where.
I.e. is implementing it in code merely a performance optimisation?
All code-based transliterators exist merely for performance reasons and to save human implementation time, since transform rules can implement arbitrary transforms.
(In the specific case of Any-Hex, it should even be fairly simple to generate rule files for it. I'm not sure whether this also applies to NFC, etc.)
There are open PRs (#3946, #3965) that add support for many such transliterators:
icu_normalizer
These make most of CLDR data usable, and can serve as examples for implementing the remainder. Notably still missing for full CLDR support:
icu_casemap
ICU supports more than those. See the ICU4J directory for a full list.
There are a few rule-defined Upper/Lower/Title transliterators for language-specific casemapping (e.g., Turkish). Our components support these in code, so we don't have to use the rule definitions and can instead use hardcoded transliterators.
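To illustrate why language-specific casemapping needs its own transliterators, here is a minimal sketch that uses only the Rust standard library. The function `turkish_lower` is hypothetical and not part of any ICU4X API. It covers only the two dotted/dotless I mappings and ignores the combining-dot handling that full Turkish casemapping requires.

```rust
/// Hypothetical, simplified Turkish lowercasing (illustration only).
/// Assumption: only 'I' -> 'ı' and 'İ' -> 'i' differ from the default
/// mapping; real Turkish rules also handle "I" followed by U+0307.
fn turkish_lower(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // Dotless capital I lowercases to dotless ı in Turkish.
            'I' => out.push('ı'),
            // Dotted capital İ lowercases to plain i in Turkish.
            'İ' => out.push('i'),
            // Everything else uses the language-independent mapping.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

With this sketch, `turkish_lower("DIŞ")` and the default `"DIŞ".to_lowercase()` produce different strings, which is the kind of difference a language-specific Lower transliterator has to capture.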
Is it correct that Lower was not yet implemented?
Is it correct that Lower was not yet implemented?
Correct! IIRC there are no dangling implementations; everything should be linked in load_special:
https://github.com/unicode-org/icu4x/blob/6b5a69c44e387b7258112e85325cffcaf96b1b67/components/experimental/src/transliterate/transliterator/mod.rs#L341-L386
For feature parity with ICU we need some transliterators that ICU defines in code rather than through rule sources. A good (maybe even complete) starting point is this directory: https://github.com/unicode-org/icu/tree/main/icu4j/main/classes/translit/src/com/ibm/icu/text
For example, EscapeTransliterator.java is responsible for the many Any-Hex variants that exist.
Some transliterators also have related components in ICU4X, like Any-NFC, so those should be implemented by reusing the ICU4X components and data.
Users can create these transliterators using BCP-47 IDs that are defined in #3909.
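As a rough illustration of what an Any-Hex style transliterator does, the sketch below escapes every code point using a configurable prefix, suffix, and minimum digit count. `HexFormat`, the three constants, and `escape_all` are hypothetical names for this example and are not ICU4X or ICU APIs. The format parameters are modelled on my recollection of ICU's variants and should be checked against EscapeTransliterator.java.

```rust
/// Hypothetical description of one escape format (not an ICU4X type).
struct HexFormat {
    prefix: &'static str,
    suffix: &'static str,
    /// Minimum number of hex digits; shorter values are zero-padded.
    min_digits: usize,
}

// Assumed parameters, loosely following ICU's Any-Hex variants.
const UNICODE: HexFormat = HexFormat { prefix: "U+", suffix: "", min_digits: 4 };
const XML: HexFormat = HexFormat { prefix: "&#x", suffix: ";", min_digits: 1 };
const PERL: HexFormat = HexFormat { prefix: "\\x{", suffix: "}", min_digits: 1 };

/// Replaces every code point of `input` with its uppercase hex escape.
fn escape_all(input: &str, format: &HexFormat) -> String {
    let mut out = String::new();
    for c in input.chars() {
        out.push_str(format.prefix);
        // Uppercase hex, zero-padded to at least `min_digits` digits.
        out.push_str(&format!("{:0width$X}", c as u32, width = format.min_digits));
        out.push_str(format.suffix);
    }
    out
}
```

For instance, `escape_all("a", &UNICODE)` yields `U+0061`. A real implementation would also need to respect filters and the transliterator's replaceable-text interface, which this sketch leaves out.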