Implement hardcoded ICU transliterators

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org

Implement hardcoded ICU transliterators #3910

Open skius opened 1 year ago

skius commented 1 year ago

For feature parity with ICU, we need the transliterators that ICU defines in code rather than through rule sources. A good (maybe even complete) starting point is this directory: https://github.com/unicode-org/icu/tree/main/icu4j/main/classes/translit/src/com/ibm/icu/text

For example, EscapeTransliterator.java is responsible for the many Any-Hex variants that exist.
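To illustrate what a code-based Any-Hex variant amounts to, here is a minimal sketch in Rust. It is not the ICU or ICU4X implementation: the `HexEscape` type and its fields are hypothetical, and the prefix, suffix, and minimum-digit values for the Unicode, XML, and Perl variants are assumptions based on my reading of EscapeTransliterator.java. It also ignores filters and the surrogate-pair handling that some variants need.

```rust
/// Hypothetical description of one Any-Hex variant (not an ICU4X type).
struct HexEscape {
    prefix: &'static str,
    suffix: &'static str,
    /// Minimum number of hex digits, padded with leading zeros.
    min_digits: usize,
}

// Assumed variant parameters; check them against EscapeTransliterator.java.
const HEX_UNICODE: HexEscape = HexEscape { prefix: "U+", suffix: "", min_digits: 4 };
const HEX_XML: HexEscape = HexEscape { prefix: "&#x", suffix: ";", min_digits: 1 };
const HEX_PERL: HexEscape = HexEscape { prefix: "\\x{", suffix: "}", min_digits: 1 };

impl HexEscape {
    /// Replaces every code point of `input` with its escaped form.
    fn transliterate(&self, input: &str) -> String {
        let mut out = String::new();
        for c in input.chars() {
            out.push_str(self.prefix);
            // Uppercase hex, zero-padded to at least `min_digits` digits.
            out.push_str(&format!("{:0width$X}", c as u32, width = self.min_digits));
            out.push_str(self.suffix);
        }
        out
    }
}
```

Under these assumptions, `HEX_UNICODE.transliterate("a")` would produce `U+0061` and `HEX_XML.transliterate("a")` would produce `&#x61;`. Describing a variant as data keeps the many Any-Hex variants down to one code path.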

Some transliterators also have related components in ICU4X, like Any-NFC, so those should be implemented by reusing the ICU4X components and data.

Users can create these transliterators using BCP-47 IDs that are defined in #3909.

robertbastian commented 1 year ago

Does Any-Hex exist as a rule file as well? I.e. is implementing it in code merely a performance optimisation?

skius commented 1 year ago

> Does Any-Hex exist as a rule file as well?

Not in the usual place, so if it did, I wouldn't know where.

> I.e. is implementing it in code merely a performance optimisation?

All code-based transliterators exist only for performance reasons and to save human implementation time, since transform rules can implement arbitrary transforms.

skius commented 1 year ago

(In the specific case of Any-Hex, it should even be fairly simple to generate rule files for the variants. I'm not sure if this also applies to NFC, etc.)

skius commented 1 year ago

There are open PRs (#3946, #3965) that add support for many such transliterators:

- Any-Hex/{many variants} - custom code-based implementations
- Any-{NFC, NFD, NFKC, NFKD} - existing ICU4X component-based implementations (based on icu_normalizer)
- Any-Remove/Any-Null - trivial implementations

These make most of the CLDR data usable, and can serve as examples for implementing the remainder. Notably still missing for full CLDR support:

- Any-{Upper, Lower, Title} - can probably use icu_casemap
- Any-BreakInternal - some legacy thing, likely a mix of code-based and component-based

ICU supports more than those. See the ICU4J directory for a full list.
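For a sense of how small the "trivial implementations" of Any-Null and Any-Remove mentioned above can be, here is a minimal sketch. The `Transliterate` trait and both struct names are hypothetical and are not the types used in the PRs. The sketch transforms whole strings, whereas a real transliterator operates on a range of the input and respects a filter.

```rust
/// Hypothetical whole-string transliterator interface (not the ICU4X API).
trait Transliterate {
    fn transliterate(&self, input: &str) -> String;
}

/// Any-Null: leaves the input unchanged.
struct AnyNull;

/// Any-Remove: deletes everything it is applied to.
struct AnyRemove;

impl Transliterate for AnyNull {
    fn transliterate(&self, input: &str) -> String {
        input.to_string()
    }
}

impl Transliterate for AnyRemove {
    fn transliterate(&self, _input: &str) -> String {
        String::new()
    }
}
```

Any-Remove is mostly useful together with a filter, where it deletes only the characters the filter matches. The sketch omits that part.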

skius commented 1 year ago

There are a few rule-defined Upper/Lower/Title transliterators for language-specific casemapping (e.g., Turkish). Our components support these in code, so we don't have to use the rule definitions and can instead use hardcoded transliterators.
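As a rough illustration of why language-specific casemapping is easy to hardcode, here is a sketch of Turkish-aware lowercasing using only the Rust standard library. It does not use icu_casemap, which is what an actual implementation would build on. The `CaseLocale` enum and `lowercase` function are hypothetical names, and the sketch skips the context-sensitive rules (for example, `I` followed by a combining dot above, or final sigma).

```rust
/// Hypothetical locale switch for this sketch (not an ICU4X type).
#[derive(Clone, Copy)]
enum CaseLocale {
    Root,
    Turkish,
}

/// Lowercases `input`, applying the Turkish dotted/dotless i mappings
/// when `locale` is `CaseLocale::Turkish`.
fn lowercase(input: &str, locale: CaseLocale) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match (locale, c) {
            // Turkish: capital I lowercases to dotless ı (U+0131).
            (CaseLocale::Turkish, 'I') => out.push('\u{0131}'),
            // Turkish: capital İ (U+0130) lowercases to plain i.
            (CaseLocale::Turkish, '\u{0130}') => out.push('i'),
            // Everything else: the locale-independent Unicode mapping.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

With this sketch, `lowercase("ISTANBUL", CaseLocale::Turkish)` gives `ıstanbul`, while the root locale gives `istanbul`.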

sffc commented 2 months ago

Is it correct that Lower was not yet implemented?

skius commented 2 months ago

> Is it correct that Lower was not yet implemented?

Correct! IIRC there are no dangling implementations; everything that is implemented should be linked in load_special: https://github.com/unicode-org/icu4x/blob/6b5a69c44e387b7258112e85325cffcaf96b1b67/components/experimental/src/transliterate/transliterator/mod.rs#L341-L386
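To show the general idea of linking hardcoded transliterators through a single lookup, here is a sketch of an ID-based dispatch. It does not reproduce the linked `load_special` function: `lookup_special`, the `Special` alias, the helper functions, and the lowercase ID strings are all assumptions made for illustration.

```rust
/// Hypothetical signature for a hardcoded whole-string transliterator.
type Special = fn(&str) -> String;

fn any_null(input: &str) -> String {
    input.to_string()
}

fn any_remove(_input: &str) -> String {
    String::new()
}

fn any_hex_unicode(input: &str) -> String {
    // Assumed Any-Hex/Unicode format: "U+" followed by at least 4 hex digits.
    input.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Returns the hardcoded implementation for `id`, or `None` if there is none.
/// IDs are matched case-insensitively (ASCII only).
fn lookup_special(id: &str) -> Option<Special> {
    match id.to_ascii_lowercase().as_str() {
        "any-null" => Some(any_null as Special),
        "any-remove" => Some(any_remove as Special),
        "any-hex/unicode" => Some(any_hex_unicode as Special),
        // An ID without a hardcoded implementation, such as a not yet
        // implemented Any-Lower, falls through to here.
        _ => None,
    }
}
```

A single match like this makes it easy to see which IDs have an implementation: anything that is not listed returns `None`.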