unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Transliterator is missing compiled data constructors #4890

Open kornelski opened 5 months ago

kornelski commented 5 months ago

The icu_transliterate crate has constructor methods that take:

P: DataProvider<TransliteratorRulesV1Marker> + DataProvider<CanonicalDecompositionDataV1Marker> + DataProvider<CompatibilityDecompositionSupplementV1Marker> + DataProvider<CanonicalDecompositionTablesV1Marker> + DataProvider<CompatibilityDecompositionTablesV1Marker> + DataProvider<CanonicalCompositionsV1Marker> + ?Sized,

This is a large complex-looking trait bound, which does not help me figure out what the actual type is needed here. Even when I try to browse the mentioned types like TransliteratorRulesV1Marker, it leads to more levels of abstraction, which all end at some HelloWorldProvider.

As a result, this API is so opaque and overly abstract, that I can't even tell if the transliterator is working at all, or is it a defunct stub that has no real implementations.

Some methods have shockingly large trait bounds. To me this is very discouraging, because I'm afraid that if I run into even slightest problems with this crate, I'll be faced with such wall of abstractions upon abstractions, requiring me to understand an abstraction with over 60 degrees of freedom, which seems absolutely an overkill when there's probably between 0 to 2 actual implementations of it.

sffc commented 5 months ago

Hi @kornelski!

Most functions with the large trait bounds have a link to this page which explains what they mean:

image

These bounds are what enable ICU4X to slice the data to exactly what is needed and avoid having a big monolithic data file as is required by ICU4C that doesn't have these bounds.

In general you should be using functions like RuleCollection::as_provider instead of RuleCollection::as_provider_unstable. Once you get things working with RuleCollection::as_provider, if you want to start building your own data, you can switch over to RuleCollection::as_provider_unstable, and the argument should be a reference to your custom data provider created as explained in this tutorial.

sffc commented 5 months ago

Actually I'll re-open this issue specifically for Transliterator because it looks like it doesn't currently have a compiled_data constructor.

sffc commented 5 months ago

Partial fix for the Transliterator constructor in https://github.com/unicode-org/icu4x/pull/4898

In the short term, you can get the type you need, the one that implements all the trait bounds, like this:

include!("tests/transliterate/data/baked/mod.rs");

where tests is the icu_experimental crate root which is components/experimental/tests in this repository. That file exposes a type named BakedDataProvider which is what you need.

It's not ideal that this is in a test directory, which is why I put up #4898. I think this is just an oversight from when we moved the code into the new icu_experimental crate.

robertbastian commented 5 months ago

tests/transliterate/data/baked contains some hand written tests, it does not contain CLDR data.