Design a cohesive solution for supported locales

sffc commented 4 years ago

I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.

The main problem, of course, is that the locale fallback chain is an essential piece of whether or not a locale is supported. If you have locale data for "en" and "en_001", but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.

I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.

ECMA-402 has the concept of supportedLocalesOf. Although it's not clear in the MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's behavior of getAvailableLocales, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.

We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.

sffc commented 4 years ago

Putting this on the Version 1 backlog. We still need help on this.

sffc commented 2 years ago

Here's my current thinking.

There are three levels of locale support in a data provider:

Stored Locales: Given a particular key, these are the locales that have an exact match without any fallbacking. This list is largely internal to the data provider. This is what we currently expose in the iterable data provider.
Resolved Locales: These are the locales that are expressly guaranteed to have an exact match over all keys that the provider supports. This is what can be publicly consumable to answer the question of "supported locales". Resolved locales should generally include the language-script plus zero or more regional variants.
Fallbackable Locales: This is the unbounded set of locales such that when vertical fallback is enabled, a resolved locale is reachable.

A particular provider must store metadata listing the resolved locales, which increases the complexity. I don't see a way around this. In the fs provider, the list is stored in the manifest.json file, and in the blob provider, it is stored alongside the big zeromap.

The list is could be exposed through the following trait:

pub trait SupportedLocales {
    /// Returns resolved locales or MissingResourceKey if the key is not supported
    fn supported_locales_for_key(&self, key: ResourceKey) -> Result<Vec<ResourceOptions>, DataError>;
}

In ForkByKeyProvider, the implementation works the same as load_payload: we loop over the providers until finding one that doesn't return MissingResourceKey.

To answer the ECMA-402 question "what are the supported locales for NumberFormat", you use this API with the key decimal/symbols@1. When we add more versions of these data keys, we should likely loop over all versions of the key that our code supports and take the union.

Another mode that aligns closer to ECMA-402 is

pub trait SupportedLocales {
    /// Returns the subset of locales that are fallbackable locales.
    fn supported_locales_of(&self, key: ResourceKey, locales: Vec<ResourceOptions>) -> Result<Vec<ResourceOptions>, DataError>;
}

ForkByKeyProvider would again look for the first provider that supports the key, and then return its result.

An issue here is that vertical fallback would need to be invoked. Therefore, rather than having this API, it may be better and cheaper to just run the full vertical fallback stack, but stop short of deserializing/downcasting.

sffc commented 2 years ago

I propose considering what I proposed above as the course of action, and removing this from the 1.0 critical path.

jedel1043 commented 1 year ago

I've been implementing the Intl spec for the past couple of months, and I can give a bit of my perspective on this issue.

From what I could see while implementing the locale resolution algorithms defined in the ECMA402 spec, what the API seems to try to accomplish is to determine if a specific locale will return "correct" results if used as the locale of a specific service. Everything else is just taking that and extending it to several different APIs that filter/choose/tune a list of user-provided locales to ensure that all locales passed to the services are always "valid" in a sense.

Maybe this means that the providers don't explicitly need a SupportedLocales feature, but more like a way to pass them a locale and a key/service to know if that key/service using that locale will return "correct" results.

hsivonen commented 1 year ago

Collator is special in the sense that in the absence of natural-language output, und is more applicable as fallback than for services that involve natural-language output. Currently, if you request e.g. en, the collator in ICU4X falls back to und, which is correct in terms of comparison behavior, but existing Web-exposed behavior of Intl.Collator is that languages for which the root collation is known to be valid (without reordering), such as en, fr, etc., are supposed to behave in the outward API as if language-specific data existed for them. Furthermore, in Firefox and Chrome 1) und is treated as unsupported by supportedLocalesOf and 2) locales with actually-unsupported subtags are treated as supported by supportedLocalesOf if the language counts as supported.

console.log(Intl.Collator.supportedLocalesOf(['und', 'und-u-co-emoji', 'ban', 'id-u-co-pinyin', 'de-ID', "en", "fr", "el"])); logs the array [ "id-u-co-pinyin", "de-ID", "en", "fr", "el" ] in both Firefox and Chrome.

(In Safari, the array is prepended with "en-US-u-va-posix", which is just weird. Safari turns "und" (but not "und-u-co-emoji") into that, even though in CLDR the POSIX variant of English is not the same as the root. Weird.)

Boa currently logs the whole input array, because of "// TODO: ugly hack to accept locales that fallback to "und" in the collator/segmenter services".

It's not clear to me how "Resolved Locales" above should capture the "root is known to be valid" concept. As I understand it, we don't currently store this information. To the extent the provider infrastructure supports aliases, I guess one possibility would be list en, fr, etc. as aliases of und.

sffc commented 1 year ago

Based on feedback from @anba on https://github.com/tc39/ecma402/issues/830, one approach which would solve this problem fairly cleanly would be for datagen to record which locales were used when generating data, and then use that set of locales across components as the availableLocales in ECMA-402. With this model, it doesn't matter if individual data keys resolve to different locales. So long as the locale was included at datagen time, we know that it resolves to valid data.

Here are a few forms this solution could take:

Add a new trait or data key that retrieves the list of locales.
- Pro: Easy to implement
- Con: Unclear behavior when multiple source data providers are present.
- Con: Doesn't discriminate based on different supported locales by data key
Add functions on the source providers to retrieve the list of locales.
- Pro: Data exposed very close to the source.
- Con: Need to invent a solution for each individual provider (fs, blob, and bake).
Have datagen optionally save the resolved locales to its own data file and let the client figure out how to plumb that into where they need it.
- Pro: Easy to implement
- Con: More challenging for clients

robertbastian commented 1 year ago

Add a new trait or data key that retrieves the list of locales. Pro: Easy to implement Con: Unclear behavior when multiple source data providers are present.

Combining multiple providers needs to be solved for all three cases, and this is probably the cleanest for both clients and us. The only trip-up would be ForkByKeyProvider, which we can specialise to return empty and log a warning. The bigger issue here I think is where would this data struct be defined, and where would compiled data for it be included? It would probably have to be icu_provider, which has been data-free so far.

hsivonen commented 1 year ago

I understand how recording the datagen-time locales would allow for a) implementing "lookup" and b) filtering a list of OS-provided preferred languages to pick the most-preferred system language that is also an ICU4X available locale (for the purpose of computing an ECMA-402-compatible notion of host locale).

It's not immediately clear to me how storing this list would allow the resolved locale concept of ECMA-402 to be implemented in the case where "best fit" means delegation to ICU4X's own matching of request to available data.

AFAICT, implementing "best fit" on top of ICU4X requires being able to figure out a) whether there was a fallback all the way to root such that root isn't known-valid for the requested locale (so that the ECMA-402 glue code can proceed to trying the next locale on the list of requested locales or the default locale if the last locale on the request list fell back all the way to root without the root being known to be valid for the request) and b) if ICU4X didn't fall back all the way to root or fell back to root such that the root is known to be valid for the request, which locale would be the ECMA-402 resolved locale.

How would these be implemented given this list?

(Examples worth considering: For the collator requesting fr or de is known to be equivalent to requesting und. However, requesting fr-CA or de-u-co-phonebk are not equivalent to requesting und. If the requested locale has fr-CA plus some other subtags or de-u-co-phonebk with some other subtags, a) is it possible for the ICU4X fallback to end up ignoring -CA or -u-co-phonebk ending up with und-equivalent and, more generally, b) how should the ECMA-402 glue code figure out the resolved locale if the request had extra subtags and the datagen-time recording of processed locales contains fr-CA and de-u-co-phonebk?)

robertbastian commented 1 year ago

I think this would require the hybrid data mode that Shane mentioned in the ECMA issue. fr and de would have to be explicitly included as keys that point to und data, whereas abc would not be a key and would fall back to und data, which is a difference we can detect.

sffc commented 1 year ago

Discuss with:

@sffc
@robertbastian
@younies
@Manishearth

sffc commented 1 year ago

Another possible solution:

Add a data key that contains a unique entry for each locale. For example, supported@1/en-GB which contains a single string "en-GB" so it doesn't get deduplicated.
- Pro: Easy to implement
- Pro: Works with multiple source data providers
- Pro: The key can be automatically included or excluded based on the use of a supported locales API without any additional infrastructure work
- Con: Doesn't discriminate based on different supported locales by data key

An advantage of (4) is that you can query the data provider and get back the resolved supported locale. For example, you can request "it-JP" and get back "it" (if there was no CLDR data for "it-JP"), or you can request a non-Basic locale such as "arc" and get back "und" which means that the locale is not "supported".

sffc commented 1 year ago

What crate should this go into? I lean toward putting it with the rest of the fallback-type code in icu_locid_transform.

Manishearth commented 1 year ago

locid makes sense. We could also have a custom provider that stores all locales loaded that you can mutate at runtime and query stuff like this, to keep track of what you have and haven't loaded.

This key solution also makes it easier to reason about data generated in fallback mode.

The potential footgun is of course if you generate data for different sets of locales for different keys. Should be fine to just clearly document.

sffc commented 1 year ago

How does this look?

#[icu_provider::data_struct(SupportedLocaleV1Marker)]
pub struct SupportedLocaleV1<'data> {
    pub locale: Cow<'data, [u8]>,
}

pub struct SupportedLocale {
    data: DataPayload<SupportedLocaleV1Marker>,
}

impl SupportedLocale {
    pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
        provider: &P, locale: &DataLocale
    ) -> Result<Self> { ... }

    pub fn is_und(&self) -> bool { ... }

    pub fn to_locid(&self) -> LanguageIdentifier { ... }

    pub fn to_locale(&self) -> Locale { ... }
}

Manishearth commented 1 year ago

sgtm

sffc commented 1 year ago

@robertbastian to make a counter-proposal.

jedel1043 commented 1 year ago

Another option would be to directly store the locale, and just use the marker to load the supported locale:

#[icu_provider::data_struct(SupportedLocaleV1Marker)]
pub struct SupportedLocaleV1<'data> {
    pub locale: Cow<'data, [u8]>,
}

pub struct SupportedLocale {
    locale: DataLocale,
}

impl SupportedLocale {
    pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
        provider: &P, locale: DataLocale
    ) -> Result<Self> {
        let response = provider.load(DataRequest { locale: &locale, ..Default::default() })?;

        Ok(Self {
            locale: response.metadata.locale.unwrap_or(locale),
        })
    }

    pub fn is_und(&self) -> bool { ... }

    pub fn to_locid(&self) -> LanguageIdentifier { ... }

    pub fn to_locale(&self) -> Locale { ... }
}

sffc commented 1 year ago

I'm concerned about relying on response.metadata.locale because

We have a tutorial example that suggests it is okay to remove the locale from the response metadata
The resolved locale might not always be set in the metadata, such as when reading from a blob in hybrid mode

I acknowledge that my proposed solution deals with locales as strings when it feels like we should deal with them as upgraded objects. However:

The primary use case is ECMAScript supportedLocalesOf, which returns strings, not upgraded types
The other use case is detecting when a locale is not supported at all (falling back to root), in which case we can string-compare "und"
Comparing to the locale string is cheap thanks to cmp_bytes

In other words, we don't have a clear use case where we actually need the upgraded type.

That said, I am okay with caching the DataLocale from either the DataRequest or the DataResponseMetadata if it equals the locale in the datapayload as a performance optimization. But I see this as an internal change, not an architectural one.

jedel1043 commented 1 year ago

In other words, we don't have a clear use case where we actually need the upgraded type.

Um, using the upgraded type would make it easier to implement the BestFitMatcher operation from ECMA-402, since that needs to execute fallback to get the best supported locales from a list of requested locales (to_locale technically also provides the same ease of use, but that would rely on always having a valid BCP-47 string, whereas doing the parsing on load_unstable allows us to throw an error there).

Leaving that aside, if we don't want to rely on response.metadata.locale, we can just modify load_unstable:

    pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
        provider: &P,
        locale: DataLocale,
    ) -> Result<Self, DataError> {
        let (metadata, payload) = provider
            .load(DataRequest {
                locale: &locale,
                ..Default::default()
            })?
            .take_metadata_and_payload()?;
        let supported_locale = &payload.get().locale;

        match metadata.locale {
            Some(loc) if loc.strict_cmp(supported_locale).is_eq() => {
                Ok(SupportedLocale { locale: loc })
            }
            None if locale.strict_cmp(supported_locale).is_eq() => {
                Ok(SupportedLocale { locale })
            }
            _ => {
                let locale = Locale::try_from_bytes(supported_locale).map_err(|e| {
                    DataError::custom("supported locale was not a valid BCP-47 string")
                        .with_display_context(&e)
                })?;

                Ok(SupportedLocale { locale: locale.into() })
            }
        }
    }

robertbastian commented 1 year ago

Sorry I haven't had time to write out a proper solution, and I won't today. Just one thing I noticed

The resolved locale might not always be set in the metadata, such as when reading from a blob in hybrid mode

The resolved locale is not set in the metadata iff it's the request locale (or that one tutorial that we'd have to update). So doing response.metadata.locale.unwrap_or_else(|| req.locale.clone()) will always work, including with blob providers in hybrid mode.

jedel1043 commented 1 year ago

The resolved locale is not set in the metadata iff it's the request locale (or that one tutorial that we'd have to update). So doing response.metadata.locale.unwrap_or_else(|| req.locale.clone()) will always work, including with blob providers in hybrid mode.

Yeah, that was my understanding too, and I'm pretty sure we use that property on Boa to resolve locales. Maybe documenting this would be enough to ensure we can rely on the metadata.

zbraniecki commented 11 months ago

I'm currently porting fluent-rs stack to icu4x and encountered this issue in a context different from traditional ICU LocaleMatcher or ECMA-402 one.

In Fluent, we intentionally diverged from LocaleMatcher and implemented a different language negotiation heuristics as part of fluent-langneg.

The basic of heuristic tho is similar and it also requires a list of locales that the data is available for. In Fluent world, this list is fed, together with requested locales list to produce the result list that is then used (potentially after nested negotiation for complex scenarios) as a base locale list for I18n. Mozilla documentation provides an outline of the model.

For my use case, I need to be able to collect/flatten a list of locales available for a considered list of components into a union.

For a simple example, I may have list of Fluent resources available in 20 locales. I have a list of them. Now I want to learn what PluralRules locales are available, take the intersection of those and negotiate it against user requested locales.

I do this because I want to make sure I do not use Fluent resources for locale that we have no PluralRules for. This example scales to DateTImeFormat, NumberFormat and others.

So in the ideal world I do something like this:

fn bootstrap_locales(): Vec<LanguageIdentifier> {
    let fluent_locales = get_available_fluent_locales();
    let plural_locales = PluralRules::get_available_locales();
    let number_locales = NumberFormat::get_available_locales();

    let available_locales = union(&[fluent_locales, plural_locales, number_locales]);

    let negotiated = negotiate_languages(
        get_requested_locales(),
        available_locales,
        last_fallback_locale,
        NegotiationStrategy::Filtering,
    );
    return negotiated;
}

let bundle  = L10nRegistry.get_bundles(negotiated_locales, fluent_resources);

The bootstrapping happens rarely, at startup and at requested/available locales change. get_bundles may happen often and it is guaranteed to operate on locales that are available.

I understand that the complexity here is that there's no component locales really - much like with l10n resources it may depends on what resources I want. menu.ftl may be in 50 locales, but sidebar.ftl may be in 15. Cardinal keys may be in 50, Ordinal may be in 10.

So what we may want here is an ability to pass the same parameters as we would to a constructor, but instread of creating it, we'd just ask the constructor to ask DataProvider for the locales for the right set of keys and return them.

This would still require a bit of convoluted logic, like in my case I'd need to ask for locales for cardinal and ordinal and use the intersection of those. Not sure if it's worth separate API path to ask for wider selection of options that are exclusive in a constructor.

sffc commented 11 months ago

@zbraniecki The latest thinking (exact API shape yet to be decided) is that the supported locales list is based on the locales for which data was generated. So even if a locale doesn't have an ordinals key (perhaps because it inherits from root), it may be included in the supported locales list. Does that work for your use case?

Also, I don't know if returning a list is on the table. Lists are bad because they don't work well with regional variants. ECMA-402 supports returning a subset of the given input list instead.

zbraniecki commented 11 months ago

I have three reasons for which I believe we should allow for locale resolution to happen explicitly and outside of the constructor.

Let me explain them.

Starting point is an ergonomic ECMA-402 API - a high level API that performs operations implicitly:

class DateTimeFormat {
    constructor(locales, options) {
      let resolvedOptions = resolveOptions(locales, options);
      let resolvedLocale = selectLocale(locales, resolvedOptions);
      this.data = getDataForOptions(resolvedLocale, resolvedOptions);
    }

   static supportedLocalesOf(requestedLocales, options) {
      let availableLocales = getLocalesWeHaveDTFDataFor(options);
      return negotiateLocales(availableLocales, requestedLocales);
   }
}

In a lower level API like ICU4X, I believe we should externalize both operations from the constructor.

Reason 1 - externalization of options resolution requires externalization of locales resolution

With icu_preferences for ICU4X 2.0 I'm externalizing options resolution, but the input is a list of locales, so we need to select which locale we're working with to resolve the options.

Example:

let requestedLocales = &["en-US-u-hc-h12", "de-CA-u-hc-h23"];
let availableLocales = ???;
let lid = selectLID(requestedLocales, availableLocales);
let resolvedOptions = optionsBag.merge(locale);

let dtf = DateTimeFormat::new(lid, resolvedOptions);

I support MacOS model (Windows IIRC plasters the same Unicode Extensions on all requested locales, MacOS allows customer to specify UE per locale). That means that by the time we want to merge options bag with a locale, we need to know which locale we're working with.

Reason 2 - Locale resolution can be customizable.

One of the major architectural values of externalization (as a tradeoff for ergonomics) is that it allows customers to write their own options merging logic, or locale resolution. In ECMA-402 this is replaced with an option localeMatcher strategy passed to options bag, but this approach does now allow for building custom matching logic. If we want to enable customers to design their own approaches, we need to allow them to pass a resolved locale, not a requested locale.

Reason 3 - Chained negotiation

In many larger systems language negotiation is a chained operation. We're looking for the best locale that we have data for in many areas. For example, ideally, a multi-modal software may have to negotiate between:

Requested locales list
Available localization resources list
Available TTS models
Available ASR models
Available Plural Rules
Available DateTimeFormat patterns
Avialable NumberFormat patterns

In the world of monolithic software we were able to force release models that required that all locales in all of those were aligned. Which basically reduced this complexity to a single list of locals that all assets are available in negotiated against a single list that customer requested. In some cases the latter was limited to be a subset of the former, problem solved.

But this puts enormous stress on release models, forcing monolithic architectures like ICU4C where all data for all pieces is bundled and shipped and distributed and stored together.

In a more flexible model, I would like to allow customers to retrieve sufficient data to perform their own logic to select the optimal locale to use. This may mean that all of the available lists are intersected, and only that is negotiated. Or maybe some items are part of the negotiation (say, plural rules and number format), while others are not (date time format missing data should fallback down to und if needed).

But the gist is that we should not assume that ICU4X is the only part of the UI locale negotiation and allow a wider negotiation to be performed that includes ICU4X available data in its own negotiation.

The tricky piece here is that we don't always know which options will be used, so we can't fail when key is missing - my software may have checked that date time patterns are available in a negotiated locale, but if at runtime I ask for Chinese era months in the negotiated locale, and its missing, then a fallback has to happen, rather than catastrophic failure.

I think for such scenario an ideal API would be something like this:

impl DateTimeFormat {
    pub fn new(locale, options) -> Self;
    pub fn get_data_keys(options): Vec<DataKey>;
    pub fn get_available_lids_for_options(options): Vec<LanguageIdentifier>;
}

impl DataProvider {
    pub fn get_available_lids_for_keys(&[DataKey]);
}

The former would allow me to retrieve available lids for options without constructing the API, so I can do:

let available_locales = DateTimeFormat::get_available_lids_for_options(options);

let selected_locales = negotite_languages(requested_locales, available_locales);
let selected_locale = selected_locales[0];

let resolved_options = options.merge(selected_locale);
let dtf = DateTimeFormat::new(selected_locale, resoled_options);

and the latter would allow me to do:

let dtf_keys = DateTimeFormat::get_keys_for_options(dtf_options);
let pr_keys = PluralRules::get_keys_for_options(pr_options);
let nf_keys = NumberFormat::get_keys_for_options(nf_options);

let all_keys = union(&[dtf_keys, pr_keys, nf_keys]);

let icu4x_available_lids = DataProvider::get_lids_for_keys(&all_keys);
let msg_lids = L10nRegistry::get_lids_for_messages(&["menu.ftl", "errors.ftl"]);

let available_lids = intersection(&[icu4x_available_lids, msg_lids]);

let selected_locales = negotiate_languages(available_lids, requested_locales);
let selected_locale = selected_locales[0];

let dtf_resolved_options = dtf_options.merge(selected_locale);
let dtf = DateTImeFormat::new(selected_locale, dtf_selected_options);

let nf_resolved_options = nf_options.merge(selected_locale);
let nf = NumberFormat::new(selected_locale, nf_selected_options);

let messages = L10nRegistry::get_messages(selected_locale, &["menu.ftl", "errors.ftl"]);

// now Message Formatting is safe to also use NumberFormat and DateTimeFormat
// knowing that the all of them have data for that selected locale

// and if there's another formatter that was not involved in negotiation, or a
// DTF is to be created somewhere deep in message resolution, with options
// that require a key that is not present in selected_locale, then that DTF should
// gracefully fallback to the next best locale based on `selected_locale`.

sffc commented 10 months ago

A crucial question here is whether it is important to discriminante the list of supported locales based on data key. I proposed several solutions that relax this constraint but give us a solution which is smaller, faster, and easier to implement. If we need this constraint, we need to decide whether this use case is niche enough that we require people to build data in hybrid mode (resulting in larger data size) or if we want to change the data providers to support this type of query natively, such as by storing this information in a side table, and then plumbing it through a trait, which we could do in 2.0.

robertbastian commented 9 months ago

The proposals I had in mind do not work for @zbraniecki use cases. To implement something like negotiate_languages, one would need lists of locales. That said, a negotiate_languages approach seems super complicated. How are regional variants and fallback handled? If a single key doesn't support the language, should the whole page fall back to root (or select another language)?

sffc commented 9 months ago

From talking with @zbraniecki about this extensively today, here is my understanding:

In general, locales could be supported in three ways: (a) fully supported, (b) supported via an approximate fallback, and (c) not supported. An example of (b) is Basque that resolves to French or Catalan that resolves to Spanish. CLDR does not ever perform "approximate" fallback; its fallback chain is always for "fully supported." However, it provides tools and algorithms such as the LocaleMatcher that can give scores about how close two languages are to each other.

@zbraniecki agrees that there is no such thing in general as a list of supported locales and that it is valid for us to simply return whether a particular requested locale is supported or not.

It is important to reason about different components (which we can model as data keys) having different supported locales. For example, a client could add a data overlay adding support for more currencies in more locales, and we should reflect that the additional locales are supported in the currency formatter, even if they are not supported in the datetime formatter. Please note that these locales could be added at runtime, so we can't depend on a datagen-time list.

I believe we can do everything @zbraniecki needs to do when datagen is run with --fallback hybrid. However:

We still need to implement the pesky ResolveLocale AO in ECMA-402, which returns the locale to which the requested locale resolved. We agreed during the ECMA-402 discussion that it would probably not be spec-compliant to unconditionally echo back the input locale.
It would be nice to be able to answer the question "is this locale supported" without requiring --fallback hybrid.

Can you post your design within these constraints @robertbastian?

sffc commented 9 months ago

Here's a low-cost proposal that might be "close enough" to get the job done and be spec-compliant.

We add a new fallback mode called --fallback thin which is the same as --fallback hybrid except that it only retains implicit locales if they fall back to root. For example:

Locale	Requested?	In CLDR?	Include in Hybrid?	Include in Thin?
es	Y	Y	Y	Y
es-ES	N	Y	Y	N
it	N	Y	Y	Y
it-IT	N	Y	Y	N
it-RU	Y	N	Y	Y

Does that make sense? It basically means we end up with:

	Explicit Locales	CLDR Locales
Hybrid	Always	If ancestor or descendant
Hybrid-Thin	Always	Only if parent is root
Runtime-Thin	May be deduplicated	May be deduplicated, but include if parent is root
Runtime	May be deduplicated	May be deduplicated
Preresolved	Always	Ignore

(EDIT: I added Runtime-Thin as another mode.)

To answer the two questions:

To check if a requested locale is supported: run the fallback. If we find an entry before we reach und, return true. Else return false.
To load the resolved locale for the purposes of ECMA-402 ResolveLocale: run the fallback. Return the first locale in the fallback with data. Note: this returns a valid answer with either Hybrid-Thin and Runtime-Thin, but it might not return the most specific answer in Runtime-Thin.

robertbastian commented 9 months ago

I think a fallback mode is a great solution.

jedel1043 commented 9 months ago

Really liking the datagen flag solution! I'm wondering if it would still be worth to offer some trait or wrapper that prevents users from passing a data provider that doesn't support those operations thanks to its fallback type. This would entail storing a single byte for that on the provider itself.

zbraniecki commented 9 months ago

Here's an elaborate use case, let's run it through Shane's idea:

User Story: Amaya Web Browser

Amaya is a new web browser by a hip startup focused on AI generated LLM social crypto blockchain cold fusion DAO.

Amaya wants to use ICU4X for bulk of it's internationalization needs, but it also has a number of other localizable data sources, some of them data driven, others are algorithmic:

It brings CLDR driven ICU4X built for 20 locales with support for
- Plural Rules
- Number Format
It has ICU4X data for DateTimeFormat in 8 locales
It provides an override for Currencies adding 4 more locales, some of them are variants, some are completely new languages
It has MessageFormat 2 assets in 16 locales. Some of them match the ICU4X ones, some diverge
It has visual assets in 10 locales
It has ML driven TTS for 60 locales, not a perfect match
It has Timezone names in 18 locales

The browser would like to take customer requested list of locales, meaning ordered list of ICU4X Locale tags. It considers PluralRules, NumberFormat, TTS and MF2 resources to be "critical". In other words, it wants to take a union of locale availability of those three and only select the best locales from requested based on which ones those three resources are sufficiently available in. It considers DateTime, Timezone, and visual assets to be non critical, and are comfortable with those falling back to imperfect matches without regressing the main locale (Messages, Plurals, Numbers, TTS).

Variant: For Currencies, the authors may want to use better currencies even when the browser is not available in a given locale. Like, imagine customer requesting ["ar", "en"], and that Currencies are available for ar, while because PluralRules are missing, the browser is in en.

sffc commented 9 months ago

@zbraniecki It sounds like your use case boils down to the following question of ICU4X: "Does a given feature and locale have data in ICU4X?" Is that correct?

The approach I suggested above should be able to answer that question. Mechanically, attempt the data loading for your desired feature+locale; for efficiency, a DataRequestMetadata flag can be added so that this operation doesn't actually do work other than locale fallback. If you land on a locale other than the root locale, then your locale is supported. As noted previously, CLDR and ICU4X do not perform fuzzy fallback; if it finds data, then the feature is fully supported.

sffc commented 9 months ago

Really liking the datagen flag solution! I'm wondering if it would still be worth to offer some trait or wrapper that prevents users from passing a data provider that doesn't support those operations thanks to its fallback type. This would entail storing a single byte for that on the provider itself.

Yeah, it would be nice to avoid footguns in this area. The landscape is different for bake data vs buffer data. In bake data, we can generate impls based on whether or not the bundle was built with supported-locale prerequisites. In buffer data, we would either need to add a bit somewhere (perhaps in the metadata file on FsProvider or a new schema version on BlobProvider), or just tell people to be very careful when using RuntimeManual fallback mode which is the only one that I think has this problem.

Fallback Mode	Supports "Supported Locale" Queries, Bake?	Blob?
Runtime	No	N/A
RuntimeManual	No	No
Hybrid	Yes	Yes
Preresolved	Sort-of	Sort-of
Runtime-Thin	Yes	Yes
Hybrid-Thin	Yes	Yes

In Preresolved mode, locale fallback is not supposed to take place at all, but a "supported locale" query would return correct results for the locales for which the bundle was built, which is why I put "Sort-of" in those boxes.

Given that RuntimeManual is already a power-user feature, I'm kind-of okay with just documenting not to use that mode if you need supported-locale support.

sffc commented 9 months ago

In bake data, we can generate impls based on whether or not the bundle was built with supported-locale prerequisites

Wanted to dive in here a bit more. We could generate either "positive" impls (something that tells the data provider that supported-locale queries are supported) or "negative" impls (something that says we don't support such queries). I'm thinking that a negative impl might be easier, because I don't want to impose on all source providers that they set the flag. The runtime fallback baked data impl would set a field in DataResponseMetadata saying that this data does not support the supported-locale query. Whether or not to set the field could depend on the value of BuiltInFallbackMode; we would need to add another variant to that enum. Note that we would want to set the field in the baked data even for hybrid mode, which does not currently generate fallbacker code or populate DataResponseMetadata::locale.

Here is a potential field for DataResponseMetadata:

#[non_exhaustive]
pub enum FallbackerType {
    Basic,
    Deduplicated,
    NoFallback,
}

Baked data would set Basic for RuntimeThin and Deduplicated for Runtime. Supported locale queries would fail if they received a response with Deduplicated as its reported fallbacker type. It would however need to permit responses that didn't specify their fallbacker type.

jedel1043 commented 9 months ago

I think it should be enough to have a supports_querying_for_supported_locales_bikeshed: bool field inside the metadata, right? It should default to false and providers would be responsible for setting it to true if they support that functionality.

sffc commented 9 months ago

Hmm. Trying to think about the boolean condition that is easiest to explain and hardest to get wrong.

How about simply:

pub bool is_last_resort_fallback: Whether the data payload comes from a last-resort fallback. The data has well-defined behavior but it may or may not be correctly localized into the requested language.

The field defaults to false. It is set to true in the following situations:

If a LocaleFallbackProvider, as used for most buffer provider fallbacking, falls back to the und locale and und was not the requested locale
If a baked provider with runtime fallback falls back to the und locale and und was not the requested locale

With this logic, a source provider does not need any extra logic, which is great. If the source provider supports a locale, it just needs to make sure it returns non-und for that locale.

sffc commented 9 months ago

Observation: even if datagen is run with RuntimeManual, supported-locale queries work fine in positive cases; it's just that they might return false negatives. I'm okay with that caveat.

sffc commented 9 months ago

Okay let's bikeshed the datagen option. It could either be a new flag or a new fallback mode.

If a flag:

--always_include_base_language: Always includes an entry for the language even if it is identical to und. For example, en decimal symbols are equivalent to und decimal symbols; without this flag, en may be removed from the exported data if runtime fallback is enabled. This mode is required for supported-locale queries to work at runtime.

If a fallback mode:

--fallback runtime-augmented = same as --fallback runtime but always include an entry for the language even if it is identical to und.
--fallback runtime-manual-augmented = same as --fallback runtime-manual but always include an entry for the language even if it is identical to und.

Note: I don't think there needs to be any difference in locales between these two modes (which I had previously called RuntimeThin and HybridThin); it is only whether runtime fallback is enabled in baked data mode.

jedel1043 commented 9 months ago

A separate flag would avoid the potential confusion of having more types of fallback that are just used in special cases, so I would advocate for that.

sffc commented 9 months ago

If we go with a flag, maybe we can invert the polarity:

--strip_base_languages: Remove data in language or language-script if it is equal to the data in und and the language was not explicitly requested. Slightly reduces data size, but causes supported-locale queries to fail.

So then the nicer behavior is enabled by default, but we still give clients a way to achieve the minimal data size.

sffc commented 8 months ago

Current state of thinking, @sffc and @robertbastian:

A can_load or dry_run function may be wasteful given that it is likely to be followed by a plain load
We might prefer saving intermediate state or somehow having a handler that can restart the load. This could be done via an associated type on DataProvider
Not clear that this is entirely worth it. Performance depends on the ratio of failed to successful loads. If needed, use a caching provider to make the second call faster.

Some draft code:

New design: ```rust pub trait DataProvider { /// Query the provider for data, returning an error if the data is not loadable. /// /// Conceptually returns `self.load(req).map(|_| ())`, but implementations might be able to /// provide more performant implementations. fn dry_run(&self, req: DataRequest) -> Result; } pub enum LoadMeta { IntData(usize), StrData(TinyStr), ... } ``` Call site: ```rust for locale in locales { if let Ok(response_metadata) = provider.dry_run(locale.into()) { provider.load(DataRequest { locale: locale.into(), metadata: DataRequestMetadata { load_meta: response_metadata.load_meta, ...Default::default(), } }) } } ```

sffc commented 8 months ago

@zbraniecki Thoughts on the above?

It's hard to think of a case where you'd want to query the provider for "is this locale supported" without following quickly with a real load for the data, in which case we don't need to special-case a "dry run" code path.

jedel1043 commented 8 months ago

It's hard to think of a case where you'd want to query the provider for "is this locale supported" without following quickly with a real load for the data

ECMA-402 does precisely this, right? It repeatedly checks the parents of a single locale to see if it has support for the data (which could take several iterations if the locale is long enough), and moves the data fetching part of the pipeline after checking that all JS options are correct. Or are you talking about a separate thing entirely?

robertbastian commented 8 months ago

Right, that's a valid use case that's not check-then-load, so we won't be able to optimise the double lookup. I think a can_load is the best solution for this.

sffc commented 8 months ago

Another observation. DataProvider<M> calls are likely to be inlined since the function is highly generic. If the payload is not being used, the compiler has the opportunity to remove code generating it (so long as it can tell that the code has no side-effects).

I tend to think that we should just say that calling .load() to check if a locale is supported should be what we recommend, rather than putting a lot of time into what might be premature optimization.

ECMA-402 does precisely this, right? It repeatedly checks the parents of a single locale to see if it has support for the data (which could take several iterations if the locale is long enough), and moves the data fetching part of the pipeline after checking that all JS options are correct. Or are you talking about a separate thing entirely?

ECMA-402 requires taking a list of locales and checking which ones are supported. The use case of that function is for clients to pick a locale and then follow up with additional user code that might try to load data from that locale. So in a vacuum, ECMA-402 does not need the loaded values, but supportedLocalesOf is not typically used in a vacuum.

jedel1043 commented 8 months ago

ECMA-402 requires taking a list of locales and checking which ones are supported. The use case of that function is for clients to pick a locale and then follow up with additional user code that might try to load data from that locale. So in a vacuum, ECMA-402 does not need the loaded values, but supportedLocalesOf is not typically used in a vacuum.

I was more talking about things like the BestFitMatcher operation, which could make several data fetches just to resolve the preferred locale of a constructor call. However, I have to agree that this seems like a premature optimization, and users that want to optimise for data loads can just cache the previous results.

sffc commented 8 months ago

BestFitMatcher should be implemented by simply looping over the list and selecting the first one that doesn't return und.

jedel1043 commented 8 months ago

Shouldn't "BestFitMatcher" implement an "advanced" version of "LookupMatcher"? Or at least that's my expectation as an user, even if engines don't really implement that properly.

sffc commented 8 months ago

IIRC, no engine implements a more clever algorithm here.

jedel1043 commented 8 months ago

Yes, but that brings an interesting question to the table: is that because the engines don't want to differ too much from "LookupMatcher", or because there aren't "proper" APIs in ICU4C for exploratory queries such as can_load?

unicode-org / icu4x