unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Datagen options around missing locales #3746

Open sffc opened 1 year ago

sffc commented 1 year ago

tl;dr, what should we do when a user tries to export a locale from datagen that isn't in CLDR?

At first glance, it seems that we should inform the client by failing datagen. However, the situation is more nuanced. Some caveats:

  1. Not all keys support the same set of locales.
  2. Not all keys have data at the root locale (example: collator/reord@1).
  3. Some keys require either an extension (example: datetime/skeletons@1) or soon an auxiliary key.

Because of caveats 2 and 3, we cannot always simply run fallback and fill in the data based on fallback.

@Manishearth suggested tagging data keys that aren't expected to fall back to root with some extra metadata. This fixes caveats 2 and 3 but not 1.
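The tagging idea could look roughly like the sketch below. Everything in it is hypothetical and for illustration only: `RootFallback`, `Resolution`, and `check_resolution` are not ICU4X APIs, and locales are plain strings. It only shows how a per-key tag could separate an expected fall back to root from an unexpected one.

```rust
/// Hypothetical per-key tag; not the actual ICU4X key metadata type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum RootFallback {
    /// Landing on root data is a normal outcome for this key.
    Expected,
    /// The key has no usable root data (caveat 2), or needs an
    /// extension or auxiliary key (caveat 3).
    NotExpected,
}

/// Outcome of resolving one (key, locale) pair during export.
#[derive(Debug, PartialEq, Eq)]
pub enum Resolution {
    Ok,
    Warn(String),
}

/// `resolved` is the locale whose data was actually used after
/// fallback; "und" denotes the root locale.
pub fn check_resolution(
    key: &str,
    policy: RootFallback,
    requested: &str,
    resolved: &str,
) -> Resolution {
    let fell_to_root = resolved == "und" && requested != "und";
    if fell_to_root && policy == RootFallback::NotExpected {
        Resolution::Warn(format!(
            "{key}: {requested} fell back to und unexpectedly"
        ))
    } else {
        Resolution::Ok
    }
}
```

A tag like this says nothing about which locales a given key supports, which is why caveat 1 stays open.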

Related questions:

  1. Should the behavior depend on whether there was an explicit locale or whether the set of locales came from a CLDR set?
  2. Should the behavior depend on the fallback mode (i.e., should Precomputed be stricter than Hybrid)?

CC @robertbastian

sffc commented 1 year ago

Conclusion: retain 1.2 behavior for the time being; print a log statement for the error cases; revisit in 2.0

Manishearth commented 6 months ago

Proposal:

  1. Continue printing a warning in datagen when a requested language falls back to und@ro in an unexpected way
  2. Do NOT retain the base language if the language is not in CLDR

LGTM: @robertbastian @sffc

Discussion:

Conclusion: @sffc/@robertbastian to design an API for this.

sffc commented 3 months ago

The first thing we need is a clear definition of what "failed to generate" means. The complication is that all data can fall back to root, and this is the expected behavior in many cases.

I might propose the following definition: "the requested langid has no ancestors that are in the list in availableLocales.json". Unfortunately this definition only works with DatagenProvider as a data source.
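A minimal sketch of that first definition, under some loud assumptions: locales are plain strings, the langid counts as its own ancestor, and ancestors are found by dropping trailing subtags only. Real fallback also consults CLDR parent locales and likely subtags. The function names are made up for illustration, and `available` stands in for the list read from availableLocales.json.

```rust
use std::collections::HashSet;

/// Yields the langid followed by its ancestors, obtained by dropping
/// trailing subtags: "sr-Latn-BA" -> "sr-Latn" -> "sr".
/// Simplification: no parent-locale or likely-subtags data is used.
fn self_and_ancestors<'a>(langid: &'a str) -> impl Iterator<Item = &'a str> {
    let mut current = Some(langid);
    std::iter::from_fn(move || {
        let out = current?;
        // Cut at the last '-' to get the next ancestor, if any.
        current = out.rfind('-').map(|i| &out[..i]);
        Some(out)
    })
}

/// True when neither the requested langid nor any of its ancestors
/// appears in the set of available locales.
pub fn failed_to_generate(requested: &str, available: &HashSet<&str>) -> bool {
    !self_and_ancestors(requested).any(|l| available.contains(l))
}
```

Root ("und") is never produced by the iterator, so a locale that can only reach root counts as failed under this sketch.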

A cleaner definition might be to just require that source providers in DatagenDriver return non-und data for all languages they support (i.e. RetainBaseLanguages-like behavior), and we can make sure DatagenProvider does this.

Once we decide on the definition, for the API, I think DatagenDriver::export should return a struct such as

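use icu_locid::LanguageIdentifier;

/// Proposed return value of DatagenDriver::export.
#[non_exhaustive]
pub struct ExportResult {
    /// Requested locales that "failed to generate", per whichever definition we pick.
    pub missing_locales: Vec<LanguageIdentifier>,
}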

On the CLI, we just send the list through log::info!.
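For the CLI side, something along these lines might be enough. This is a dependency-free sketch: the struct is a stand-in that uses `String` instead of `LanguageIdentifier`, and `missing_locales_message` is a hypothetical helper whose output the CLI would hand to log::info!.

```rust
/// Stand-in for the proposed struct; locales are plain strings here
/// so the sketch compiles without any ICU4X crates.
#[non_exhaustive]
#[derive(Debug, Default)]
pub struct ExportResult {
    pub missing_locales: Vec<String>,
}

/// Builds the line a CLI could pass to log::info!.
/// Returns None when there is nothing to report.
pub fn missing_locales_message(result: &ExportResult) -> Option<String> {
    if result.missing_locales.is_empty() {
        return None;
    }
    Some(format!(
        "{} requested locale(s) had no data: {}",
        result.missing_locales.len(),
        result.missing_locales.join(", ")
    ))
}
```

Returning the list instead of logging inside the driver leaves library users free to fail, warn, or ignore as they see fit.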

Feedback? @robertbastian @zbraniecki

robertbastian commented 2 months ago

Unrelated to missing locales, but I have another use case for ExportResult: returning the crates that a baked exporter needs. This is currently only logged.

sffc commented 2 months ago

Proposal:

LGTM: @sffc @robertbastian