unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.37k stars 176 forks source link

Implement Datagen API LocaleExpansionMode #4629

Closed sffc closed 1 month ago

sffc commented 8 months ago

As discussed in https://github.com/unicode-org/icu4x/pull/4606#issuecomment-1959923336.

If this is done before 1.5, we can remove pub fn with_base_language_handling added in #4606.

CC @robertbastian

sffc commented 8 months ago

@robertbastian Another API proposal.

pub enum LocaleExpansionMode {
    /// Locales always removed if identical to their parent
    Deduplicated,
    /// Locales always retained if explicitly requested or supported
    Exhaustive,
    /// Locales removed if identical to their parent, but not if the parent is `und`
    Hybrid,
    /// Only explicitly requested locales are retained
    Preresolved,
}

pub enum RuntimeFallbackLocation {
    Internal,
    External,
}

impl DatagenDriver {
    pub fn with_locale_expansion_mode(...)
    pub fn with_runtime_fallback_location(...)
}

Note: "hybrid" is repurposed.

These are two separate flags with an interdependent matrix of defaults.

Explicit LEM Exporter SBIF Resolved LEM Resolved RFL
None True Hybrid Internal
None False Exhaustive External
Deduplicated True Deduplicated Internal
Deduplicated False Deduplicated External
Exhaustive True Exhaustive Internal
Exhaustive False Exhaustive External
Hybrid True Hybrid Internal
Hybrid False Hybrid External
Preresolved True Preresolved External
Preresolved False Preresolved External

LEM = LocaleExpansionMode RFL = RuntimeFallbackLocation SBIF = Supports Built-In Fallback?

I highlighted cells that might be interesting.

Obviously if there is an explicit RFL, use it (or err if not allowed by SBIF).

A nice part about this design is that the flat enums lend themselves well to CLI.

robertbastian commented 8 months ago

I think this is also ok, however it looks harder to reach in an incremental change in 1.5. I also think the defaults table adds a lot of complexity that clients will have to understand.

sffc commented 8 months ago

it looks harder to reach in an incremental change in 1.5

We can add the two new functions to DatagenDriver and deprecate the old ones. If both the old functions and new functions are present, we return an error.

sffc commented 8 months ago

@jedel1043 Thoughts on https://github.com/unicode-org/icu4x/issues/4629#issuecomment-1962219412?

sffc commented 8 months ago

Also note @robertbastian that this design permits baked data to be built with internal fallback in Preresolved and Exhaustive (formerly Hybrid) modes, which is not currently possible.

robertbastian commented 8 months ago

I'd phrase this all more constructively:

pub enum LocaleInclusionMode {
    /// All requested locales, their ancestors, and their descendants are included, unless
    /// they fall back to a locale with identical data.
    ///
    /// Use this for minimal data size.
    Deduplicated,

    /// All requested locales, their ancestors, and their descendants are included, unless
    /// they fall back to a *non-root* locale with identical data.
    ///
    /// Use this if you need to determine at runtime if a base language has been
    /// included.
    DeduplicatedVariants,

    /// All requested locales, their ancestors, and their descendants are included.
    /// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
    ///
    /// Use this mode if you want your data to work both without and with runtime fallback.
    Preresolved,

    /// Only explicitly requested locales, as well as `und`, are included. 
    /// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
    ///
    /// Use this mode if you are not doing runtime fallback and only need these exact locales
    /// to work.
    ExplicitPreresolved,
}
jedel1043 commented 8 months ago

@jedel1043 Thoughts on #4629 (comment)?

That API is also fine. Documentation wise, we could just tell users to not use LocaleExpansionMode::Deduplicated if they need to make requested locale queries, which should be easy to document.

sffc commented 8 months ago

Naming bikeshed, with some help from Gemini:

How about:

  1. Compact
  2. Baseline
  3. Adaptive
  4. Selective
sffc commented 8 months ago

As far as functionality goes, I can still see the desire to add modes that include ancestors but not descendants, although @robertbastian pointed out that they can be modeled by locale variants instead. This is not relevant to the mode currently called Preresolved but it applies to the other three.

The situation: currently en-001 implies all of the following: en, en-001, en-GB, and all other variants that inherit directly or indirectly from en-001. It may be desirable to trade coverage for data size and only include the ancestors, like en, but not the descendants, like en-GB.

sffc commented 8 months ago

How about

  1. CompactSelective
  2. CompactExhaustive
  3. BaselineSelective
  4. BaselineExhaustive
  5. RedundantSelective
  6. RedundantExhaustive
  7. Explicit

Or maybe we just go back to toggle switches

  1. Deduplicate: always, except-base, or never
  2. Relatives: all, only ancestors, or none

In tabular form,

DeduplicationMode RelativesMode Name
Always All The thing currently named Runtime
Always Only Ancestors Not currently available; might be useful
Always None Not currently available; probably not useful
Except Base All The thing proposed in #4606
Except Base Only Ancestors Not currently available; might be useful
Except Base None Not currently available; probably not useful
Never All The thing currently named Hybrid
Never Only Ancestors Not currently available; might be useful
Never None The thing currently named Preresolved
jedel1043 commented 8 months ago

Or maybe we just go back to toggle switches

  1. Deduplicate: always, except-base, or never
  2. Relatives: all, only ancestors, or none

This sounds nice, since the first alternative would make it increasingly harder to add more fallback options thanks to the exponential growth of fallback types. Though, it could also be a bit unnecessary if we don't expect to have any more fallback options.

sffc commented 7 months ago

New draft:


#[non_exhaustive]
pub enum RuntimeFallbackLocation {
    Internal,
    External,
}

#[non_exhaustive]
pub enum DeduplicationStrategy {
    Maximal,
    #[default]
    RetainBaseLanguages,
    NoDeduplication,
}

#[non_exhaustive]
enum Locales {
    WithFallback(LocalesWithFallback),
    WithoutFallback(LocalesWithoutFallback),
}

#[non_exhaustive]
pub struct LocalesWithFallback {
    pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
    pub deduplication_strategy: DeduplicationStrategy,
    /// private: set via constructor
    locales: Vec<LanguageWithExpansion>,
}

impl LocalesWithFallback {
    pub fn for_locales(impl IntoIterator<Item = LanguageWithExpansion>) -> Self;
}

#[non_exhaustive]
pub struct LocalesWithoutFallback  {
    pub locales: Vec<LanguageIdentifier>,
}

impl FromIterator<Item = LanguageIdentifier> for LocalesWithoutFallback {}
/// defaults to LocaleWithExpansion::with_variants
impl FromIterator<Item = LanguageIdentifier> for LocalesWithFallback {}
impl FromIterator<Item = LanguageWithExpansion> for LocalesWithFallback {}

#[derive(Debug, PartialEq, Clone, Hash)]
struct LanguageWithExpansion {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

/// agrees with strings
impl From<LanguageIdentifier> for LanguageWithExpansion {}

impl FromStr for LanguageExpansion {}
impl Display for LanguageExpansion

impl LanguageWithExpansion {
    // en-US
    pub fn with_variants(L) -> Self {
        Self { langid, true, true } 
    }

    // ^en-US
    pub fn without_variants(L) -> Self {
        Self { langid, true, false } 
    }

    // @en-US
    pub fn preresolved(L) -> Self {
        Self {
            langid, false, false
        }
    }
}

// resolve this against the exporter like
foo.runtime_fallback_location.unwrap_or_else(|| 
    if exporter.supports_runtime_fallback() {
        RuntimeFallbackLocation::Internal
    } else { 
        RuntimeFallbackLocation::External 
    }
)

Example call site:

driver2.set_locales(Locales::WithFallback({
    let mut x = LocalesWithFallback::for_locales([
        LanguageWithExpansion::without_variants(langid!("de")),
        langid!("en").into(),
    ]);
    x.deduplication_strategy = DeduplicationStrategy::Maximal;
    x
})).export(...);

CLI:

icu4x-datagen --keys all --locales en-US,de-AT --without-fallback
icu4x-datagen --keys all --locales ^en,de-AT --fallback-runtime-location --fallback-deduplication-strategy

# Language expansion (@, ^) requires fallback

LGTM: @sffc @robertbastian

sffc commented 7 months ago

I'd still like LanguageWithExpansion to be an enum if possible. Brainstorm:

pub enum LocaleFamily {
    Maximal(LanguageIdentifier),
    Minimal(LanguageIdentifier),
    Preresolved(LanguageIdentifier),
}
sffc commented 7 months ago

Suggestion from @robertbastian: in the output, print clearly that how the DeduplicationStrategy and RuntimeFallbackLocation are being determined. Make it a log::warn to encourage users to be explicit. Note that the CLI can always set explicit options on the API, so the messages can be different.

API:

log::warn!("Set runtime_fallback_location = {:?} due to the supplied exporter", ...runtime_fallback_location);

log::warn!("Set deduplication_strategy = {:?} due to runtime_fallback_location", ...deduplication_strategy);

CLI:

log::warn!("Set --runtime-fallback-location={} due to --format=mod", args.runtime_fallback_location);

log::warn!("Set --deduplication-strategy={} due to --runtime-fallback-location", args.deduplication_strategy);

Example CLI invocation:

icu4x-datagen --keys ... --locales ... --deduplication-strategy maximal --runtime-fallback-location internal

Error CLI invocations:

icu4x-datagen --keys ... --locales ... --without-fallback --deduplication-strategy none

LGTM: @robertbastian @sffc

sffc commented 7 months ago

Current:

pub struct LocaleWithExpansion {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

impl LocaleWithExpansion {
    /// en-US
    pub fn with_variants(langid: LanguageIdentifier) -> Self
    /// ^en-US
    pub fn without_variants(langid: LanguageIdentifier) -> Self
    /// @en-US
    pub fn preresolved(langid: LanguageIdentifier) -> Self
}

impl LocalesWithFallback {
    pub fn for_all_locales() -> Self
}

Proposed:

#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct FallbackOptions {
    /// If not set, determined by `exporter`
    pub runtime_fallback_location: Option<...>,
    /// If not set, determined by `runtime_fallback_location`
    pub deduplication_strategy: Option<...>,
}

#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct NoFallbackOptions {}

pub struct LocaleFamily {
    langid: LanguageIdentifier,
    include_ancestors: bool,
    include_descendants: bool,
}

impl LocaleFamily {
    /// en-US
    ///
    /// For example, the family ::with_descendants("en-001") contains "und", "en" (ancestors), "en-001", "en-GB", "en-ZA" (descendants)
    pub fn with_descendants(langid: LanguageIdentifier) -> Self
    /// ^en-US
    ///
    /// For example, the family ::without_descendants("en-001") contains "und", "en" (ancestors), "en-001"
    pub fn without_descendants(langid: LanguageIdentifier) -> Self
    /// @en-US
    pub fn single(langid: LanguageIdentifier) -> Self

    /// The locale family containing all locales, equivalent to `::with_descendants("und")`.
    pub fn all() -> Self
}

impl DatagenDriver {
    pub fn with_locales[_and_fallback](locales: impl FromIterator<Item = LocaleFamily>, options: FallbackOptions) -> Self;
    pub fn with_locales_no_fallback(locales: impl FromIterator<Item = LanguageIdentifier>, options: NoFallbackOptions) -> Self;
}

LGTM: @sffc @robertbastian

sffc commented 7 months ago

@robertbastian Currently, passing --locales und results in und being included in the output, but it does not imply descendants. This is incompatible with the proposal for --locales und to start meaning "all locales that are descended from und" (i.e. all locales).

A few options:

  1. Adopt the new behavior and special-case the old behavior internally somehow
  2. Make a special LocaleWithExpansion constructor for the wildcard, on the CLI represented as '*' (users would need to quote the asterisk on the CLI), and retain the current behavior for und

I lean toward 2 because I think --locales und to mean "all locales" is surprising more than it is clever.

robertbastian commented 7 months ago

Option 2 sounds good. We already have full on the CLI, so I think we should continue using that instead of *

sffc commented 7 months ago

The new API is more verbose. Old:

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales(SELECTED_LOCALES)
            .with_fallback_mode(FallbackMode::Hybrid)

New:

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales_and_fallback(
                SELECTED_LOCALES
                    .into_iter()
                    .map(LocaleFamily::with_descendants),
                {
                    let mut options = FallbackOptions::default();
                    options.deduplication_strategy = Some(DeduplicationStrategy::NoDeduplication);
                    options
                },
            )

Should we instead flatten this to

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales(SELECTED_LOCALES) // implies with_descendants
            .with_fallback()
            .with_deduplication(DeduplicationStrategy::NoDeduplication)

Full API

impl DatagenDriver {
    /// Appends these locales to the selected locale list. If fallback is enabled, implies with_descendants
    pub fn with_locales(self, impl IntoIterator<LangaugeIdentifier>) { ... }

    /// Appends these locales to the selected locale list.
    pub fn with_locale_families(self, impl IntoIterator<LocaleFamily>) { ... }

    /// Either without_fallback or with_fallback must be called.
    /// If without_fallback is used, an error will occur if any of these functions were also used:
    /// - with_locale_families
    /// - with_runtime_fallback_location
    /// - with_deduplication
    pub fn without_fallback(self) { ... }

    /// Either without_fallback or with_fallback must be called.
    pub fn with_fallback(self) { ... }

    /// Only valid in combination with with_fallback
    /// Default dependent on the exporter
    pub fn with_runtime_fallback_location(self, RuntimeFallbackLocation) { ... }

    /// Only valid in combination with with_fallback
    /// Default dependent on the runtime fallback location
    pub fn with_deduplication(self, DeduplicationStrategy) { ... }
}
robertbastian commented 7 months ago

It depends what you're doing. If we have the langid -> locale family conversion implementations, and you use default options, then the invocation will be

        DatagenDriver::new()
            .with_keys([HelloWorldV1Marker::KEY])
            .with_locales_and_fallback(SELECTED_LOCALES, Default::default())

which is actually more concise than currently. The whole idea of this design was to model the relationship of all the different flags better, which flattening does not.

sffc commented 7 months ago

How about

impl DatagenDriver {
    pub fn with_langids_and_fallback(self,
        impl IntoIterator<LangaugeIdentifier>, options: FallbackOptions) { ... }
    pub fn with_locale_families_and_fallback(self,
        impl IntoIterator<LocaleFamily>, options: FallbackOptions) { ... }
    pub fn with_langids_no_fallback(self,
        impl IntoIterator<LangaugeIdentifier>, options: NoFallbackOptions) { ... }
}
sffc commented 7 months ago

If we have the langid -> locale family conversion implementations

Can you clarify how you envision accomplishing this?

Without overloading, I can't define a function that takes "either a LanguageIdentifier iterator or a LocaleFamily iterator".

sffc commented 7 months ago

On LocaleFamily: currently we have

I suggested include_und be its own option, but @robertbastian prefers modeling it with LocaleFamily, so now we essentially need to add

I might be okay modeling this as

and just saying that you should always use base languages with this function.

robertbastian commented 7 months ago

Can you clarify how you envision accomplishing this?

I thought we had this before, but maybe it's not possible with the current design.

I don't think conciseness is the ultimate goal. Having many simple methods makes the API way more complicated, because it cannot be followed in a decision-tree fashion. With the two methods with_locales and with_locales_no_fallback, it's clear that that is the first decision, and then the types inside let you take the next decisions. If you have with_locales, with_locale_families, without_fallback, with_fallback, with_deduplication, with_runtime_fallback_location (plus all the other methods on the driver), you have no idea where to start.

Regarding LocaleFamily, I think with_descendants_without_und and without_descendants_or_und are confusing. und is not that special, and we cannot assume that there is always und, base language, regional language; some chains will be longer, and then without_ancestors is the right tool.

sffc commented 7 months ago

@robertbastian I want to propose the following narrow changes to the API in order to improve brevity... these would be impacting 2.0, not 1.5:

  1. Change with_locales_and_fallback(...) to with_locale_families(...) instead of with_locales(...)
  2. Add with_language_identifiers(...) that takes impl IntoIterator<Item = LanguageIdentifier>, FallbackOptions
  3. Un-deprecate with_all_locales(...) and make it take FallbackOptions
robertbastian commented 7 months ago

Not opposed in general, I'd like to make this decision once we have a final API. I don't want too many methods on the driver, and these don't add any functionality.

sffc commented 7 months ago

This issue is mostly done in #4710. Known follow-ups in the 1.5 timeframe:

All of these PRs can be standalone and should await @robertbastian's approval.