Closed sffc closed 1 month ago
@robertbastian Another API proposal.
pub enum LocaleExpansionMode {
/// Locales always removed if identical to their parent
Deduplicated,
/// Locales always retained if explicitly requested or supported
Exhaustive,
/// Locales removed if identical to their parent, but not if the parent is `und`
Hybrid,
/// Only explicitly requested locales are retained
Preresolved,
}
pub enum RuntimeFallbackLocation {
Internal,
External,
}
impl DatagenDriver {
pub fn with_locale_expansion_mode(...)
pub fn with_runtime_fallback_location(...)
}
Note: "hybrid" is repurposed.
These are two separate flags with an interdependent matrix of defaults.
Explicit LEM | Exporter SBIF | Resolved LEM | Resolved RFL |
---|---|---|---|
None | True | Hybrid | Internal |
None | False | Exhaustive | External |
Deduplicated | True | Deduplicated | Internal |
Deduplicated | False | Deduplicated | External |
Exhaustive | True | Exhaustive | Internal |
Exhaustive | False | Exhaustive | External |
Hybrid | True | Hybrid | Internal |
Hybrid | False | Hybrid | External |
Preresolved | True | Preresolved | External |
Preresolved | False | Preresolved | External |
LEM = LocaleExpansionMode RFL = RuntimeFallbackLocation SBIF = Supports Built-In Fallback?
I highlighted cells that might be interesting.
Obviously if there is an explicit RFL, use it (or err if not allowed by SBIF).
A nice part about this design is that the flat enums lend themselves well to CLI.
I think this is also ok, however it looks harder to reach in an incremental change in 1.5. I also think the defaults table adds a lot of complexity that clients will have to understand.
it looks harder to reach in an incremental change in 1.5
We can add the two new functions to DatagenDriver and deprecate the old ones. If both the old functions and new functions are present, we return an error.
@jedel1043 Thoughts on https://github.com/unicode-org/icu4x/issues/4629#issuecomment-1962219412?
Also note @robertbastian that this design permits baked data to be built with internal fallback in Preresolved and Exhaustive (formerly Hybrid) modes, which is not currently possible.
I'd phrase this all more constructively:
pub enum LocaleInclusionMode {
/// All requested locales, their ancestors, and their descendants are included, unless
/// they fall back to a locale with identical data.
///
/// Use this for minimal data size.
Deduplicated,
/// All requested locales, their ancestors, and their descendants are included, unless
/// they fall back to a *non-root* locale with identical data.
///
/// Use this if you need to determine at runtime if a base language has been
/// included.
DeduplicatedVariants,
/// All requested locales, their ancestors, and their descendants are included.
/// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
///
/// Use this mode if you want your data to work both without and with runtime fallback.
Preresolved,
/// Only explicitly requested locales, as well as `und`, are included.
/// Unsupported locales will use datagen-time fallback to supported locales (or `und`).
///
/// Use this mode if you are not doing runtime fallback and only need these exact locales
/// to work.
ExplicitPreresolved,
}
@jedel1043 Thoughts on #4629 (comment)?
That API is also fine. Documentation wise, we could just tell users to not use LocaleExpansionMode::Deduplicated
if they need to make requested locale queries, which should be easy to document.
Naming bikeshed, with some help from Gemini:
Deduplicated
Compact
CompactInheritance
(Emphasizes the space-saving aspect and the hierarchical nature of locale inclusion)CompactForFallback
Compressed
(Leans into the data size optimization aspect. Implies intelligent handling of locales for efficiency.)CompressedForFallback
AdaptiveMinimization
(Highlights both the space-saving aspect and the ability to intelligently adapt to supported locales.)Cascading
(Metaphor: Flow of data from broader locales to specific ones, much like water in a river system.)MinimalRedundancy
(Emphasizes both the inheritance structure and the focus on avoiding duplicate data.)SmartInheritance
CompactRedundancy
DeduplicatedWithBaseLanguages
RootAwareInheritance
CompactRootAware
BaselineInheritance
(Emphasizes the root locale as a guaranteed foundation. "Baseline" hints at a fundamental starting point.)RootSensitiveInheritance
(Very similar to "Root-Aware Inheritance", but emphasizes sensitivity to the root relationship for accuracy.)Ancestral
(Metaphor: Family tree representing locale relationships, with the root as the common origin.)RootAware
(Prioritizes a non-root base locale while emphasizing the ability to identify it.)BaseAwareInheritance
Comprehensive
(Communicates the ability to seamlessly adjust to different locale support scenarios.)ComprehensiveFallback
(Focuses on the robust fallback mechanism to ensure data is always available, even with runtime decisions)ComprehensiveInheritance
ComprehensiveWithFallback
PreresolvedWithFallback
Universal
(Puts the focus on the fail-safe mechanism. Suggests broad compatibility)ResilientFallback
(Puts the emphasis on the fallback mechanism's dependability for consistent data availability.)Adaptive
(Metaphor: Network where data is accessible through various connected nodes (locales).)Robust
(Plainly states a focus on reliable access to a wide range of locales.)RobustRedundancy
Preresolved
Explicit
ExplicitSelection
(Clearly communicates that only specifically requested locales are included, ideal for scenarios with fixed requirements)TargetedSelection
(Similar to "Explicit Selection" but slightly more active in its language.)PreciseSelection
(Similar to previous options but emphasizes the exactness of the included locales for controlled scenarios.)Focused
(Metaphor: Spotlighting only the necessary data points (locales).)Selective
How about:
As far as functionality goes, I can still see the desire to add modes that include ancestors but not descendants, although @robertbastian pointed out that they can be modeled by locale variants instead. This is not relevant to the mode currently called Preresolved but it applies to the other three.
The situation: currently en-001
implies all of the following: en
, en-001
, en-GB
, and all other variants that inherit directly or indirectly from en-001
. It may be desirable to trade coverage for data size and only include the ancestors, like en
, but not the descendants, like en-GB
.
How about
Or maybe we just go back to toggle switches
In tabular form,
DeduplicationMode | RelativesMode | Name |
---|---|---|
Always | All | The thing currently named Runtime |
Always | Only Ancestors | Not currently available; might be useful |
Always | None | Not currently available; probably not useful |
Except Base | All | The thing proposed in #4606 |
Except Base | Only Ancestors | Not currently available; might be useful |
Except Base | None | Not currently available; probably not useful |
Never | All | The thing currently named Hybrid |
Never | Only Ancestors | Not currently available; might be useful |
Never | None | The thing currently named Preresolved |
Or maybe we just go back to toggle switches
- Deduplicate: always, except-base, or never
- Relatives: all, only ancestors, or none
This sounds nice, since the first alternative would make it increasingly harder to add more fallback options thanks to the exponential growth of fallback types. Though, it could also be a bit unnecessary if we don't expect to have any more fallback options.
New draft:
#[non_exhaustive]
pub enum RuntimeFallbackLocation {
Internal,
External,
}
#[non_exhaustive]
pub enum DeduplicationStrategy {
Maximal,
#[default]
RetainBaseLanguages,
NoDeduplication,
}
#[non_exhaustive]
enum Locales {
WithFallback(LocalesWithFallback),
WithoutFallback(LocalesWithoutFallback),
}
#[non_exhaustive]
pub struct LocalesWithFallback {
pub runtime_fallback_location: Option<RuntimeFallbackLocation>,
pub deduplication_strategy: DeduplicationStrategy,
/// private: set via constructor
locales: Vec<LanguageWithExpansion>,
}
impl LocalesWithFallback {
pub fn for_locales(impl IntoIterator<Item = LanguageWithExpansion>) -> Self;
}
#[non_exhaustive]
pub struct LocalesWithoutFallback {
pub locales: Vec<LanguageIdentifier>,
}
impl FromIterator<Item = LanguageIdentifier> for LocalesWithoutFallback {}
/// defaults to LocaleWithExpansion::with_variants
impl FromIterator<Item = LanguageIdentifier> for LocalesWithFallback {}
impl FromIterator<Item = LanguageWithExpansion> for LocalesWithFallback {}
#[derive(Debug, PartialEq, Clone, Hash)]
struct LanguageWithExpansion {
langid: LanguageIdentifier,
include_ancestors: bool,
include_descendants: bool,
}
/// agrees with strings
impl From<LanguageIdentifier> for LanguageWithExpansion {}
impl FromStr for LanguageExpansion {}
impl Display for LanguageExpansion
impl LanguageWithExpansion {
// en-US
pub fn with_variants(L) -> Self {
Self { langid, true, true }
}
// ^en-US
pub fn without_variants(L) -> Self {
Self { langid, true, false }
}
// @en-US
pub fn preresolved(L) -> Self {
Self {
langid, false, false
}
}
}
// resolve this against the exporter like
foo.runtime_fallback_location.unwrap_or_else(||
if exporter.supports_runtime_fallback() {
RuntimeFallbackLocation::Internal
} else {
RuntimeFallbackLocation::External
}
)
Example call site:
driver2.set_locales(Locales::WithFallback({
let mut x = LocalesWithFallback::for_locales([
LanguageWithExpansion::without_variants(langid!("de")),
langid!("en").into(),
]);
x.deduplication_strategy = DeduplicationStrategy::Maximal;
x
})).export(...);
CLI:
icu4x-datagen --keys all --locales en-US,de-AT --without-fallback
icu4x-datagen --keys all --locales ^en,de-AT --fallback-runtime-location --fallback-deduplication-strategy
# Language expansion (@, ^) requires fallback
LGTM: @sffc @robertbastian
I'd still like LanguageWithExpansion
to be an enum if possible. Brainstorm:
pub enum LocaleFamily {
Maximal(LanguageIdentifier),
Minimal(LanguageIdentifier),
Preresolved(LanguageIdentifier),
}
Suggestion from @robertbastian: in the output, print clearly that how the DeduplicationStrategy and RuntimeFallbackLocation are being determined. Make it a log::warn
to encourage users to be explicit. Note that the CLI can always set explicit options on the API, so the messages can be different.
API:
log::warn!("Set runtime_fallback_location = {:?} due to the supplied exporter", ...runtime_fallback_location);
log::warn!("Set deduplication_strategy = {:?} due to runtime_fallback_location", ...deduplication_strategy);
CLI:
log::warn!("Set --runtime-fallback-location={} due to --format=mod", args.runtime_fallback_location);
log::warn!("Set --deduplication-strategy={} due to --runtime-fallback-location", args.deduplication_strategy);
Example CLI invocation:
icu4x-datagen --keys ... --locales ... --deduplication-strategy maximal --runtime-fallback-location internal
Error CLI invocations:
icu4x-datagen --keys ... --locales ... --without-fallback --deduplication-strategy none
LGTM: @robertbastian @sffc
Current:
pub struct LocaleWithExpansion {
langid: LanguageIdentifier,
include_ancestors: bool,
include_descendants: bool,
}
impl LocaleWithExpansion {
/// en-US
pub fn with_variants(langid: LanguageIdentifier) -> Self
/// ^en-US
pub fn without_variants(langid: LanguageIdentifier) -> Self
/// @en-US
pub fn preresolved(langid: LanguageIdentifier) -> Self
}
impl LocalesWithFallback {
pub fn for_all_locales() -> Self
}
Proposed:
#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct FallbackOptions {
/// If not set, determined by `exporter`
pub runtime_fallback_location: Option<...>,
/// If not set, determined by `runtime_fallback_location`
pub deduplication_strategy: Option<...>,
}
#[derive(Default, Copy, Clone)]
#[non_exhaustive]
pub struct NoFallbackOptions {}
pub struct LocaleFamily {
langid: LanguageIdentifier,
include_ancestors: bool,
include_descendants: bool,
}
impl LocaleFamily {
/// en-US
///
/// For example, the family ::with_descendants("en-001") contains "und", "en" (ancestors), "en-001", "en-GB", "en-ZA" (descendants)
pub fn with_descendants(langid: LanguageIdentifier) -> Self
/// ^en-US
///
/// For example, the family ::without_descendants("en-001") contains "und", "en" (ancestors), "en-001"
pub fn without_descendants(langid: LanguageIdentifier) -> Self
/// @en-US
pub fn single(langid: LanguageIdentifier) -> Self
/// The locale family containing all locales, equivalent to `::with_descendants("und")`.
pub fn all() -> Self
}
impl DatagenDriver {
pub fn with_locales[_and_fallback](locales: impl FromIterator<Item = LocaleFamily>, options: FallbackOptions) -> Self;
pub fn with_locales_no_fallback(locales: impl FromIterator<Item = LanguageIdentifier>, options: NoFallbackOptions) -> Self;
}
LGTM: @sffc @robertbastian
@robertbastian Currently, passing --locales und
results in und
being included in the output, but it does not imply descendants. This is incompatible with the proposal for --locales und
to start meaning "all locales that are descended from und" (i.e. all locales).
A few options:
LocaleWithExpansion
constructor for the wildcard, on the CLI represented as '*'
(users would need to quote the asterisk on the CLI), and retain the current behavior for und
I lean toward 2 because I think --locales und
to mean "all locales" is surprising more than it is clever.
Option 2 sounds good. We already have full
on the CLI, so I think we should continue using that instead of *
The new API is more verbose. Old:
DatagenDriver::new()
.with_keys([HelloWorldV1Marker::KEY])
.with_locales(SELECTED_LOCALES)
.with_fallback_mode(FallbackMode::Hybrid)
New:
DatagenDriver::new()
.with_keys([HelloWorldV1Marker::KEY])
.with_locales_and_fallback(
SELECTED_LOCALES
.into_iter()
.map(LocaleFamily::with_descendants),
{
let mut options = FallbackOptions::default();
options.deduplication_strategy = Some(DeduplicationStrategy::NoDeduplication);
options
},
)
Should we instead flatten this to
DatagenDriver::new()
.with_keys([HelloWorldV1Marker::KEY])
.with_locales(SELECTED_LOCALES) // implies with_descendants
.with_fallback()
.with_deduplication(DeduplicationStrategy::NoDeduplication)
Full API
impl DatagenDriver {
/// Appends these locales to the selected locale list. If fallback is enabled, implies with_descendants
pub fn with_locales(self, impl IntoIterator<LangaugeIdentifier>) { ... }
/// Appends these locales to the selected locale list.
pub fn with_locale_families(self, impl IntoIterator<LocaleFamily>) { ... }
/// Either without_fallback or with_fallback must be called.
/// If without_fallback is used, an error will occur if any of these functions were also used:
/// - with_locale_families
/// - with_runtime_fallback_location
/// - with_deduplication
pub fn without_fallback(self) { ... }
/// Either without_fallback or with_fallback must be called.
pub fn with_fallback(self) { ... }
/// Only valid in combination with with_fallback
/// Default dependent on the exporter
pub fn with_runtime_fallback_location(self, RuntimeFallbackLocation) { ... }
/// Only valid in combination with with_fallback
/// Default dependent on the runtime fallback location
pub fn with_deduplication(self, DeduplicationStrategy) { ... }
}
It depends what you're doing. If we have the langid -> locale family conversion implementations, and you use default options, then the invocation will be
DatagenDriver::new()
.with_keys([HelloWorldV1Marker::KEY])
.with_locales_and_fallback(SELECTED_LOCALES, Default::default())
which is actually more concise than currently. The whole idea of this design was to model the relationship of all the different flags better, which flattening does not.
How about
impl DatagenDriver {
pub fn with_langids_and_fallback(self,
impl IntoIterator<LangaugeIdentifier>, options: FallbackOptions) { ... }
pub fn with_locale_families_and_fallback(self,
impl IntoIterator<LocaleFamily>, options: FallbackOptions) { ... }
pub fn with_langids_no_fallback(self,
impl IntoIterator<LangaugeIdentifier>, options: NoFallbackOptions) { ... }
}
If we have the langid -> locale family conversion implementations
Can you clarify how you envision accomplishing this?
Without overloading, I can't define a function that takes "either a LanguageIdentifier iterator or a LocaleFamily iterator".
On LocaleFamily: currently we have
with_descendants(en-001)
= und, en, en-001, en-GBwithout_descendants(en-001)
= und, en, en-001single(en-001)
= en-001I suggested include_und
be its own option, but @robertbastian prefers modeling it with LocaleFamily, so now we essentially need to add
with_descendants_without_und(en-001)
= en, en-001, en-GBwithout_descendants_or_und(en-001)
= en, en-001I might be okay modeling this as
without_ancestors(en-001)
= en-001, en-GBand just saying that you should always use base languages with this function.
Can you clarify how you envision accomplishing this?
I thought we had this before, but maybe it's not possible with the current design.
I don't think conciseness is the ultimate goal. Having many simple methods makes the API way more complicated, because it cannot be followed in a decision-tree fashion. With the two methods with_locales
and with_locales_no_fallback
, it's clear that that is the first decision, and then the types inside let you take the next decisions. If you have with_locales
, with_locale_families
, without_fallback
, with_fallback
, with_deduplication
, with_runtime_fallback_location
(plus all the other methods on the driver), you have no idea where to start.
Regarding LocaleFamily
, I think with_descendants_without_und
and without_descendants_or_und
are confusing. und
is not that special, and we cannot assume that there is always und
, base language, regional language; some chains will be longer, and then without_ancestors
is the right tool.
@robertbastian I want to propose the following narrow changes to the API in order to improve brevity... these would be impacting 2.0, not 1.5:
with_locales_and_fallback(...)
to with_locale_families(...)
instead of with_locales(...)
with_language_identifiers(...)
that takes impl IntoIterator<Item = LanguageIdentifier>, FallbackOptions
with_all_locales(...)
and make it take FallbackOptions
Not opposed in general, I'd like to make this decision once we have a final API. I don't want too many methods on the driver, and these don't add any functionality.
This issue is mostly done in #4710. Known follow-ups in the 1.5 timeframe:
und
via LocaleFamilywith_language_identifiers(impl IntoIterator, FallbackOptions)
LocaleFamily::full()
a const itemLanguageIdentifiersOrLocaleFamilies
or LocaleInclude
in the internal intermediate CLI config codeAll of these PRs can be standalone and should await @robertbastian's approval.
As discussed in https://github.com/unicode-org/icu4x/pull/4606#issuecomment-1959923336.
If this is done before 1.5, we can remove
pub fn with_base_language_handling
added in #4606.CC @robertbastian