sffc opened 4 years ago
Putting this on the Version 1 backlog. We still need help on this.
Here's my current thinking.
There are three levels of locale support in a data provider:
A particular provider must store metadata listing the resolved locales, which increases the complexity. I don't see a way around this. In the fs provider, the list is stored in the manifest.json file, and in the blob provider, it is stored alongside the big zeromap.
The list could be exposed through the following trait:
```rust
pub trait SupportedLocales {
    /// Returns resolved locales, or MissingResourceKey if the key is not supported
    fn supported_locales_for_key(&self, key: ResourceKey) -> Result<Vec<ResourceOptions>, DataError>;
}
```
In `ForkByKeyProvider`, the implementation works the same as `load_payload`: we loop over the providers until finding one that doesn't return MissingResourceKey.

To answer the ECMA-402 question "what are the supported locales for NumberFormat", you use this API with the key `decimal/symbols@1`. When we add more versions of these data keys, we should likely loop over all versions of the key that our code supports and take the union.
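As a hedged sketch of that fork behavior (assuming the trait above; the two-provider shape and the `is_missing_resource_key()` helper are illustrative, not the real `DataError` API):

```rust
// Hypothetical two-provider fork; illustrative only.
pub struct ForkByKeyProvider<P0, P1>(pub P0, pub P1);

impl<P0: SupportedLocales, P1: SupportedLocales> SupportedLocales for ForkByKeyProvider<P0, P1> {
    fn supported_locales_for_key(
        &self,
        key: ResourceKey,
    ) -> Result<Vec<ResourceOptions>, DataError> {
        // Mirror load_payload: fall through to the next provider only on
        // MissingResourceKey; propagate any other result.
        match self.0.supported_locales_for_key(key) {
            Err(e) if e.is_missing_resource_key() => self.1.supported_locales_for_key(key),
            other => other,
        }
    }
}
```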
Another mode that aligns more closely with ECMA-402 is:
```rust
pub trait SupportedLocales {
    /// Returns the subset of the given locales that are fallbackable locales.
    fn supported_locales_of(&self, key: ResourceKey, locales: Vec<ResourceOptions>) -> Result<Vec<ResourceOptions>, DataError>;
}
```
ForkByKeyProvider would again look for the first provider that supports the key, and then return its result.
An issue here is that vertical fallback would need to be invoked. Therefore, rather than having this API, it may be better and cheaper to just run the full vertical fallback stack, but stop short of deserializing/downcasting.
I propose considering what I proposed above as the course of action, and removing this from the 1.0 critical path.
I've been implementing the Intl spec for the past couple of months, and I can give a bit of my perspective on this issue.
From what I could see while implementing the locale resolution algorithms defined in the ECMA402 spec, what the API seems to try to accomplish is to determine if a specific locale will return "correct" results if used as the locale of a specific service. Everything else is just taking that and extending it to several different APIs that filter/choose/tune a list of user-provided locales to ensure that all locales passed to the services are always "valid" in a sense.
Maybe this means that the providers don't explicitly need a `SupportedLocales` feature, but rather a way to pass them a locale and a key/service to know if that key/service using that locale will return "correct" results.
Collator is special in the sense that, in the absence of natural-language output, `und` is more applicable as a fallback than for services that involve natural-language output. Currently, if you request e.g. `en`, the collator in ICU4X falls back to `und`, which is correct in terms of comparison behavior, but the existing Web-exposed behavior of `Intl.Collator` is that languages for which the root collation is known to be valid (without reordering), such as `en`, `fr`, etc., are supposed to behave in the outward API as if language-specific data existed for them. Furthermore, in Firefox and Chrome, 1) `und` is treated as unsupported by `supportedLocalesOf` and 2) locales with actually-unsupported subtags are treated as supported by `supportedLocalesOf` if the language counts as supported.

```js
console.log(Intl.Collator.supportedLocalesOf(['und', 'und-u-co-emoji', 'ban', 'id-u-co-pinyin', 'de-ID', "en", "fr", "el"]));
```

logs the array `[ "id-u-co-pinyin", "de-ID", "en", "fr", "el" ]` in both Firefox and Chrome.

(In Safari, the array is prepended with `"en-US-u-va-posix"`, which is just weird. Safari turns `"und"` (but not `"und-u-co-emoji"`) into that, even though in CLDR the POSIX variant of English is not the same as the root. Weird.)

Boa currently logs the whole input array, because of `// TODO: ugly hack to accept locales that fallback to "und" in the collator/segmenter services`.
It's not clear to me how "Resolved Locales" above should capture the "root is known to be valid" concept. As I understand it, we don't currently store this information. To the extent the provider infrastructure supports aliases, I guess one possibility would be to list `en`, `fr`, etc. as aliases of `und`.
Based on feedback from @anba on https://github.com/tc39/ecma402/issues/830, one approach which would solve this problem fairly cleanly would be for datagen to record which locales were used when generating data, and then use that set of locales across components as the `availableLocales` in ECMA-402. With this model, it doesn't matter if individual data keys resolve to different locales. So long as the locale was included at datagen time, we know that it resolves to valid data.
Here are a few forms this solution could take:
- Add a new trait or data key that retrieves the list of locales. Pro: easy to implement. Con: unclear behavior when multiple source data providers are present.
Combining multiple providers needs to be solved for all three cases, and this is probably the cleanest for both clients and us. The only trip-up would be `ForkByKeyProvider`, which we can specialise to return empty and log a warning. The bigger issue here, I think, is where this data struct would be defined, and where compiled data for it would be included. It would probably have to be `icu_provider`, which has been data-free so far.
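For illustration, here is a minimal sketch of how a datagen-recorded locale list could back an ECMA-402-style lookup, assuming the list is exposed as a set of locale strings (the representation and names are assumptions, not an actual API):

```rust
use std::collections::HashSet;

// Simplified ECMA-402 LookupMatcher over a recorded locale set; ignores
// extension and private-use subtags for brevity.
fn lookup_supported(requested: &str, available: &HashSet<&str>) -> Option<String> {
    let mut candidate = requested.to_string();
    loop {
        if available.contains(candidate.as_str()) {
            return Some(candidate);
        }
        // Strip the last subtag: "fr-CA" -> "fr"; give up after the language.
        match candidate.rfind('-') {
            Some(idx) => candidate.truncate(idx),
            None => return None,
        }
    }
}
```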
I understand how recording the datagen-time locales would allow for a) implementing "lookup" and b) filtering a list of OS-provided preferred languages to pick the most-preferred system language that is also an ICU4X available locale (for the purpose of computing an ECMA-402-compatible notion of host locale).
It's not immediately clear to me how storing this list would allow the resolved locale concept of ECMA-402 to be implemented in the case where "best fit" means delegation to ICU4X's own matching of request to available data.
AFAICT, implementing "best fit" on top of ICU4X requires being able to figure out:

- whether there was a fallback all the way to root such that root isn't known-valid for the requested locale (so that the ECMA-402 glue code can proceed to trying the next locale on the list of requested locales, or the default locale if the last locale on the request list fell back all the way to root without the root being known to be valid for the request), and
- if ICU4X didn't fall back all the way to root, or fell back to root such that the root is known to be valid for the request, which locale would be the ECMA-402 resolved locale.
How would these be implemented given this list?
(Examples worth considering: For the collator, requesting `fr` or `de` is known to be equivalent to requesting `und`. However, requesting `fr-CA` or `de-u-co-phonebk` is not equivalent to requesting `und`. If the requested locale has `fr-CA` plus some other subtags, or `de-u-co-phonebk` with some other subtags, a) is it possible for the ICU4X fallback to end up ignoring `-CA` or `-u-co-phonebk`, ending up with the `und` equivalent, and, more generally, b) how should the ECMA-402 glue code figure out the resolved locale if the request had extra subtags and the datagen-time recording of processed locales contains `fr-CA` and `de-u-co-phonebk`?)
I think this would require the hybrid data mode that Shane mentioned in the ECMA issue. `fr` and `de` would have to be explicitly included as keys that point to `und` data, whereas `abc` would not be a key and would fall back to `und` data, which is a difference we can detect.
Discuss with:
Another possible solution:
`supported@1/en-GB`, which contains a single string `"en-GB"` so it doesn't get deduplicated.
An advantage of (4) is that you can query the data provider and get back the resolved supported locale. For example, you can request "it-JP" and get back "it" (if there was no CLDR data for "it-JP"), or you can request a non-Basic locale such as "arc" and get back "und", which means that the locale is not "supported".
What crate should this go into? I lean toward putting it with the rest of the fallback-type code in `icu_locid_transform`.
`locid` makes sense. We could also have a custom provider that stores all loaded locales, which you can mutate at runtime and query for stuff like this, to keep track of what you have and haven't loaded.
This key solution also makes it easier to reason about data generated in fallback mode.
The potential footgun is of course if you generate data for different sets of locales for different keys. Should be fine to just clearly document.
How does this look?
```rust
#[icu_provider::data_struct(SupportedLocaleV1Marker)]
pub struct SupportedLocaleV1<'data> {
    pub locale: Cow<'data, [u8]>,
}

pub struct SupportedLocale {
    data: DataPayload<SupportedLocaleV1Marker>,
}

impl SupportedLocale {
    pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
        provider: &P,
        locale: &DataLocale,
    ) -> Result<Self> { ... }

    pub fn is_und(&self) -> bool { ... }
    pub fn to_locid(&self) -> LanguageIdentifier { ... }
    pub fn to_locale(&self) -> Locale { ... }
}
```
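For illustration, usage might look roughly like this (a sketch assuming data generated with the proposed supported-locale key; `provider` and the exact conversions are placeholders):

```rust
// Request "it-JP": if there was no CLDR data for it-JP, this resolves to "it".
let supported = SupportedLocale::load_unstable(&provider, &locale!("it-JP").into())?;
if supported.is_und() {
    // "it-JP" is not supported at all.
} else {
    assert_eq!(supported.to_locale(), locale!("it"));
}
```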
sgtm
@robertbastian to make a counter-proposal.
Another option would be to directly store the locale, and just use the marker to load the supported locale:
```rust
#[icu_provider::data_struct(SupportedLocaleV1Marker)]
pub struct SupportedLocaleV1<'data> {
    pub locale: Cow<'data, [u8]>,
}

pub struct SupportedLocale {
    locale: DataLocale,
}

impl SupportedLocale {
    pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
        provider: &P,
        locale: DataLocale,
    ) -> Result<Self> {
        let response = provider.load(DataRequest { locale: &locale, ..Default::default() })?;
        Ok(Self {
            locale: response.metadata.locale.unwrap_or(locale),
        })
    }

    pub fn is_und(&self) -> bool { ... }
    pub fn to_locid(&self) -> LanguageIdentifier { ... }
    pub fn to_locale(&self) -> Locale { ... }
}
```
I'm concerned about relying on `response.metadata.locale` because:
I acknowledge that my proposed solution deals with locales as strings when it feels like we should deal with them as upgraded objects. However:

- `cmp_bytes`

In other words, we don't have a clear use case where we actually need the upgraded type.
That said, I am okay with caching the DataLocale from either the DataRequest or the DataResponseMetadata if it equals the locale in the datapayload as a performance optimization. But I see this as an internal change, not an architectural one.
> In other words, we don't have a clear use case where we actually need the upgraded type.
Um, using the upgraded type would make it easier to implement the `BestFitMatcher` operation from ECMA-402, since that needs to execute fallback to get the best supported locales from a list of requested locales (`to_locale` technically also provides the same ease of use, but that would rely on always having a valid BCP-47 string, whereas doing the parsing in `load_unstable` allows us to throw an error there).
Leaving that aside, if we don't want to rely on `response.metadata.locale`, we can just modify `load_unstable`:
```rust
pub fn load_unstable<P: DataProvider<SupportedLocaleV1Marker>>(
    provider: &P,
    locale: DataLocale,
) -> Result<Self, DataError> {
    let (metadata, payload) = provider
        .load(DataRequest {
            locale: &locale,
            ..Default::default()
        })?
        .take_metadata_and_payload()?;
    let supported_locale = &payload.get().locale;
    match metadata.locale {
        Some(loc) if loc.strict_cmp(supported_locale).is_eq() => {
            Ok(SupportedLocale { locale: loc })
        }
        None if locale.strict_cmp(supported_locale).is_eq() => {
            Ok(SupportedLocale { locale })
        }
        _ => {
            let locale = Locale::try_from_bytes(supported_locale).map_err(|e| {
                DataError::custom("supported locale was not a valid BCP-47 string")
                    .with_display_context(&e)
            })?;
            Ok(SupportedLocale { locale: locale.into() })
        }
    }
}
```
Sorry I haven't had time to write out a proper solution, and I won't today. Just one thing I noticed:
The resolved locale might not always be set in the metadata, such as when reading from a blob in hybrid mode
The resolved locale is not set in the metadata iff it's the request locale (or that one tutorial that we'd have to update). So doing `response.metadata.locale.unwrap_or_else(|| req.locale.clone())` will always work, including with blob providers in hybrid mode.
> The resolved locale is not set in the metadata iff it's the request locale (or that one tutorial that we'd have to update). So doing `response.metadata.locale.unwrap_or_else(|| req.locale.clone())` will always work, including with blob providers in hybrid mode.
Yeah, that was my understanding too, and I'm pretty sure we use that property on Boa to resolve locales. Maybe documenting this would be enough to ensure we can rely on the metadata.
I'm currently porting the fluent-rs stack to ICU4X and encountered this issue in a context different from the traditional ICU `LocaleMatcher` or ECMA-402 ones.

In Fluent, we intentionally diverged from LocaleMatcher and implemented different language negotiation heuristics as part of `fluent-langneg`.

The basic heuristic, though, is similar, and it also requires a list of locales that the data is available for. In the Fluent world, this list is fed, together with the requested locales list, to produce the result list that is then used (potentially after nested negotiation for complex scenarios) as the base locale list for i18n. Mozilla documentation provides an outline of the model.
For my use case, I need to be able to collect/flatten a list of locales available for a considered list of components into a union.
For a simple example, I may have a list of Fluent resources available in 20 locales. Now I want to learn which PluralRules locales are available, take the intersection of those, and negotiate it against the user-requested locales. I do this because I want to make sure I do not use Fluent resources for a locale that we have no PluralRules for. This example scales to DateTimeFormat, NumberFormat, and others.
So in the ideal world I do something like this:
```rust
fn bootstrap_locales() -> Vec<LanguageIdentifier> {
    let fluent_locales = get_available_fluent_locales();
    let plural_locales = PluralRules::get_available_locales();
    let number_locales = NumberFormat::get_available_locales();
    let available_locales = union(&[fluent_locales, plural_locales, number_locales]);
    let negotiated = negotiate_languages(
        get_requested_locales(),
        available_locales,
        last_fallback_locale,
        NegotiationStrategy::Filtering,
    );
    return negotiated;
}

let bundle = L10nRegistry.get_bundles(negotiated_locales, fluent_resources);
```
The bootstrapping happens rarely: at startup, and when the requested/available locales change. `get_bundles` may happen often, and it is guaranteed to operate on locales that are available.
I understand that the complexity here is that there are no component locales, really; much like with l10n resources, it may depend on what resources I want. `menu.ftl` may be in 50 locales, but `sidebar.ftl` may be in 15. Cardinal keys may be in 50, Ordinal may be in 10.
So what we may want here is the ability to pass the same parameters as we would to a constructor, but instead of creating it, we'd just ask the constructor to ask the DataProvider for the locales for the right set of keys and return them. This would still require a bit of convoluted logic; in my case, I'd need to ask for locales for cardinal and ordinal and use the intersection of those. Not sure if it's worth a separate API path to ask for a wider selection of options that are exclusive in a constructor.
@zbraniecki The latest thinking (exact API shape yet to be decided) is that the supported locales list is based on the locales for which data was generated. So even if a locale doesn't have an ordinals key (perhaps because it inherits from root), it may be included in the supported locales list. Does that work for your use case?
Also, I don't know if returning a list is on the table. Lists are bad because they don't work well with regional variants. ECMA-402 supports returning a subset of the given input list instead.
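A sketch of that subset-returning shape, with `is_supported` standing in for whichever query we settle on:

```rust
// ECMA-402-style supportedLocalesOf: return the supported subset of the
// input in the caller's order, rather than a flat list of available locales.
fn supported_locales_of(requested: &[Locale], is_supported: impl Fn(&Locale) -> bool) -> Vec<Locale> {
    requested.iter().filter(|&l| is_supported(l)).cloned().collect()
}
```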
I have three reasons for which I believe we should allow for locale resolution to happen explicitly and outside of the constructor.
Let me explain them.
The starting point is an ergonomic ECMA-402 API, a high-level API that performs operations implicitly:
```js
class DateTimeFormat {
    constructor(locales, options) {
        let resolvedOptions = resolveOptions(locales, options);
        let resolvedLocale = selectLocale(locales, resolvedOptions);
        this.data = getDataForOptions(resolvedLocale, resolvedOptions);
    }

    static supportedLocalesOf(requestedLocales, options) {
        let availableLocales = getLocalesWeHaveDTFDataFor(options);
        return negotiateLocales(availableLocales, requestedLocales);
    }
}
```
In a lower level API like ICU4X, I believe we should externalize both operations from the constructor.
With `icu_preferences` for ICU4X 2.0, I'm externalizing options resolution, but the input is a list of locales, so we need to select which locale we're working with to resolve the options.
Example:
```rust
let requestedLocales = &["en-US-u-hc-h12", "de-CA-u-hc-h23"];
let availableLocales = ???;
let lid = selectLID(requestedLocales, availableLocales);
let resolvedOptions = optionsBag.merge(lid);
let dtf = DateTimeFormat::new(lid, resolvedOptions);
```
I support the macOS model (Windows, IIRC, plasters the same Unicode extensions on all requested locales; macOS allows the customer to specify UE per locale). That means that by the time we want to merge the options bag with a locale, we need to know which locale we're working with.
One of the major architectural values of externalization (as a tradeoff for ergonomics) is that it allows customers to write their own options merging logic, or locale resolution.
In ECMA-402 this is replaced with a `localeMatcher` strategy option passed in the options bag, but this approach does not allow for building custom matching logic. If we want to enable customers to design their own approaches, we need to allow them to pass a resolved locale, not a requested locale.
In many larger systems language negotiation is a chained operation. We're looking for the best locale that we have data for in many areas. For example, ideally, a multi-modal software may have to negotiate between:
In the world of monolithic software, we were able to force release models that required that all locales in all of those were aligned. This basically reduced the complexity to a single list of locales that all assets are available in, negotiated against a single list that the customer requested. In some cases the latter was limited to be a subset of the former; problem solved.
But this puts enormous stress on release models, forcing monolithic architectures like ICU4C where all data for all pieces is bundled and shipped and distributed and stored together.
In a more flexible model, I would like to allow customers to retrieve sufficient data to perform their own logic to select the optimal locale to use.
This may mean that all of the `available` lists are intersected, and only that is negotiated. Or maybe some items are part of the negotiation (say, plural rules and number format), while others are not (missing date time format data should fall back down to `und` if needed).
But the gist is that we should not assume that ICU4X is the only part of the UI locale negotiation and allow a wider negotiation to be performed that includes ICU4X available data in its own negotiation.
The tricky piece here is that we don't always know which options will be used, so we can't fail when a key is missing: my software may have checked that date time patterns are available in a negotiated locale, but if at runtime I ask for Chinese era months in the negotiated locale and it's missing, then a fallback has to happen, rather than a catastrophic failure.
I think for such scenario an ideal API would be something like this:
```rust
impl DateTimeFormat {
    pub fn new(locale, options) -> Self;
    pub fn get_data_keys(options) -> Vec<DataKey>;
    pub fn get_available_lids_for_options(options) -> Vec<LanguageIdentifier>;
}

impl DataProvider {
    pub fn get_available_lids_for_keys(&[DataKey]);
}
```
The former would allow me to retrieve available lids for options without constructing the API, so I can do:
```rust
let available_locales = DateTimeFormat::get_available_lids_for_options(options);
let selected_locales = negotiate_languages(requested_locales, available_locales);
let selected_locale = selected_locales[0];
let resolved_options = options.merge(selected_locale);
let dtf = DateTimeFormat::new(selected_locale, resolved_options);
```
and the latter would allow me to do:
```rust
let dtf_keys = DateTimeFormat::get_keys_for_options(dtf_options);
let pr_keys = PluralRules::get_keys_for_options(pr_options);
let nf_keys = NumberFormat::get_keys_for_options(nf_options);
let all_keys = union(&[dtf_keys, pr_keys, nf_keys]);

let icu4x_available_lids = DataProvider::get_lids_for_keys(&all_keys);
let msg_lids = L10nRegistry::get_lids_for_messages(&["menu.ftl", "errors.ftl"]);
let available_lids = intersection(&[icu4x_available_lids, msg_lids]);

let selected_locales = negotiate_languages(available_lids, requested_locales);
let selected_locale = selected_locales[0];

let dtf_resolved_options = dtf_options.merge(selected_locale);
let dtf = DateTimeFormat::new(selected_locale, dtf_resolved_options);

let nf_resolved_options = nf_options.merge(selected_locale);
let nf = NumberFormat::new(selected_locale, nf_resolved_options);

let messages = L10nRegistry::get_messages(selected_locale, &["menu.ftl", "errors.ftl"]);

// Now message formatting is safe to also use NumberFormat and DateTimeFormat,
// knowing that all of them have data for the selected locale.
// If there's another formatter that was not involved in negotiation, or a
// DTF is to be created somewhere deep in message resolution with options
// that require a key that is not present in selected_locale, then that DTF
// should gracefully fall back to the next best locale based on `selected_locale`.
```
A crucial question here is whether it is important to discriminate the list of supported locales based on data key. I proposed several solutions that relax this constraint but give us a solution which is smaller, faster, and easier to implement. If we need this constraint, we need to decide whether this use case is niche enough that we require people to build data in hybrid mode (resulting in larger data size), or whether we want to change the data providers to support this type of query natively, such as by storing this information in a side table and then plumbing it through a trait, which we could do in 2.0.
The proposals I had in mind do not work for @zbraniecki's use cases. To implement something like `negotiate_languages`, one would need lists of locales. That said, a `negotiate_languages` approach seems super complicated. How are regional variants and fallback handled? If a single key doesn't support the language, should the whole page fall back to root (or select another language)?
From talking with @zbraniecki about this extensively today, here is my understanding:
In general, locales could be supported in three ways: (a) fully supported, (b) supported via an approximate fallback, and (c) not supported. An example of (b) is Basque that resolves to French or Catalan that resolves to Spanish. CLDR does not ever perform "approximate" fallback; its fallback chain is always for "fully supported." However, it provides tools and algorithms such as the LocaleMatcher that can give scores about how close two languages are to each other.
@zbraniecki agrees that there is no such thing in general as a list of supported locales and that it is valid for us to simply return whether a particular requested locale is supported or not.
It is important to reason about different components (which we can model as data keys) having different supported locales. For example, a client could add a data overlay adding support for more currencies in more locales, and we should reflect that the additional locales are supported in the currency formatter, even if they are not supported in the datetime formatter. Please note that these locales could be added at runtime, so we can't depend on a datagen-time list.
I believe we can do everything @zbraniecki needs to do when datagen is run with `--fallback hybrid`. However:

- the `ResolveLocale` AO in ECMA-402, which returns the locale to which the requested locale resolved. We agreed during the ECMA-402 discussion that it would probably not be spec-compliant to unconditionally echo back the input locale.
- `--fallback hybrid`.

Can you post your design within these constraints, @robertbastian?
Here's a low-cost proposal that might be "close enough" to get the job done and be spec-compliant.
We add a new fallback mode called `--fallback thin`, which is the same as `--fallback hybrid` except that it only retains implicit locales if they fall back to root. For example:
| Locale | Requested? | In CLDR? | Include in Hybrid? | Include in Thin? |
|---|---|---|---|---|
| es | Y | Y | Y | Y |
| es-ES | N | Y | Y | N |
| it | N | Y | Y | Y |
| it-IT | N | Y | Y | N |
| it-RU | Y | N | Y | Y |
Does that make sense? It basically means we end up with:
| | Explicit Locales | CLDR Locales |
|---|---|---|
| Hybrid | Always | If ancestor or descendant |
| Hybrid-Thin | Always | Only if parent is root |
| Runtime-Thin | May be deduplicated | May be deduplicated, but include if parent is root |
| Runtime | May be deduplicated | May be deduplicated |
| Preresolved | Always | Ignore |
(EDIT: I added Runtime-Thin as another mode.)
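The retention rule behind the first table can be summarized as a tiny predicate (a sketch; real datagen has more inputs than two booleans):

```rust
// --fallback thin: keep a locale iff it was explicitly requested, or it is
// an implicit CLDR locale whose fallback parent is root (und).
fn include_in_thin(explicitly_requested: bool, parent_is_root: bool) -> bool {
    explicitly_requested || parent_is_root
}
```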
To answer the two questions:

- …`und`, return true. Else return false.

I think a fallback mode is a great solution.
Really liking the datagen flag solution! I'm wondering if it would still be worthwhile to offer some trait or wrapper that prevents users from passing a data provider that, due to its fallback type, doesn't support these operations. This would entail storing a single byte for that on the provider itself.
Here's an elaborate use case, let's run it through Shane's idea:
Amaya is a new web browser by a hip startup focused on AI generated LLM social crypto blockchain cold fusion DAO.
Amaya wants to use ICU4X for the bulk of its internationalization needs, but it also has a number of other localizable data sources; some of them are data-driven, others are algorithmic:
The browser would like to take the customer's requested list of locales, meaning an ordered list of ICU4X Locale tags. It considers PluralRules, NumberFormat, TTS, and MF2 resources to be "critical". In other words, it wants to take a union of the locale availability of those resources and only select the best locales from the requested list based on which ones those resources are sufficiently available in. It considers DateTime, Timezone, and visual assets to be non-critical, and is comfortable with those falling back to imperfect matches without regressing the main locale (Messages, Plurals, Numbers, TTS).
Variant: For Currencies, the authors may want to use better currencies even when the browser is not available in a given locale. Like, imagine the customer requesting `["ar", "en"]`, and that Currencies are available for `ar`, while, because PluralRules are missing, the browser is in `en`.
@zbraniecki It sounds like your use case boils down to the following question of ICU4X: "Does a given feature and locale have data in ICU4X?" Is that correct?
The approach I suggested above should be able to answer that question. Mechanically, attempt the data loading for your desired feature+locale; for efficiency, a DataRequestMetadata flag can be added so that this operation doesn't actually do work other than locale fallback. If you land on a locale other than the root locale, then your locale is supported. As noted previously, CLDR and ICU4X do not perform fuzzy fallback; if it finds data, then the feature is fully supported.
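A sketch of that mechanism, where `dry_run_bikeshed` is the hypothetical DataRequestMetadata flag being proposed (not an existing field):

```rust
// Run locale fallback only; skip deserialization work.
let mut metadata = DataRequestMetadata::default();
metadata.dry_run_bikeshed = true; // hypothetical flag
let response = provider.load(DataRequest { locale: &locale, metadata })?;
// Supported iff fallback landed somewhere other than the root locale.
let is_supported = !response.metadata.locale.unwrap_or(locale).is_und();
```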
> Really liking the datagen flag solution! I'm wondering if it would still be worthwhile to offer some trait or wrapper that prevents users from passing a data provider that, due to its fallback type, doesn't support these operations. This would entail storing a single byte for that on the provider itself.
Yeah, it would be nice to avoid footguns in this area. The landscape is different for bake data vs buffer data. In bake data, we can generate impls based on whether or not the bundle was built with supported-locale prerequisites. In buffer data, we would either need to add a bit somewhere (perhaps in the metadata file on FsProvider or a new schema version on BlobProvider), or just tell people to be very careful when using the `RuntimeManual` fallback mode, which is the only one that I think has this problem.
| Fallback Mode | Supports Queries (Bake)? | Supports Queries (Blob)? |
|---|---|---|
| Runtime | No | N/A |
| RuntimeManual | No | No |
| Hybrid | Yes | Yes |
| Preresolved | Sort-of | Sort-of |
| Runtime-Thin | Yes | Yes |
| Hybrid-Thin | Yes | Yes |
In Preresolved mode, locale fallback is not supposed to take place at all, but a "supported locale" query would return correct results for the locales for which the bundle was built, which is why I put "Sort-of" in those boxes.
Given that `RuntimeManual` is already a power-user feature, I'm kind-of okay with just documenting not to use that mode if you need supported-locale support.
> In bake data, we can generate impls based on whether or not the bundle was built with supported-locale prerequisites
Wanted to dive in here a bit more. We could generate either "positive" impls (something that tells the data provider that supported-locale queries are supported) or "negative" impls (something that says we don't support such queries). I'm thinking that a negative impl might be easier, because I don't want to impose on all source providers that they set the flag. The runtime fallback baked data impl would set a field in DataResponseMetadata saying that this data does not support the supported-locale query. Whether or not to set the field could depend on the value of `BuiltInFallbackMode`; we would need to add another variant to that enum. Note that we would want to set the field in the baked data even for hybrid mode, which does not currently generate fallbacker code or populate DataResponseMetadata::locale.
Here is a potential field for DataResponseMetadata:
```rust
#[non_exhaustive]
pub enum FallbackerType {
    Basic,
    Deduplicated,
    NoFallback,
}
```
Baked data would set `Basic` for RuntimeThin and `Deduplicated` for Runtime. Supported-locale queries would fail if they received a response with `Deduplicated` as its reported fallbacker type. They would, however, need to permit responses that didn't specify their fallbacker type.
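A sketch of the response-side check, assuming the enum lands in DataResponseMetadata as an optional, illustratively named field:

```rust
// Fail on Deduplicated; permit Basic, NoFallback, and unreported types.
fn check_supports_queries(metadata: &DataResponseMetadata) -> Result<(), DataError> {
    match metadata.fallbacker_type_bikeshed {
        Some(FallbackerType::Deduplicated) => {
            Err(DataError::custom("data built without supported-locale support"))
        }
        _ => Ok(()),
    }
}
```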
I think it should be enough to have a `supports_querying_for_supported_locales_bikeshed: bool` field inside the metadata, right? It should default to false, and providers would be responsible for setting it to `true` if they support that functionality.
Hmm. Trying to think about the boolean condition that is easiest to explain and hardest to get wrong.

How about simply:

- `pub is_last_resort_fallback: bool`: Whether the data payload comes from a last-resort fallback. The data has well-defined behavior, but it may or may not be correctly localized into the requested language.

The field defaults to `false`. It is set to `true` in the following situations:

- `LocaleFallbackProvider`, as used for most buffer provider fallbacking, falls back to the `und` locale and `und` was not the requested locale
- …the `und` locale and `und` was not the requested locale

With this logic, a source provider does not need any extra logic, which is great. If the source provider supports a locale, it just needs to make sure it returns non-`und` for that locale.
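With that bit in place, a supported-locale query could reduce to a plain load plus one metadata check (a sketch assuming the proposed field):

```rust
// Load normally; the provider stack sets the bit if it hit last-resort fallback.
let response = provider.load(DataRequest { locale: &locale, ..Default::default() })?;
let is_supported = !response.metadata.is_last_resort_fallback; // proposed field
```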
Observation: even if datagen is run with `RuntimeManual`, supported-locale queries work fine in positive cases; it's just that they might return false negatives. I'm okay with that caveat.
Okay let's bikeshed the datagen option. It could either be a new flag or a new fallback mode.
If a flag:

- `--always_include_base_language`: Always includes an entry for the language even if it is identical to `und`. For example, `en` decimal symbols are equivalent to `und` decimal symbols; without this flag, `en` may be removed from the exported data if runtime fallback is enabled. This mode is required for supported-locale queries to work at runtime.

If a fallback mode:

- `--fallback runtime-augmented` = same as `--fallback runtime`, but always include an entry for the language even if it is identical to `und`.
- `--fallback runtime-manual-augmented` = same as `--fallback runtime-manual`, but always include an entry for the language even if it is identical to `und`.

Note: I don't think there needs to be any difference in locales between these two modes (which I had previously called RuntimeThin and HybridThin); it is only whether runtime fallback is enabled in baked data mode.
A separate flag would avoid the potential confusion of having more types of fallback that are just used in special cases, so I would advocate for that.
If we go with a flag, maybe we can invert the polarity:
- `--strip_base_languages`: Remove data in language or language-script if it is equal to the data in `und` and the language was not explicitly requested. Slightly reduces data size, but causes supported-locale queries to fail.

So then the nicer behavior is enabled by default, but we still give clients a way to achieve the minimal data size.
Current state of thinking, @sffc and @robertbastian:
- A `can_load` or `dry_run` function may be wasteful given that it is likely to be followed by a plain load.

Some draft code:
@zbraniecki Thoughts on the above?
It's hard to think of a case where you'd want to query the provider for "is this locale supported" without following quickly with a real load for the data, in which case we don't need to special-case a "dry run" code path.
> It's hard to think of a case where you'd want to query the provider for "is this locale supported" without following quickly with a real load for the data
ECMA-402 does precisely this, right? It repeatedly checks the parents of a single locale to see if it has support for the data (which could take several iterations if the locale is long enough), and moves the data fetching part of the pipeline after checking that all JS options are correct. Or are you talking about a separate thing entirely?
Right, that's a valid use case that's not check-then-load, so we won't be able to optimise the double lookup. I think a `can_load` is the best solution for this.
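One possible shape for that, sketched with assumed names rather than a settled API:

```rust
/// Hypothetical: like DataProvider::load, but stops after locale fallback,
/// before deserializing or downcasting the payload.
pub trait DryDataProvider<M: KeyedDataMarker>: DataProvider<M> {
    fn dry_load(&self, req: DataRequest) -> Result<DataResponseMetadata, DataError>;
}
```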
Another observation: `DataProvider<M>` calls are likely to be inlined, since the function is highly generic. If the payload is not being used, the compiler has the opportunity to remove the code generating it (so long as it can tell that the code has no side effects).

I tend to think that we should just say that calling `.load()` to check if a locale is supported is what we recommend, rather than putting a lot of time into what might be premature optimization.
> ECMA-402 does precisely this, right? It repeatedly checks the parents of a single locale to see if it has support for the data (which could take several iterations if the locale is long enough), and moves the data fetching part of the pipeline after checking that all JS options are correct. Or are you talking about a separate thing entirely?
ECMA-402 requires taking a list of locales and checking which ones are supported. The use case of that function is for clients to pick a locale and then follow up with additional user code that might try to load data from that locale. So in a vacuum, ECMA-402 does not need the loaded values, but `supportedLocalesOf` is not typically used in a vacuum.
> ECMA-402 requires taking a list of locales and checking which ones are supported. The use case of that function is for clients to pick a locale and then follow up with additional user code that might try to load data from that locale. So in a vacuum, ECMA-402 does not need the loaded values, but `supportedLocalesOf` is not typically used in a vacuum.
I was more talking about things like the `BestFitMatcher` operation, which could make several data fetches just to resolve the preferred locale of a constructor call. However, I have to agree that this seems like a premature optimization, and users who want to optimise for data loads can just cache the previous results.
`BestFitMatcher` should be implemented by simply looping over the list and selecting the first one that doesn't return `und`.
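That loop could look like this, with `resolve` standing in for whichever supported-locale query lands:

```rust
// Pick the first requested locale whose data does not fall back to root.
fn best_fit_matcher(
    requested: &[DataLocale],
    resolve: impl Fn(&DataLocale) -> DataLocale,
) -> Option<DataLocale> {
    requested.iter().find(|&l| !resolve(l).is_und()).cloned()
}
```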
Shouldn't "BestFitMatcher" implement an "advanced" version of "LookupMatcher"? Or at least that's my expectation as an user, even if engines don't really implement that properly.
IIRC, no engine implements a more clever algorithm here.
Yes, but that brings an interesting question to the table: is that because the engines don't want to differ too much from "LookupMatcher", or because there aren't "proper" APIs in ICU4C for exploratory queries such as `can_load`?
I very often see clients who want to use ICU as a default behavior, but fall back to custom logic if ICU does not support a given locale.
The main problem, of course, is that the locale fallback chain is an essential piece of whether or not a locale is supported. If you have locale data for "en" and "en_001", but request "en_US" or "en_GB", the answer is that both of those locales are supported, even though they both load their data from a fallback locale.
I'm not 100% confident, but I think the prevailing use case is that programmers want to know whether the locale falls back all the way to root. If it gets "caught" by an intermediate language, then that's fine, as long as we don't use the stub data in root.
ECMA-402 has the concept of `supportedLocalesOf`. Although it's not clear in the MDN, it appears that this method has the ability to check for locale fallbacks. This is better than ICU4C's behavior of `getAvailableLocales`, which returns a string list and requires the user to figure out how to do fallbacks and matching on that list.
We could consider whether this use case fits in with the data provider, or whether we want to put it on APIs directly.