Open hsivonen opened 1 year ago
What is the "primary provider-backed payload"? It's not well defined in all cases.
The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. #58 has suggestions for how to solve the 402 problem in user land.
What is the "primary provider-backed payload"? It's not well defined in all cases.
For the collator, it would be the payload for CollationMetadataV1Marker.
The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. https://github.com/unicode-org/icu4x/issues/58 has suggestions for how to solve the 402 problem in user land.
My understanding of provider internals is insufficient for understanding what exactly #58 proposes for the ECMA-402 resolved locale info, especially in the case where the glue code opted to implement "best fit" by delegating to ICU4X's lookup mechanism.
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.
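To illustrate the instrumentation idea, here is a minimal sketch of a wrapper that records the locale reported in each response's metadata. Every type in it (Provider, Response, ResponseMetadata, FallbackProvider, RecordingProvider) is a hypothetical stand-in written for this example, not the real icu_provider API, and the subtag-trimming fallback is a simplification of what ICU4X actually does.

```rust
use std::cell::RefCell;

// Hypothetical stand-ins for illustration only; NOT the icu_provider types.
#[derive(Debug, Clone, PartialEq)]
pub struct ResponseMetadata {
    /// Locale the data was actually loaded for, if the provider reports it.
    pub locale: Option<String>,
}

pub struct Response {
    pub metadata: ResponseMetadata,
    pub payload: String,
}

pub trait Provider {
    fn load(&self, key: &str, requested: &str) -> Option<Response>;
}

/// Toy provider that falls back by trimming subtags, then to "und".
pub struct FallbackProvider {
    pub available: Vec<&'static str>,
}

impl Provider for FallbackProvider {
    fn load(&self, key: &str, requested: &str) -> Option<Response> {
        let mut candidate = requested.to_string();
        loop {
            if self.available.contains(&candidate.as_str()) {
                return Some(Response {
                    metadata: ResponseMetadata { locale: Some(candidate.clone()) },
                    payload: format!("{key} data for {candidate}"),
                });
            }
            match candidate.rfind('-') {
                Some(idx) => candidate.truncate(idx),
                None if candidate != "und" => candidate = "und".to_string(),
                None => return None,
            }
        }
    }
}

/// Wrapper that remembers which locale each key resolved to.
pub struct RecordingProvider<P> {
    inner: P,
    log: RefCell<Vec<(String, Option<String>)>>,
}

impl<P: Provider> RecordingProvider<P> {
    pub fn new(inner: P) -> Self {
        Self { inner, log: RefCell::new(Vec::new()) }
    }

    /// Most recent resolved locale recorded for `key`, if any.
    pub fn resolved_for(&self, key: &str) -> Option<String> {
        self.log
            .borrow()
            .iter()
            .rev()
            .find(|(k, _)| k.as_str() == key)
            .and_then(|(_, locale)| locale.clone())
    }
}

impl<P: Provider> Provider for RecordingProvider<P> {
    fn load(&self, key: &str, requested: &str) -> Option<Response> {
        let response = self.inner.load(key, requested)?;
        self.log
            .borrow_mut()
            .push((key.to_string(), response.metadata.locale.clone()));
        Some(response)
    }
}
```

The glue code would construct the ICU4X object through the wrapper and afterwards ask the wrapper which locale a given key resolved to. Note that this only works for constructors that accept a provider argument.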
Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?
I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402
Does it make sense to merely suggest it as opposed to providing a concrete crate for it with documentation that the crate is only provided for 402 compat? Or providing a Cargo option to enable such code in each component directly? I wouldn't mind if the option was named to discourage use along the lines of enable_conceptually_questionable_resolved_locale_for_ecma_402_compat_only.
Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?
Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.
The example seems to preclude the use of the baked-mode constructors that don't take a provider argument, which is unfortunate. I'm not sure, but my initial reaction is that I'd rather do a duplicative lookup if the JS app looks at the resolved options than defeat the baked code path for object construction.
Would it be incorrect to always return the requested locale as the resolved locale?
Would it be incorrect to always return the requested locale as the resolved locale?
That question can be understood in at least three senses: 1) what the caller wants to know if they care to actually inspect the resolved locale, 2) what's Web-compatible, 3) what fits within the spec's notion of implementation-defined.
In sense 1, incorrect. (Most notably, if the requested locale has a non-language component and the resolved locale does not retain that component, this shows that the implementation's data does not explicitly alter the main flavor of the language in a way that the component would change. Is this actionable information for the caller? Perhaps not.)
In sense 2, probably not Web-compatible, considering that things that deviate from what major browsers do tend not to be Web-compatible, but maybe Web-compatible in the sense that the information isn't really that actionable anyway.
In sense 3, maybe not strictly incorrect if you read all implementation-defined behavior as not required to even make sense and the observer not getting an infinite number of observations.
Returning the requested locale as the resolved locale gives the same result as preresolving locales at datagen time. If we don't have es-419 data and we fall back to es, I don't think the information whether the data is actually es-419 is actionable.
The bigger problem is falling back to und. Maybe we can have a flag in locid_transform that changes fallback behaviour to return an error when und is reached. This would need to be key-dependent, as we know for collation, for example, fallback to und is fine.
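A rough sketch of what a key-dependent "error on und" policy could look like follows. The function names, the error enum, and the policy table are all assumptions made for this example; nothing here is an existing locid_transform API, and the subtag trimming is a simplified stand-in for real fallback.

```rust
#[derive(Debug, PartialEq)]
pub enum FallbackError {
    /// Fallback ran out of subtags and this key may not use root data.
    RootNotAllowed,
    /// Root fallback is allowed, but there is no "und" data either.
    NoData,
}

/// Hypothetical policy table: which keys may silently fall back to "und".
pub fn root_fallback_allowed(key: &str) -> bool {
    matches!(key, "collator")
}

/// Trim subtags from `requested` until data is found; apply the per-key
/// policy once only the root locale is left.
pub fn resolve(
    available: &[&str],
    key: &str,
    requested: &str,
) -> Result<String, FallbackError> {
    let mut candidate = requested;
    loop {
        if available.contains(&candidate) {
            return Ok(candidate.to_string());
        }
        match candidate.rfind('-') {
            Some(idx) => candidate = &candidate[..idx],
            None => break,
        }
    }
    if !root_fallback_allowed(key) {
        return Err(FallbackError::RootNotAllowed);
    }
    if available.contains(&"und") {
        Ok("und".to_string())
    } else {
        Err(FallbackError::NoData)
    }
}
```

With data for es and und only, a request for fi-FI would be an error for a datetime key but would resolve to und for the collator key.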
I agree that a solution that is compatible with compiled data would be preferable.
I didn't look at the source, but I think Intl.Segmenter, which is a rather special case in its relationship to locale data, in Chrome fakes its resolved locale roughly by 1) considering languages that CLDR knows about in some sense as supported and 2) retaining the region if the region is CLDR-known to be associated with the language (sv-FI retains the region, but fi-SE does not).
Do you know any concrete uses of this information, which would break if we deviate? I don't want to let Chrome dictate how standards should be interpreted.
From a very quick look at GitHub search, I see one use case beyond test cases and debug logging:
Determining the host locale by executing Intl.DateTimeFormat().resolvedOptions().locale per StackOverflow and various other teaching materials.
So perhaps just echoing back the requested locale could work and not break the Web.
Of course, this only looks at the case where .locale is appended directly to .resolvedOptions() instead of there being an intermediate variable.
Yeah, echoing back the requested locale probably works. I think the most useful piece of information you can get is which one out of a list of locales you got. For example, if the locales requested were ["ff", "fr", "ar"], it is potentially useful to know that "fr" was chosen out of that list, so that you can for example render other components in that language.
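As a small illustration of "which one out of the list did I get", the sketch below returns the first requested tag for which any data can be found, echoing the tag exactly as requested. The function name and the subtag-trimming lookup are assumptions for this example and do not correspond to an existing ICU4X API; fallback to und deliberately does not count as a match here.

```rust
/// Return the first requested locale that has data, possibly via fallback
/// to a shorter tag. The tag is echoed back exactly as it was requested.
pub fn pick_requested<'a>(requested: &[&'a str], available: &[&str]) -> Option<&'a str> {
    requested.iter().copied().find(|tag| {
        let mut candidate = *tag;
        loop {
            if available.contains(&candidate) {
                return true;
            }
            // Drop the last subtag; stop once nothing is left to trim.
            match candidate.rfind('-') {
                Some(idx) => candidate = &candidate[..idx],
                None => return false,
            }
        }
    })
}
```

With data for fr, en and ar, a request list of ["ff", "fr", "ar"] yields "fr"; with data for es only, a request for es-419 is echoed back as es-419.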
Yeah, echoing back the requested locale probably works.
Should ECMA-402 change to require this?
@zbraniecki Thoughts on the above?
The web reality is that it will return the closest locale the engine had data for:
(new Intl.DateTimeFormat("es-FR")).resolvedOptions().locale == "es"
The key question is whether ICU4X should push the first implementation to ship ICU4X-backed ECMA-402 to the Web to bear the cost of finding out if deviating from the current Web reality is Web-compatible.
Given that ICU4X is supposed to work as an ECMA-402 back end, it would be rather odd for ICU4X to resist being able to implement what ECMA-402 currently says in a similar way to how deployed implementations do it.
Perhaps echoing back the requested locale would work, but is e.g. Chrome willing to try it out to see if it's Web-compatible?
I suggest doing what I originally requested but behind a repulsively-named Cargo option. That is, if conceptually_questionable_resolved_locale_for_ecma_402_compat_only is enabled, each ICU4X object that has try_new with locale would gain a field for the resolved locale, a getter for that field, and would store the locale from the appropriate payload metadata in that field in the code that loads the data.
und or to determine which locale from a list got used.
hi-Latn falling back to en maybe is okay for symbols but not patterns. Maybe en-IN is okay but not en-001. You could come up with a strategy for plurals saying, it's okay for it to fall back across countries but not across languages otherwise. For datetime symbols, you could have fallback multiple times because there are multiple keys, theoretically capable of multiple answers, barring runtime fallback mode. If you really want the correct answer, you need hybrid mode, not runtime fallback mode. The fact that it's key-dependent means I don't see a way of doing this well, and that's even before deciding the threshold for each use case. The main use case seems to be, if fallback went too far, you need to load more data. But "going too far" is not something we can define, it's use case dependent and key dependent and a mess.
@jedel1043 , see above:
Do we know how much Boa wants to match Node/Chrome behavior?
@jedel1043 , see above:
Do we know how much Boa wants to match Node/Chrome behavior?
Personally, I think it's alright if Boa has to deviate from V8 in order to offer better locale results.
Based on https://github.com/tc39/ecma402/issues/830#issuecomment-1711965221, there could be a way for datagen to store the list of locales for which it generated data and then some API to access that list. Needs design work.
Can we merge the discussion into #58? It seems to be going the same direction
In favor of merging.
Not exactly the same set of questions. This thread is about ResolvedLocale and #58 is about SupportedLocales. The solutions may overlap.
In #58 we have concluded that exposing a set of supported locales is not feasible, but determining the resolved locale is. You literally have a draft ResolvedLocalesAdapter for that issue, so I really struggle to understand what the difference is.
I changed https://github.com/unicode-org/icu4x/pull/4607 to be closing this issue.
Also see comment from @mihnita in https://github.com/unicode-org/icu4x/issues/2237#issuecomment-1201557171
2.0 blocking question: does baked data do this? From preliminary tests there might be a non-trivial size impact, I'll get some better numbers.
I still think retain-base-languages should fix this, but currently it does not fill in the missing locales:
$ cargo run -p icu4x-datagen -- --deduplication retain-base-languages --markers collator/data@1 collator/meta@1 --locales full --format dir --out /tmp/collator_retain -W
warning: ignoring `resolver` config table without `-Zmsrv-policy`
Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
Running `target/debug/icu4x-datagen --deduplication retain-base-languages --markers 'collator/data@1' 'collator/meta@1' --locales full --format dir --out /tmp/collator_retain -W`
2024-08-08T18:07:18.825Z INFO [icu_provider_export::export_impl] Datagen configured with deduplication retaining base languages, and these locales: ["<all>"]
2024-08-08T18:07:19.387Z INFO [icu_provider_export::export_impl] Generated marker collator/meta@1 (0.561s, 'si/dict' in 1.829ms, flushed in 4.183µs)
2024-08-08T18:07:19.529Z INFO [icu_provider_export::export_impl] Generated marker collator/data@1 (0.704s, 'zh/stroke' in 0.352s, flushed in 4.723µs)
$ ls /tmp/collator_retain/collator/meta@1/
af.json br.json dict et.json gu.json ig.json kok.json ml.json pa.json sk.json te.json unihan
am.json bs-Cyrl.json dsb.json fa.json ha.json is.json ku.json mn.json phonebk sl.json th.json ur.json
ar.json bs.json ee.json ff-Adlm.json haw.json ja.json ky.json mr.json phonetic smn.json tk.json uz.json
as.json ceb.json el.json fi.json he.json ka.json lkt.json mt.json pl.json sq.json to.json vi.json
az.json chr.json emoji fil.json hi.json kk.json ln.json my.json ps.json sr.json trad wo.json
be.json compat en-US-posix.json fo.json hr.json kl.json lo.json ne.json ro.json sr-Latn.json tr.json yi.json
bg.json cs.json eo.json fr-CA.json hsb.json km.json lt.json no.json ru.json stroke ug.json yo.json
bn.json cy.json eor fy.json hu.json kn.json lv.json om.json se.json sv.json uk.json zh.json
bo.json da.json es.json gl.json hy.json ko.json mk.json or.json si.json ta.json und.json zhuyin
--deduplication retain-base-languages --markers collator/data@1 --locales full --format dir, languages missing are?
collator/meta@1 it still doesn't retain base languages)
retain-base-languages doesn't include languages unknown to data, such as Klingon tlh. As far as collator data is concerned, en is like Klingon because there is no data for it. We could change retain-base-languages behavior, or we could add en to the ICU Export Data.
ECMA-402 requires various objects to be able to expose the resolved options. Common across different types is the resolved locale.
For ECMA-402 compat, we should make various ICU4X objects call take_metadata_and_payload instead of take_payload when loading their primary provider-backed payload and store the DataLocale from the metadata. We should then have a convention across ICU4X for retrieving that Locale from the ICU4X object.
The finer points of DataLocale vs. Locale are unclear to me, so I'm not sure if the convention should be fn resolved_locale(&self) -> &DataLocale allowing the application to call .into_locale() or fn resolved_locale(&self) -> Locale.
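To make the two candidate conventions concrete, here is a minimal sketch using stand-in types. Locale, DataLocale, Metadata and ToyCollator are hypothetical simplifications written for this example, not the real ICU4X types. Falling back to the requested locale when the metadata carries no locale is also an assumption of the sketch, as is the name resolved_locale_owned, which exists only so both options can sit side by side.

```rust
// Stand-in types for illustration only; NOT the real ICU4X types.
#[derive(Debug, Clone, PartialEq)]
pub struct Locale(pub String);

#[derive(Debug, Clone, PartialEq)]
pub struct DataLocale(pub String);

impl DataLocale {
    pub fn into_locale(self) -> Locale {
        Locale(self.0)
    }
}

/// Metadata as it might come back alongside the primary payload.
pub struct Metadata {
    pub locale: Option<DataLocale>,
}

/// Toy object that keeps the resolved locale of its primary payload.
pub struct ToyCollator {
    resolved: DataLocale,
}

impl ToyCollator {
    /// Store the locale from the metadata; echo the requested locale if the
    /// provider did not report one (an assumption of this sketch).
    pub fn from_response(metadata: Metadata, requested: &DataLocale) -> Self {
        let resolved = metadata.locale.unwrap_or_else(|| requested.clone());
        Self { resolved }
    }

    /// Option 1: borrow the DataLocale; the caller converts if needed.
    pub fn resolved_locale(&self) -> &DataLocale {
        &self.resolved
    }

    /// Option 2: hand out an owned Locale directly.
    pub fn resolved_locale_owned(&self) -> Locale {
        self.resolved.clone().into_locale()
    }
}
```

Option 1 avoids a clone when the caller only inspects the locale, while option 2 hides the DataLocale type from the public getter at the cost of an allocation per call.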