unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

ICU4X objects that try_new with a provider should store and expose the resolved locale #3906

Open hsivonen opened 1 year ago

hsivonen commented 1 year ago

ECMA-402 requires various objects to be able to expose the resolved options. Common across different types is the resolved locale.

For ECMA-402 compat, we should make various ICU4X objects call take_metadata_and_payload instead of take_payload when loading their primary provider-backed payload and store the DataLocale from the metadata. We should then have a convention across ICU4X for retrieving that Locale from the ICU4X object.

The finer points of DataLocale vs. Locale are unclear to me, so I'm not sure if the convention should be fn resolved_locale(&self) -> &DataLocale allowing the application to call .into_locale() or fn resolved_locale(&self) -> Locale.
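
A minimal sketch of the two getter conventions being weighed here. Everything in it is a hypothetical stand-in: the `DataLocale`, `Locale`, `ResponseMetadata` and `Collator` types below are simplified string wrappers, not the real ICU4X types, and `from_response` is an invented constructor that only models "store the locale from the metadata".

```rust
/// Hypothetical stand-in for the provider-level locale type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DataLocale(pub String);

/// Hypothetical stand-in for the application-level locale type.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Locale(pub String);

impl DataLocale {
    /// Models the `.into_locale()` conversion mentioned above.
    pub fn into_locale(self) -> Locale {
        Locale(self.0)
    }
}

/// Hypothetical metadata returned next to a payload.
pub struct ResponseMetadata {
    pub locale: Option<DataLocale>,
}

/// Hypothetical component that remembers the locale of its primary payload.
pub struct Collator {
    resolved: DataLocale,
}

impl Collator {
    /// Stores the locale reported in the metadata. If the provider did not
    /// report one, this sketch falls back to the requested locale.
    pub fn from_response(requested: &str, metadata: ResponseMetadata) -> Self {
        let resolved = metadata
            .locale
            .unwrap_or_else(|| DataLocale(requested.to_string()));
        Collator { resolved }
    }

    /// Convention 1: borrow the stored DataLocale; the caller converts if needed.
    pub fn resolved_locale(&self) -> &DataLocale {
        &self.resolved
    }

    /// Convention 2: hand out an owned Locale, at the cost of a clone.
    pub fn resolved_locale_owned(&self) -> Locale {
        self.resolved.clone().into_locale()
    }
}
```

The borrowing form leaves the conversion (and its allocation) to callers that actually need a `Locale`; the owned form is simpler to call but converts on every access.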

sffc commented 1 year ago

What is the "primary provider-backed payload"? It's not well defined in all cases.

The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. #58 has suggestions for how to solve the 402 problem in user land.

hsivonen commented 1 year ago

What is the "primary provider-backed payload"? It's not well defined in all cases.

For the collator, it would be the payload for CollationMetadataV1Marker.

The whole concept of the "resolved locale" is fraught and it's not up to ICU4X to perpetuate it. https://github.com/unicode-org/icu4x/issues/58 has suggestions for how to solve the 402 problem in user land.

My understanding of provider internals is insufficient for understanding what exactly #58 proposes for the ECMA-402 resolved locale info, especially in the case where the glue code opted to implement "best fit" by delegating to ICU4X's lookup mechanism.

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

sffc commented 1 year ago

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402.

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.
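
As a rough illustration of what "instrumenting the data provider" could look like, the sketch below wraps a provider and records the locale each response was actually found under. The `Provider` trait, `ToyProvider`, and `RecordingProvider` are all hypothetical and much simpler than the real ICU4X provider traits and `DataResponseMetadata`; the toy fallback just drops trailing subtags.

```rust
use std::cell::RefCell;

/// Hypothetical, simplified provider interface: returns the payload together
/// with the locale the data was actually found under.
pub trait Provider {
    fn load(&self, key: &str, requested: &str) -> Option<(String, String)>;
}

/// Toy provider that only has data for "es" and falls back by dropping subtags.
pub struct ToyProvider;

impl Provider for ToyProvider {
    fn load(&self, _key: &str, requested: &str) -> Option<(String, String)> {
        let mut candidate = requested;
        loop {
            if candidate == "es" {
                return Some(("<es payload>".to_string(), candidate.to_string()));
            }
            // Drop the last subtag, e.g. "es-FR" -> "es".
            match candidate.rfind('-') {
                Some(idx) => candidate = &candidate[..idx],
                None => return None,
            }
        }
    }
}

/// Wrapper that records (key, resolved locale) for every successful response.
pub struct RecordingProvider<P> {
    inner: P,
    seen: RefCell<Vec<(String, String)>>,
}

impl<P: Provider> RecordingProvider<P> {
    pub fn new(inner: P) -> Self {
        RecordingProvider { inner, seen: RefCell::new(Vec::new()) }
    }

    /// Most recently recorded resolved locale for `key`, if any.
    pub fn resolved_for(&self, key: &str) -> Option<String> {
        self.seen
            .borrow()
            .iter()
            .rev()
            .find(|(k, _)| k == key)
            .map(|(_, locale)| locale.clone())
    }
}

impl<P: Provider> Provider for RecordingProvider<P> {
    fn load(&self, key: &str, requested: &str) -> Option<(String, String)> {
        let response = self.inner.load(key, requested)?;
        self.seen
            .borrow_mut()
            .push((key.to_string(), response.1.clone()));
        Some(response)
    }
}
```

Glue code would construct the component through the wrapper and afterwards ask the wrapper, not the component, which locale was resolved for the key it cares about.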

hsivonen commented 1 year ago

Either way, which key should the glue code query for if the primary payload for a given ICU4X object isn't well defined in all cases?

I think it would be fine if ICU4X suggested which key to use for the resolved locale in cases required by 402

Does it make sense to merely suggest it as opposed to providing a concrete crate for it with documentation that the crate is only provided for 402 compat? Or providing a Cargo option to enable such code in each component directly? I wouldn't mind if the option was named to discourage use along the lines of enable_conceptually_questionable_resolved_locale_for_ecma_402_compat_only.

Should an application do what Boa does, and try to load data payloads ahead of actual ICU4X object instantiation with the assumption that the preflight will match the actual instantiation, or should the application load whatever is considered the primary payload lazily after the fact if the code calling ECMA-402 APIs requests the resolved options?

Neither. The data provider should be instrumented to get the resolved locale out of the DataResponseMetadata.

The example seems to preclude the use of the baked-mode constructors that don't take a provider argument, which is unfortunate. I'm not sure, but my initial reaction is that I'd rather do a duplicative lookup if the JS app looks at the resolved options than defeat the baked code path for object construction.
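
The "duplicative lookup only if the JS app asks" idea can be sketched as a lazily filled cache. `JsCollator` and the `lookup` callback are hypothetical names; the callback stands in for whatever extra provider query the glue code would run, and the actual object construction (for example through a baked-data constructor) is not modeled.

```rust
use std::cell::OnceCell;

/// Hypothetical glue-code wrapper around an ICU4X object.
pub struct JsCollator {
    requested: String,
    /// Filled on the first call to `resolved_locale`, never before.
    resolved: OnceCell<String>,
}

impl JsCollator {
    /// Construction does no resolved-locale work at all.
    pub fn new(requested: &str) -> Self {
        JsCollator {
            requested: requested.to_string(),
            resolved: OnceCell::new(),
        }
    }

    /// Runs `lookup` (the duplicative query) at most once and caches the result.
    pub fn resolved_locale(&self, lookup: impl FnOnce(&str) -> String) -> &str {
        self.resolved.get_or_init(|| lookup(&self.requested))
    }
}
```

Applications that never call `resolvedOptions()` pay nothing beyond the stored requested locale; those that do pay for one extra lookup per object.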

robertbastian commented 1 year ago

Would it be incorrect to always return the requested locale as the resolved locale?

hsivonen commented 1 year ago

Would it be incorrect to always return the requested locale as the resolved locale?

That question can be understood in at least three senses: 1) what the caller wants to know if they care to actually inspect the resolved locale, 2) what's Web-compatible, 3) what fits within the spec's notion of implementation-defined.

In sense 1, incorrect. (Most notably, if the requested locale has a non-language component and the resolved locale does not retain that component, this shows that the implementation's data does not explicitly alter the main flavor of the language in a way that the component would change. Is this actionable information for the caller? Perhaps not.)

In sense 2, probably not Web-compatible, considering that things that deviate from what major browsers do tend not to be Web-compatible, but maybe Web-compatible in the sense that the information isn't really that actionable anyway.

In sense 3, maybe not strictly incorrect if you read all implementation-defined behavior as not required to even make sense and the observer not getting an infinite number of observations.

robertbastian commented 1 year ago

Returning the requested locale as the resolved locale gives the same result as preresolving locales at datagen time. If we don't have es-419 data and we fall back to es, I don't think the information whether the data is actually es-419 is actionable.

The bigger problem is falling back to und. Maybe we can have a flag in locid_transform that changes fallback behaviour to return an error when und is reached. This would need to be key-dependent, as we know for collation, for example, fallback to und is fine.
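
A sketch of the key-dependent flag suggested above, under simplifying assumptions: fallback here only drops trailing subtags (real fallback also considers scripts and parent locales), and `resolve`, `FallbackError` and the `root_is_acceptable` parameter are hypothetical names, not an existing locid_transform API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum FallbackError {
    /// Fallback reached `und` for a key where root data is not acceptable.
    ReachedRoot,
    /// Not even root data is available.
    NoData,
}

/// Walks from `requested` towards `und` and returns the first locale that has
/// data. `root_is_acceptable` would be chosen per key, e.g. `true` for collation.
pub fn resolve(
    available: &[&str],
    requested: &str,
    root_is_acceptable: bool,
) -> Result<String, FallbackError> {
    let mut candidate = requested.to_string();
    loop {
        if available.contains(&candidate.as_str()) {
            return Ok(candidate);
        }
        // Drop the last subtag, e.g. "es-419" -> "es"; stop at a bare language.
        match candidate.rfind('-') {
            Some(idx) => candidate.truncate(idx),
            None => break,
        }
    }
    // Only the root locale is left at this point.
    if !root_is_acceptable {
        return Err(FallbackError::ReachedRoot);
    }
    if available.contains(&"und") {
        Ok("und".to_string())
    } else {
        Err(FallbackError::NoData)
    }
}
```

With data for `und`, `es` and `fr-CA`, a request for `es-419` resolves to `es` regardless of the flag, while a request for `de-AT` either resolves to `und` or fails with `ReachedRoot`, depending on the key's policy.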

I agree that a solution that is compatible with compiled data would be preferable.

hsivonen commented 1 year ago

I didn't look at the source, but I think Intl.Segmenter, which is a rather special case in its relationship to locale data, in Chrome fakes its resolved locale roughly by 1) considering languages that CLDR knows about in some sense as supported and 2) retaining the region if the region is CLDR-known to be associated with the language (sv-FI retains the region, but fi-SE does not).
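
The heuristic described above, as understood from this comment rather than from Chrome's source, could be sketched like this. The `KNOWN_REGIONS` table is purely illustrative and is not CLDR data, `fake_resolved_locale` is a hypothetical name, and only `language` or `language-region` tags are handled.

```rust
/// Illustrative table only: languages treated as supported, each with the
/// regions considered associated with it. Not real CLDR data.
const KNOWN_REGIONS: &[(&str, &[&str])] = &[
    ("sv", &["SE", "FI"]),
    ("fi", &["FI"]),
];

/// Returns a faked resolved locale, or `None` for an unknown language.
pub fn fake_resolved_locale(requested: &str) -> Option<String> {
    let mut parts = requested.split('-');
    let language = parts.next()?;
    let region = parts.next();
    // Step 1: the language must be known at all.
    let (_, regions) = KNOWN_REGIONS.iter().find(|(lang, _)| *lang == language)?;
    // Step 2: keep the region only if it is associated with the language.
    match region {
        Some(r) if regions.contains(&r) => Some(format!("{language}-{r}")),
        _ => Some(language.to_string()),
    }
}
```

With this table, `sv-FI` keeps its region and `fi-SE` is reduced to `fi`, matching the two examples in the comment.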

robertbastian commented 1 year ago

Do you know any concrete uses of this information, which would break if we deviate? I don't want to let Chrome dictate how standards should be interpreted.

hsivonen commented 1 year ago

From a very quick look at GitHub search, I see one use case beyond test cases and debug logging:

Determining the host locale by executing Intl.DateTimeFormat().resolvedOptions().locale per StackOverflow and various other teaching materials.

So perhaps just echoing back the requested locale could work and not break the Web.

Of course, this only looks at the case where .locale is appended directly to .resolvedOptions() instead of there being an intermediate variable.

sffc commented 1 year ago

Yeah, echoing back the requested locale probably works. I think the most useful piece of information you can get is which one out of a list of locales you got. For example, if the locales requested were ["ff", "fr", "ar"], it is potentially useful to know that "fr" was chosen out of that list, so that you can for example render other components in that language.
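
A minimal sketch of the "which one out of the list did I get" information, assuming the glue code has some way to ask whether a locale is supported. `pick_locale` and the `has_data` callback are hypothetical; the returned value is echoed back exactly as requested, with no fallback applied.

```rust
/// Returns the first requested locale for which `has_data` reports support.
pub fn pick_locale<'a>(
    requested: &[&'a str],
    has_data: impl Fn(&str) -> bool,
) -> Option<&'a str> {
    requested.iter().copied().find(|&loc| has_data(loc))
}
```

For the list `["ff", "fr", "ar"]` and an implementation without `ff` data, this yields `fr`, which the application can then reuse for other components.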

hsivonen commented 1 year ago

Yeah, echoing back the requested locale probably works.

Should ECMA-402 change to require this?

sffc commented 1 year ago

@zbraniecki Thoughts on the above?

zbraniecki commented 1 year ago

The web reality is that it will return the closest locale the engine had data for:

(new Intl.DateTimeFormat("es-FR")).resolvedOptions().locale == "es"

hsivonen commented 1 year ago

The key question is whether ICU4X should push the first implementation to ship ICU4X-backed ECMA-402 to the Web to bear the cost of finding out if deviating from the current Web reality is Web-compatible.

Given that ICU4X is supposed to work as an ECMA-402 back end, it would be rather odd for ICU4X to resist being able to implement what ECMA-402 currently says in a similar way to how deployed implementations do it.

Perhaps echoing back the requested locale would work, but is e.g. Chrome willing to try it out to see if it's Web-compatible?

I suggest doing what I originally requested but behind a repulsively-named Cargo option. That is, if conceptually_questionable_resolved_locale_for_ecma_402_compat_only is enabled, each ICU4X object that has a try_new taking a locale would gain a field for the resolved locale and a getter for that field, and the code that loads the data would store the locale from the appropriate payload metadata in that field.
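
A sketch of how such a Cargo option might gate the field and getter. The `Collator` type and its `try_new` signature are hypothetical simplifications: `metadata_locale` stands in for the locale taken from the payload metadata, and the feature would have to be declared in the crate's Cargo.toml.

```rust
pub struct Collator {
    requested: String,
    // The extra field only exists when the opt-in feature is enabled.
    #[cfg(feature = "conceptually_questionable_resolved_locale_for_ecma_402_compat_only")]
    resolved: String,
}

impl Collator {
    /// Hypothetical constructor; `metadata_locale` models the locale reported
    /// in the metadata of the primary payload.
    pub fn try_new(requested: &str, metadata_locale: &str) -> Result<Self, String> {
        // Without the feature, the metadata locale is intentionally discarded.
        #[cfg(not(feature = "conceptually_questionable_resolved_locale_for_ecma_402_compat_only"))]
        let _ = metadata_locale;

        Ok(Collator {
            requested: requested.to_string(),
            #[cfg(feature = "conceptually_questionable_resolved_locale_for_ecma_402_compat_only")]
            resolved: metadata_locale.to_string(),
        })
    }

    /// Always available.
    pub fn requested_locale(&self) -> &str {
        &self.requested
    }

    /// Only compiled in when the opt-in feature is enabled.
    #[cfg(feature = "conceptually_questionable_resolved_locale_for_ecma_402_compat_only")]
    pub fn resolved_locale(&self) -> &str {
        &self.resolved
    }
}
```

Builds that leave the feature off carry neither the extra field nor the getter, so only ECMA-402 glue code that opts in pays for them.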

hsivonen commented 1 year ago

@jedel1043 , see above:

Do we know how much Boa wants to match Node/Chrome behavior?

jedel1043 commented 1 year ago

@jedel1043 , see above:

Do we know how much Boa wants to match Node/Chrome behavior?

Personally, I think it's alright if Boa has to deviate from V8 in order to offer better locale results.

sffc commented 1 year ago

Based on https://github.com/tc39/ecma402/issues/830#issuecomment-1711965221, there could be a way for datagen to store the list of locales for which it generated data, plus some API to access that list. Needs design work.

robertbastian commented 8 months ago

Can we merge the discussion into #58? It seems to be going in the same direction.

Manishearth commented 8 months ago

In favor of merging.

sffc commented 8 months ago

Not exactly the same set of questions. This thread is about ResolvedLocale and #58 is about SupportedLocales. The solutions may overlap.

robertbastian commented 8 months ago

In #58 we have concluded that exposing a set of supported locales is not feasible, but determining the resolved locale is. You literally have a draft ResolvedLocalesAdapter for that issue, so I really struggle to understand what the difference is.

sffc commented 8 months ago

I changed https://github.com/unicode-org/icu4x/pull/4607 to be closing this issue.

sffc commented 4 months ago

Also see comment from @mihnita in https://github.com/unicode-org/icu4x/issues/2237#issuecomment-1201557171

robertbastian commented 4 months ago

2.0 blocking question: does baked data do this? From preliminary tests there might be a non-trivial size impact; I'll get some better numbers.

sffc commented 2 months ago

I still think retain-base-languages should fix this, but currently it does not fill in the missing locales:

$ cargo run -p icu4x-datagen -- --deduplication retain-base-languages --markers collator/data@1 collator/meta@1 --locales full --format dir --out /tmp/collator_retain -W
warning: ignoring `resolver` config table without `-Zmsrv-policy`
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.11s
     Running `target/debug/icu4x-datagen --deduplication retain-base-languages --markers 'collator/data@1' 'collator/meta@1' --locales full --format dir --out /tmp/collator_retain -W`
2024-08-08T18:07:18.825Z INFO  [icu_provider_export::export_impl] Datagen configured with deduplication retaining base languages, and these locales: ["<all>"]
2024-08-08T18:07:19.387Z INFO  [icu_provider_export::export_impl] Generated marker collator/meta@1 (0.561s, 'si/dict' in 1.829ms, flushed in 4.183µs)
2024-08-08T18:07:19.529Z INFO  [icu_provider_export::export_impl] Generated marker collator/data@1 (0.704s, 'zh/stroke' in 0.352s, flushed in 4.723µs)
$ ls /tmp/collator_retain/collator/meta@1/
af.json  br.json       dict              et.json       gu.json   ig.json  kok.json  ml.json  pa.json   sk.json       te.json   unihan
am.json  bs-Cyrl.json  dsb.json          fa.json       ha.json   is.json  ku.json   mn.json  phonebk   sl.json       th.json   ur.json
ar.json  bs.json       ee.json           ff-Adlm.json  haw.json  ja.json  ky.json   mr.json  phonetic  smn.json      tk.json   uz.json
as.json  ceb.json      el.json           fi.json       he.json   ka.json  lkt.json  mt.json  pl.json   sq.json       to.json   vi.json
az.json  chr.json      emoji             fil.json      hi.json   kk.json  ln.json   my.json  ps.json   sr.json       trad      wo.json
be.json  compat        en-US-posix.json  fo.json       hr.json   kl.json  lo.json   ne.json  ro.json   sr-Latn.json  tr.json   yi.json
bg.json  cs.json       eo.json           fr-CA.json    hsb.json  km.json  lt.json   no.json  ru.json   stroke        ug.json   yo.json
bn.json  cy.json       eor               fy.json       hu.json   kn.json  lv.json   om.json  se.json   sv.json       uk.json   zh.json
bo.json  da.json       es.json           gl.json       hy.json   ko.json  mk.json   or.json  si.json   ta.json       und.json  zhuyin

sffc commented 2 months ago