unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Decide on exposing LanguageIdentifier #64

Closed sffc closed 3 years ago

sffc commented 4 years ago

This is a follow-up to https://github.com/unicode-org/icu4x/pull/43#discussion_r413329507.

There are some compelling arguments that we should expose only Locale and not LanguageIdentifier as public API in ICU4X. This issue is a reminder to revisit this discussion once the rest of unic-locale is rolled in and we are able to perform more testing.

zbraniecki commented 4 years ago

It may be helpful to list the arguments brought by @nciric:

  1. Having two similar APIs (Locale vs. LanguageIdentifier) will be confusing.
  2. ECMA-402 doesn't have a language-identifier type, and I am not seeing demand for it.
  3. It feels like a low-level struct, useful for resource loading (CLDR), that doesn't have a place in the public surface.
zbraniecki commented 4 years ago

Short response:

  1. In my experience of working with those crates, having the two types is not confusing, particularly since From/Into conversion between them is infallible and cheap (see the sketch below).
  2. ECMA-402 has different design characteristics from ICU4X. Low-level performance and memory optimization is not as critical in a JS environment as it is in constrained systems programming.
  3. That's an opinion, and I hold the opposite one :)
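For illustration, here's a minimal sketch of that conversion using the unic-langid and unic-locale crates (the From impls in both directions are assumed from those crates):

use unic_langid::LanguageIdentifier;
use unic_locale::Locale;

fn main() {
    let langid: LanguageIdentifier = "sr-Cyrl-RS".parse().expect("a valid language identifier");

    // Upgrading is infallible: a Locale is a LanguageIdentifier plus (empty) extensions.
    let locale = Locale::from(langid);

    // Downgrading simply drops the extensions.
    let _langid_again = LanguageIdentifier::from(locale);
}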

Longer response:

Locale is a strict superset of LanguageIdentifier.

I see two major reasons to invest in LanguageIdentifier, at least in the short term:

1) API maturity
2) Performance/Memory

API maturity

The LanguageIdentifier API is simple: it is a struct with four fields. It's a great initial API for us to work with - it's simple, cheap, requires no external data, and provides the foundational building block required for any internationalization operation. The Locale API surface is several times bigger; bringing it to maturity will take longer, and there are more open questions and considerations to take into account.
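For reference, the rough shape of those four fields (a sketch using the TinyStr types from the tinystr crate, not the exact definition in the crate):

struct LanguageIdentifier {
    language: Option<TinyStr8>,        // None encodes the "und" (undetermined) subtag
    script: Option<TinyStr4>,          // e.g. "Cyrl"
    region: Option<TinyStr4>,          // e.g. "US"
    variants: Option<Box<[TinyStr8]>>, // rare; allocated only when present
}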

This reason will go away once we get sufficient eyes on the Locale API, test coverage and in-field experience.

Performance/Memory

When working on software like Firefox, we handle a high number of language identifiers during the app life cycle. During startup we handle around 1,000-10,000 of them, depending on many environmental and system factors, and we mostly need to validate and parse them, store them in lists, sort them, and match them against each other. They're used for app user-interface negotiation, various subcomponent negotiations, fonts, geolocation, etc. I expect similar operations to be shared by OSes such as Android, Fuchsia, macOS, Windows, etc.

The underlying struct TinyStr used for storing the subtags and extensions is extremely memory-efficient and fast for validation, canonicalization, and comparisons.

But even with that, on the conservative end, a LanguageIdentifier using 32 bytes allows us to store 10,000 of them in 320 KB of memory. Storing the same list of language identifiers as Locale would take around 1.84 MB.

This memory is not all used at once, and more often the count will be on the lower end, but the drastic difference, combined with the fact that in many environments a LanguageIdentifier contains all of the information provided by the identifier, imho justifies exposing the cheaper struct.
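If you want to sanity-check the arithmetic on your own target, a quick probe (the exact size depends on the crate version and platform, so treat the 32-byte figure as indicative):

use std::mem::size_of;
use unic_langid::LanguageIdentifier;

fn main() {
    println!("LanguageIdentifier: {} bytes", size_of::<LanguageIdentifier>());
    // 10,000 identifiers at 32 bytes each:
    println!("10,000 ids: {} bytes (~320 KB)", 10_000 * 32);
}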

Similar considerations apply to performance. Constructing 21 language identifiers on my high-end laptop takes ~567 ns. Constructing the same identifiers as Locale takes ~1.435 µs, or roughly 2.5x longer. Comparing 21 LanguageIdentifier instances takes ~11 ns on my laptop; comparing two Locale instances takes ~192 ns.
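Those numbers come from proper benchmarks, but for a rough at-home reproduction a naive harness might look like this (the tags are illustrative; this is not the harness behind the figures above):

use std::time::Instant;
use unic_langid::LanguageIdentifier;

fn main() {
    let tags = ["en-US", "fr", "de-AT", "sr-Cyrl-RS", "pl", "ru", "it-IT"];

    let start = Instant::now();
    let parsed: Vec<LanguageIdentifier> = tags
        .iter()
        .map(|t| t.parse().expect("a valid BCP 47 tag"))
        .collect();
    println!("parsed {} identifiers in {:?}", parsed.len(), start.elapsed());

    let start = Instant::now();
    let equal = parsed[0] == parsed[1];
    println!("compared two identifiers in {:?} (equal: {})", start.elapsed(), equal);
}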

I'm the first to admit that Locale is just not as well optimized as LanguageIdentifier. The reason is that its API surface is so much bigger. There is a lot of low-hanging fruit, and my hope is that we'll get Locale's performance characteristics to match those of LanguageIdentifier much more closely. In fact, the numbers I showed above are already ~21% better for this PR than for what I had in unic_locale, because I started optimizing. :) I also recognize that Locale is already multiple times faster and cheaper than ICU4C's Locale.

But I don't know if it will be possible to erase this difference, and I'd prefer to wait until the numbers are closer and the API maturity is more complete before we decide whether we want to keep the LanguageIdentifier struct public.

The good news is that since Locale is a strict superset, if we can get those two factors in check, it should be very straightforward to upgrade from one to the other, and if we do this before 1.0 I'm comfortable with the versioning and compatibility story.

sffc commented 4 years ago

Additional arguments that I brought up (all in favor of not exposing LanguageIdentifier):

  1. Exposing LanguageIdentifier gives a hint, and I would venture to say a strong hint, to clients that maybe this is something that they should be using, when in reality, clients should almost always be using the Locale type.
    • Developers new to the field know language and region subtags, like "en-US", and maybe they know script subtags, like "sr-Cyrl", but knowledge of the importance of Unicode extension keywords is much more limited. If the gold-standard i18n library gives them a shiny-looking LanguageIdentifier type, touting its performance, developers might think they should use it as the go-to type for locale strings.
  2. If performance is the main use case, there is no such thing as a one-size-fits-all solution. Performance needs are unique to each client. I brought up the example of how ICU LanguageMatcher, used throughout the industry, matches only on Language, Script, and Region (LSR), not Variants.
  3. Even if we decided that we really wanted to introduce a compact, performant type, it seems to me that a very important feature of such a type would be the ability to implement Copy, and LanguageIdentifier is unable to implement Copy due to supporting an unbounded number of variants (a requirement I questioned in #52).
  4. The choice of language, script, region, and variants is arbitrary. Yes, it's backed by the Unicode Language Identifier spec, but just because something is in a spec doesn't mean we have to implement it. I have not yet seen a compelling reason why this particular set of subtags is more useful than other arbitrary sets of subtags.
sffc commented 4 years ago

The underlying struct TinyStr used for storing the subtags and extensions is extremely memory-efficient and fast for validation, canonicalization, and comparisons.

I like TinyStr. Why aren't we using it in Locale? In other words, why can't Locale be something like:

struct Locale {
    language: TinyStr4,
    script: Option<TinyStr4>,
    region: Option<TinyStr4>,
    // Allocated only when the less common subtags are present.
    extended_subtags: Option<Box<ExtendedSubtags>>,
}

struct ExtendedSubtags {
    // variants, Unicode extensions, etc.
}

In the common case, extended_subtags is None. Only when you have some of the less common subtags do you trigger the Box to be allocated.

This may or may not be the optimal subset of subtags (maybe you want to support at least a single variant in the main Locale struct, for example), but I don't see why Locale and LanguageIdentifier need to be fundamentally different in their memory footprint.
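Note that thanks to Rust's niche optimization, an Option<Box<...>> field costs a single pointer when it is None, so the common case pays almost nothing. A minimal check (the ExtendedSubtags body is a stand-in):

use std::mem::size_of;

struct ExtendedSubtags {
    variants: Vec<String>, // stand-in for variants, Unicode extensions, etc.
}

fn main() {
    // Option<Box<T>> is pointer-sized: None is represented as a null pointer.
    assert_eq!(size_of::<Option<Box<ExtendedSubtags>>>(), size_of::<usize>());
}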

This memory is not all used at once, and more often the count will be on the lower end, but the drastic difference, combined with the fact that in many environments a LanguageIdentifier contains all of the information provided by the identifier, imho justifies exposing the cheaper struct.

But, do the "many environments" guarantee that they won't add Unicode extension keywords?

I see it as a chicken-and-egg problem. Apps use LanguageIdentifier because operating systems don't put Unicode extensions in their BCP47 tags. But, if operating systems started putting them in, then apps wouldn't be able to interpret them.

In my opinion, as a Unicode subcommittee, we need to appreciate the extra responsibility we take on to evangelize for best i18n practices. More people should be using Unicode extensions, and advertising a type that doesn't support them would be a disservice to Unicode, to our clients, and even to end users.

zbraniecki commented 4 years ago

Additional arguments that I brought up (all in favor of not exposing LanguageIdentifier):

1. Exposing LanguageIdentifier gives a hint, and I would venture to say a _strong_ hint, to clients that maybe this is something that they should be using, when in reality, clients should almost always be using the Locale type.

I think this argument could be rephrased as "We should use the API to educate users on how to internationalize their software." I believe that statement to be much more true in the context of ECMA-402 than in the context of ICU4X.

In particular, I find it pretty unconvincing that we should not expose a perfectly usable API implementing a publicly available standard because "we don't want to give a hint that it could be used, while we believe most users should use the other one". It's a bit as if Rust exposed only the "preferred" methods, on the rationale that people might otherwise reach for the alternatives when they should probably use the most common one.

This way of thinking is much more convincing to me in the context of ECMA-402, while in the context of ICU4X I'd prefer to aim for a minimal API surface for reasons of maintenance rather than education.

   * Developers new to the field know language and region subtags, like "en-US", and maybe they know script subtags, like "sr-Cyrl", but knowledge of the importance of Unicode extension keywords is much more limited. If the gold-standard i18n library gives them a shiny-looking LanguageIdentifier type, touting its performance, developers might think they should use it as the go-to type for locale strings.

Documentation is a perfect place to explain the differences and tradeoffs and to provide recommendations. Here's an excellent entry on std::collections explaining when to use which of them: https://doc.rust-lang.org/std/collections/index.html

In most cases users should use just a couple of them, but there are tradeoffs, and systems-level software engineers may encounter use cases for the others based on algorithmic complexity, allocations, and the performance of particular operations.

Again, this is not the kind of educational documentation I'd expect ECMA-402 to provide for collections in JavaScript, but it is one I'd expect to see in C++, Java, or Rust.

2. If performance is the main use case, there is no such thing as a one-size-fits-all solution.  Performance needs are unique to each client.  I brought up the example of how ICU LanguageMatcher, used throughout the industry, matches only on Language, Script, and Region (LSR), not Variants.

There are two structures described in UTS #35, in the chapter "Language and Locale Identifiers", and I implemented exactly those two. Thanks to the way they're implemented, users are free to compose others. The exact point at which LanguageIdentifier stops also happens to give a really good perf/memory balance.

3. _Even if_ we decided that we really wanted to introduce a compact, performant type, it seems to me that a very important feature of such a type would be the ability to implement `Copy`, and LanguageIdentifier is unable to implement `Copy` due to supporting an unbounded number of variants (a requirement I questioned in #52).

Can you elaborate on why you believe that implementing Copy is a very important feature?

4. The choice of language, script, region, and variants is arbitrary.  Yes, it's backed by the Unicode Language Identifier spec, but just because something is in a spec doesn't mean we have to implement it.  I have not yet seen a compelling reason why this particular set of subtags is more useful than other arbitrary sets of subtags.

Oh, OK, that's a much larger point! I think this is a discussion worth having on its own, but it may also be a good reason to kick off a separate conversation on how we want to approach our own ideas for diverging from Unicode. Do we want to strictly communicate upstream? Do we want to use our intuition to decide where and how to diverge? Do we want to somehow document, explain, and possibly attempt to standardize our divergences? How do we plan to handle the resulting misalignments between our divergences and other systems that follow Unicode?

Coming back to your point: assuming you don't also intend to question the Locale structure, LanguageIdentifier can be seen as Locale minus Extensions. That seems to be the intention behind it in Unicode, further repeated by W3C - https://www.w3.org/TR/ltli/

Language tags can provide information about the language, script, region, and language variation using subtags. But sometimes there are international preferences that do not correlate directly with any of these. For example, many cultures have more than one way of sorting content items, and so the appropriate sort ordering cannot always be inferred from the language tag by itself. So, for example, German language users might want to choose between the sort orderings used in a dictionary versus in a phone book.

One way to indicate these preferences is via registered Extensions to [BCP47]. The Unicode Common Locale Data Repository project [CLDR] maintains two such extensions: [RFC6497] defines an extension that describes transformations (generally text transformations, such as transliteration between scripts). [RFC6067] defines Unicode locales, which provide the ability to specify in a language tag a number of the international preference variations that users or content authors might wish to specify directly (such as the German dictionary/phone book difference described above).

My experience matches the line of thinking used by both W3C and Unicode - a Language Identifier describes language information. It contains the language subtag itself, a script subtag because some languages use multiple scripts, a region subtag because many languages operate in multiple regions, and variants because some LSR combinations have additional variants that don't fit in the LSR subtags. The same line of thinking seems to be represented in BCP 47. For example, software translations will be well and sufficiently described by a LanguageIdentifier such as sl-nedis (Slovenian, Nadiza dialect), de-CH-1996 (Swiss German as written using the spelling reform beginning in the year 1996), or ja-JP-macos (Japanese used in Japan with macOS-specific vocabulary).

Extensions provide additional information: internationalization preferences (Unicode), information about applied transformations, or private use.

In the context of system software, it seems to me that Locale and LanguageIdentifier are two units that quite fully describe all use cases I can think of, with most user-facing operations requiring Locale, while some lower-level ones being able to use just the LanguageIdentifier (negotiation, font selection, language resource description, etc.).

I can imagine that being my bias, since I follow Unicode quite closely, but from working on a web engine, W3C seems to fit this model perfectly too, and with modern languages also following Unicode (Rust, Swift, etc.), it seems quite natural to me to use those two units rather than some other combination of subtags and extensions.

Finally, the design I'm proposing in the current PR, thanks to suggestions from @nciric, makes each subtag and extension a structure of its own, which should make it possible to freely mix and match the needed fields and construct any collections you deem necessary for your task (such as LSR); see the sketch below.
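As an illustration of that mix-and-match idea, a custom LSR key could be assembled from standalone subtag wrappers; a sketch, with hypothetical stand-ins for the PR's subtag types:

use tinystr::{TinyStr4, TinyStr8};

// Hypothetical standalone subtag wrappers, along the lines of the PR's design.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Language(TinyStr8);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Script(TinyStr4);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Region(TinyStr4);

// A custom LSR key: because every field is a fixed-size TinyStr wrapper, the
// whole key can be Copy, unlike a full LanguageIdentifier with its unbounded
// variants.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Lsr {
    language: Language,
    script: Option<Script>,
    region: Option<Region>,
}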

My argument for stopping at LanguageIdentifier for now is therefore that it serves as a good middle point between nothing and Locale, it is well defined in the Unicode standard, it is used in W3C, DOM, and JS standards, and it has lent itself to quite a large range of tasks, powering Gecko and Firefox successfully for several years now (and other engines in one way or another, via web standards implementations).

I'm definitely a proponent of moving toward Locale, since it is a superset, both for ICU4X and for Web Standards (ECMA402, [navigator.locales](https://bugzilla.mozilla.org/show_bug.cgi?id=1303579) proposal etc.), but having a stepping-stone toward it seems natural, and LanguageIdentifier seems like a good one for that.

zbraniecki commented 4 years ago

I like TinyStr. Why aren't we using it in Locale? In other words, why can't Locale be something like:

Well, I believe Variant should be part of the core list rather than an extension, but except for that, yes, I hope we'll end up with a model similar to that.

In the common case, extendedSubtags is None. Only when you have some of the less common subtags do you trigger the Box to be allocated.

We may not even need to be that binary. Several fields can be opt-in and should end up costing very close to nothing if not used.

but I don't see why Locale and LanguageIdentifier need to be fundamentally different in their memory footprint.

I also don't see that. I listed two reasons that make me propose we expose LanguageIdentifier for now, and both of them are fixable; it just takes time. I'd prefer to have something to work with in the meantime, until we have all the Extensions APIs more stable and memory/perf in check.

But, do the "many environments" guarantee that they won't add Unicode extension keywords?

Hmm, I have not encountered a reason for font selection to depend on extensions, nor for language resources to be described with them. I can imagine theoretical scenarios where they may, but I hope that by then we'll have Locale ready and be able to switch to it at no cost since it is a strict superset.

I see it as a chicken-and-egg problem. Apps use LanguageIdentifier because operating systems don't put Unicode extensions in their BCP47 tags. But, if operating systems started putting them in, then apps wouldn't be able to interpret them.

Oh, I see. I definitely agree that in many, even most, cases Locale is what should be used. In other words, the primary use of this API is for the user to request that the software use a certain locale, and that should be communicated using the Locale API; LanguageIdentifier is not sufficient.

But as I mentioned above, there are use cases where LanguageIdentifier is perfectly usable and likely sufficient forever. Google Chrome or Firefox translation resources are unlikely to be identified by any extensions (not impossible, but quite unlikely), and font decisions are unlikely to need extensions taken into account. Another example: Firefox serves a different default combination of search engines to each of some 30 major geopolitical regions, and it is unlikely that any of those regions would need extensions to be defined, or that the negotiation between that list and the user-requested list would need to take extensions into account.

So it's not always just the chicken and the egg. Sometimes it's just that the task doesn't involve extensions.

That doesn't mean we need LanguageIdentifier, it just means that for those cases, we should make Locale not require paying for things it doesn't need :)

In my opinion, as a Unicode subcommittee, we need to appreciate the extra responsibility we take on to evangelize for best i18n practices. More people should be using Unicode extensions, and advertising a type that doesn't support them would be a disservice to Unicode, to our clients, and even to end users.

In my previous comment I stated that I disagree with that position. Reading your comment now, I'm more inclined to agree that even at the systems level we may, and should, use our library design to help users and "reward" good choices. For almost all users, even low-level ones, Locale is the right choice, and if we ever release ICU4X with LanguageIdentifier it should be an outlier optimization; my hope is that by the time we're ready to talk about releases, it will not be needed anymore.

So, to sum up my position: I see LanguageIdentifier as a temporary stepping stone toward Locale, I intend to work on Locale to bring it close to LanguageIdentifier perf/mem and API maturity with the help of the ICU4X WG, and I hope that by the time we are talking about release, we can look back at this issue and decide to remove this stepping stone.

sffc commented 4 years ago

I think this argument could be rephrased as "We should use the API to educate users on how to internationalize their software." I believe that statement to be much more true in the context of ECMA-402 than in the context of ICU4X.

The archetype of clients of the "X" phase of ICU4X is not much different than the archetype of clients of ECMA-402. Let's call them "app developers".

Given that we should plan for ICU4X to be on a track for broad consumption by app developers, we should design the API to be in the app developers' best interest. That's not to say that we shouldn't also make it equally useful for system-level clients, of course, like Gecko and Fuchsia. However, I'm a firm believer in an API's responsibility to nudge programmers in a certain direction, but still allowing power users to do things their own way, requiring more verbose code to do so.

Can you elaborate on why you believe that implementing Copy is a very important feature?

Since a Copy type doesn't require memory allocation, it can be used in more situations, such as being passed across FFI boundaries more easily.
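A minimal sketch of what Copy buys at an FFI boundary: a fixed-size #[repr(C)] value can be handed to C by value, with no allocation and no ownership transfer (the field encodings here are hypothetical):

#[repr(C)]
#[derive(Clone, Copy)]
pub struct FfiLsr {
    pub language: u64, // e.g. the raw payload of a TinyStr8
    pub script: u32,   // e.g. the raw payload of a TinyStr4 (0 = absent)
    pub region: u32,
}

#[no_mangle]
pub extern "C" fn lsr_hash(lsr: FfiLsr) -> u64 {
    // The value arrives by copy; the caller keeps its own, and no Drop glue
    // or deallocation contract crosses the boundary.
    lsr.language ^ ((lsr.script as u64) << 32) ^ (lsr.region as u64)
}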

In the context of system software, it seems to me that Locale and LanguageIdentifier are two units that quite fully describe all use cases I can think of, with most user-facing operations requiring Locale, while some lower-level ones being able to use just the LanguageIdentifier (negotiation, font selection, language resource description, etc.).

There we go! A list of use cases for which you claim that a Unicode Language Identifier is sufficient. Let's break those down.

Negotiation: OK. ICU uses the even more restricted set, LSR.

Font Selection: Sure. I think this is based on exemplar characters, which, as far as I know, are currently derived only from the language identifier portion. However, since we're thinking about what could happen, is there a possibility that the exemplar characters depend on an extension keyword? Consider a theoretical scenario in which Unicode were to introduce an extension keyword that enables or disables the sharp S 'ß' in German. That preference could change the set of fonts able to render the text.

Language Resource Description: The canonical example of when a Unicode Language Identifier is the right tool.

My argument for stopping on LanguageIdentifier for now is … I'm definitely a proponent of moving toward Locale … having a stepping-stone toward it seems natural … I listed two reasons which make me propose that we expose LanguageIdentifier for now, and both of them are fixable. It just takes time.

I agree. It's fine as a stepping stone. The question I hope to answer in this issue is what we want to ship in v1. And maybe we can't answer that question until we do more benchmarks on Locale, which is fine. (Although, as mentioned above, I don't think performance should be the deciding factor; the decision should stem from use cases.)

zbraniecki commented 4 years ago

The archetype of clients of the "X" phase of ICU4X is not much different than the archetype of clients of ECMA-402. Let's call them "app developers".

I disagree. I believe they're quite different: the use cases are quite different and the API consequences are quite different. In particular, ECMA-402 is not deprecable; ICU4X is. JS is garbage-collected; ICU4X is not. Someone writing software in JS is much, much less likely to operate in a resource-constrained environment than some major targets for ICU4X, like underpowered wearables.

Since a Copy type doesn't require memory allocation, it can be used in more situations, such as being passed across FFI boundaries more easily.

We're passing lists of u64/u32 quite successfully between C++ and Rust. I hope it won't be a major issue for lists in ICU4X.

Negotiation: OK. ICU uses the even more restricted set, LSR.

Which is surprising. How does one select available sl-nedis or ja-JP-macos resources over available sl and ja-JP resources?
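For a concrete picture of what variant-aware negotiation looks like, here's a sketch using the fluent-langneg crate (treat the exact API names and signature as an assumption):

use fluent_langneg::{negotiate_languages, NegotiationStrategy};
use unic_langid::LanguageIdentifier;

fn main() {
    let requested: Vec<LanguageIdentifier> = vec!["sl-nedis".parse().unwrap()];
    let available: Vec<LanguageIdentifier> = vec![
        "sl".parse().unwrap(),
        "sl-nedis".parse().unwrap(),
        "ja-JP".parse().unwrap(),
    ];
    let default: LanguageIdentifier = "en".parse().unwrap();

    // Because the negotiation sees variants, the exact sl-nedis match can be
    // ranked ahead of plain sl; an LSR-only matcher would treat them alike.
    let supported = negotiate_languages(
        &requested,
        &available,
        Some(&default),
        NegotiationStrategy::Filtering,
    );
    println!("{:?}", supported);
}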

Consider a theoretical scenario in which

In such a scenario, if my system (Gecko) decided to start relying on that bit for font selection, we would likely update our module to handle Locale rather than LanguageIdentifier so that the bit can be retrieved. Since the user-preference information will already use Locale, we will just keep the full scope when operating on the list of locales in font selection. Bottom line: I expect it to be upgradable.

The question I hope to answer in this issue is what we want to ship in v1

Right. I think we've accumulated data in this thread on where we stand today on this question, and I hope that before v1 we'll have all the data needed to answer it!

macchiati commented 4 years ago

Zibi, you raise reasonable concerns. But the "extended subtags" approach handles the less frequent cases without compromising the functionality. For the frequent cases where we just have LS?R?, it handles the issue fine. And where someone does pass in variants or extensions, it is for a reason; it is better to pay a little bit of cost than to toss them on the floor. I have seen, over and over (and over), cases where people took shortcuts in i18n and regretted it later. As Shane says, when you make it easy for people to do the wrong thing, they do it. It is absolutely fine to heavily optimize the most common cases, but we should allow for the uncommon cases as well.


sffc commented 4 years ago

I think this issue isn't immediately actionable, so I'm assigning it to "backlog-v1" to revisit before we reach Version 1. I am closing the issue until that time comes.

sffc commented 3 years ago

For various reasons over the past year, I've had a change of heart on this issue. I think we should keep LanguageIdentifier and Locale separate. However, we should make sure that we consistently use Locale in all public APIs where we need it. (This was discussed in #492.)