tc39 / proposal-intl-segmenter

Unicode text segmentation for ECMAScript

Custom Dictionaries #133

Open nathanhammond opened 3 years ago

nathanhammond commented 3 years ago

ICU's BreakIterator has clear limitations in its approach for character-based languages without textual word boundaries. When used directly, it allows you to specify a dictionary to work around those limitations, but the Intl.Segmenter API does not expose that functionality. I worry that standardizing on ICU's BreakIterator approach, without providing the escape valve ICU provides of specifying a custom dictionary, encodes bias into ECMA 402.

For example, ICU's supplied segmentation dictionaries conflate a significant number of languages with distinct usage. This can be particularly fragile across locales. Taiwan, Hong Kong, Singapore, and China all have distinct usage.

(Not to mention languages less-used than the dominant language in those locales.)

I believe that we need to expose custom dictionaries in order to ship this.

sffc commented 3 years ago

ECMA-402 specifically leaves the exact algorithm up to the implementor. Nothing says that engines must use a dictionary-based segmenter. I've heard concerns over the long life of this proposal that there's fear of ICU becoming the de-facto standard if that's what V8 chooses to use, but the proposal has reached Stage 3 despite those concerns. Like all of ECMA-402, Intl.Segmenter is "best effort", and the behavior could change over time or between implementations.

If a client wants to provide their own dictionary, that's a lot of data, and then they may as well ship their own segmentation engine code. The purpose of Intl.Segmenter is to offer a lightweight solution.

nathanhammond commented 3 years ago

Worth mentioning, I am starting from the position of (personally) being entirely unopposed to the standardizing of "just proxy to ICU" in 402. I feel that is an honest description of the state of the world and would bring significant clarity to the effort. My opposition on this particular issue is not at all colored by ICU. My concern here is that I feel like the consequences of this API structurally disadvantage particular languages, including one that I speak (Cantonese).

There are a few implicit statements in your response that I want to make explicit, please correct me if you feel I have mischaracterized your statements (so that I may respond to your argument, not a straw man):

  1. That shipping your own data to the execution environment may have a non-trivial cost and should be considered as a reason for precluding custom data loading.
  2. That implementation differences are permissible and should be found acceptable by end-users (developers).
  3. That implementers may elect not to use ICU to implement this, so specifying a dictionary which is used by ICU's BreakIterator might be irrelevant.

These are all reasonable. My responses:

  1. In a server-based environment for example, outside of initial load, shipping your own data could be relatively cheap. Deciding whether or not a particular use case has drawbacks that should preclude its use is in my opinion beyond the scope of 402.
  2. I fully believe that users will file bugs against implementation differences when those implementations return different results for the same input. The entire saga of IE and Date is a fantastic example of how implementation differences resulted in developer-perceived error even though the behavior was fully within spec. (bterlson can provide lots of color here.) This is the primary thing that informs my opinion of "just proxy to ICU" being an acceptable approach.
  3. Any alternative implementation of segmentation is also likely to structurally disadvantage particular languages unless the API that we direct them toward supporting is inclusive. Any implementation that does not ship detailed support for a particular language will also require an API for the developer to provide data of some format to the underlying segmentation code (maybe it's to provide a language model, maybe it's a dictionary, maybe it's parameters for a pretrained language model, or data that can be used to handle transfer learning of a model).

Restating my concern, I worry that stopping the standardization process for this proposal at this current abstraction ("here is something that has a clear mapping to a downstream API we know most—or even all—of you will use to implement this") codifies into 402 (and subsequently 262) structural advantages for more-dominant languages without the escape valves that exist in those downstream APIs to support other languages.

In summary, my objection here is not over a technical issue, but an equity issue. I do not want to codify inequity into our spec.

(I do not believe that any of the other 402 APIs suffer this same problem.)

zbraniecki commented 3 years ago

Worth mentioning, I am starting from the position of (personally) being entirely unopposed to the standardizing of "just proxy to ICU" in 402. I feel that is an honest description of the state of the world and would bring significant clarity to the effort.

On behalf of Mozilla I would be opposed to such an approach. I believe it would be harmful to ECMA-402's health and purpose.

nathanhammond commented 3 years ago

@zbraniecki I am comfortable with opinions opposed to my position regarding approximate standardization of ICU behavior. I don't intend to litigate that here, and absolutely recognize the merits of designing an approach that is independent from ICU. My opinion on 402's relationship to the ICU is a pragmatic one, not an idealist one. From an idealism perspective, the opinion that you represent for Mozilla matches a shared ideal I also hold. Achieving that goal is harder work, and I am supportive of that effort.

My opinions regarding ICU are merely incidental to my underlying concern: the proposed standardization of an API that leaves structural disadvantages in place for some languages.

We can't ignore the practical implications of an API that will clearly delegate to ICU's BreakIterator in many implementations without addressing the ability to prevent inequity in support for all languages.

Given the back and forth about increasing package size on this topic already, shipping additional bundled dictionaries also seems like it is a non-starter, which is why I am approaching this as adding custom dictionaries.

zbraniecki commented 3 years ago

I share your concern, but I don't think that linking ECMA-402 to ICU is a solution.

My opinion on 402's relationship to the ICU is a pragmatic one, not an idealist one.

My opinion also carries a pragmatic quality.

Given the back and forth about increasing package size on this topic already, shipping additional bundled dictionaries also seems like it is a non-starter, which is why I am approaching this as adding custom dictionaries.

I believe there is a promising avenue in ML-driven segmentation models, which may provide better quality at a lower bundle size cost.

nathanhammond commented 3 years ago

I agree that a machine learning approach is likely a good fit. That's approximately filed in #134. I should have a working approach for Cantonese segmentation using tensorflow.js soon.

sffc commented 3 years ago

The Intl.Segmenter constructor takes a locale argument, which implementations should use to tailor segmentation. I therefore don't see how inequity is being codified into this API. For example, the first argument to Intl.Segmenter should differentiate between zh and yue segmentation engines.
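To make the locale-argument point concrete, here is a minimal sketch of requesting word segmentation with different locale hints. The helper name `wordsOf` is only for illustration, and whether an engine actually produces different results for "zh" and "yue" is implementation-defined, so no output is shown.

```javascript
// Return the word-like segments of `text` for the requested locale.
// `wordsOf` is an illustrative helper, not part of the Intl API.
function wordsOf(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  return Array.from(segmenter.segment(text))
    .filter((s) => s.isWordLike) // drop punctuation and whitespace segments
    .map((s) => s.segment);
}

// Same API call, different locale hints. Whether the results are tailored
// per locale is entirely up to the engine.
const zhWords = wordsOf("我們不想吃飯。", "zh");
const yueWords = wordsOf("我哋唔想食飯。", "yue");
console.log(zhWords, yueWords);
```

The locale is only a hint: nothing in this call lets the caller supply data when the engine's built-in tailoring is missing.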

nathanhammond commented 3 years ago

I am concerned that you're not taking into account how this will be implemented by implementations and the consequences of that. I do not believe that we get to wash our hands of that.

ICU's BreakIterator is a (necessarily) incomplete effort, but has an escape valve of being able to provide custom dictionaries. To demonstrate the limitations, let's take as an example one of the most common words in all languages: "we." I will be referring to the dictionary provided by ICU.

That is but one example and I can identify many more (but I hope I won't have to in order to convince you).

If we standardize this API as it stands, many implementations will:

  1. Use BreakIterator.
  2. Provide no method to pass a custom dictionary to BreakIterator.
  3. Fail to successfully segment Cantonese without writing and using an entirely separate implementation.

sffc commented 3 years ago

I am concerned that you're not taking into account how this will be implemented by implementations and the consequences of that. I do not believe that we get to wash our hands of that.

This isn't a problem with the Intl.Segmenter spec, and in ECMA-402 it's not unique to Intl.Segmenter, either. Browsers choose which set of locales to ship, which means that some languages have better support than others. i18n advocates, myself included, are in the midst of a multi-year effort to get locale coverage in browsers to scale, which is one of the motivations behind projects such as ICU4X.

To get better Cantonese segmentation support, the solution is to advocate directly to browser vendors.

ICU's BreakIterator is a (necessarily) incomplete effort, but has an escape valve of being able to provide custom dictionaries.

I disagree with the characterization of ICU's option to use a custom dictionary as an "escape valve". It's a low-level constructor for power users to leverage the BreakIterator code without the ICU built-in data.

Even if we were to add a way to override the ICU dictionary in Intl.Segmenter (an idea that I oppose for reasons Zibi stated earlier), most web sites still won't use it. Power users who really want to override Cantonese segmentation can just ship their own Cantonese segmentation code. However, given enough time (and tireless effort from i18n advocates), this won't be necessary because browsers will be shipping high-quality segmentation engines for all languages by default.

zbraniecki commented 3 years ago

@nathanhammond there's nothing specific to segmentation in your critique. It applies to all I18n APIs, and I believe that a solution in the form of "allow overrides" is a massively complicated and misguided one for the problem at hand.

nathanhammond commented 3 years ago

The Intl.Segmenter API as specified:

  1. Delegates full responsibility for segmentation to the implementer.
  2. Provides no agency to a developer consuming the API.
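As a rough illustration of how small the API surface is, the sketch below inspects what a segmenter reports back after construction. The `dictionary` property is purely hypothetical: it is not an Intl.Segmenter option, and it is included only to show that an unrecognized option is silently ignored rather than used.

```javascript
// `dictionary` is a made-up option for illustration; the constructor only
// reads `localeMatcher` and `granularity`, so it has no effect.
const segmenter = new Intl.Segmenter("yue", {
  granularity: "word",
  dictionary: ["我哋", "唔想", "食飯"], // hypothetical, ignored by the engine
});

// The resolved options expose only the locale and the granularity.
const resolved = segmenter.resolvedOptions();
console.log(Object.keys(resolved), resolved.granularity);
```

In other words, the locale tag and the granularity are the only inputs a developer controls.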

This specification defines default mechanisms; more sophisticated implementations can and should tailor them for particular locales or environments. - UAX #29 Conformance

Why should 402 not expose mechanisms for tailoring segmentation for particular locales or environments?

Reduced to a challenge, as a developer, how do you propose that I handle segmentation for Cantonese or other languages?

Four paths I can identify:

  1. Lobby for the specification to provide agency to the developer in a way that would allow them to inform implementations of special behaviors needed for a language (or context).
  2. If not at a specification level, lobby each implementer individually to choose to include support for every language you can specify a BCP tag for (the only API surface area).
  3. If not at an implementation level, lobby ICU (as the library to which many implementations will delegate) to include better support for more languages.
  4. Otherwise, manually implement segmentation for any language that the developer needs to support.


  1. This is what I'm doing right now. Y'all seem opposed to coming up with an API that would encourage a path to supporting Cantonese at the specification level.
  2. Lobbying individual implementers does not scale.
     a. I haven't convinced you two of the value of trying to find a spec method for supporting Cantonese, why would I expect to have better success at an implementation level? Y'all both are employed by companies that are implementers.
     b. Inconsistency in implementation and language coverage can leave functionality unused since breadth of support is a significant factor in API adoption.
     c. It makes no sense to ship bundled support for every possible language, so at some point the answer will be "no" for some language (even if Cantonese is above the bar, something will land below the cutoff). What does this API do for those languages?
  3. Attempting to change ICU has numerous issues:
     a. Not all implementations will necessarily use ICU. (So we're back to #2.)
     b. ICU already provides a method to use custom dictionaries, and has no particular reason to bundle additional dictionaries.
     c. You've noted that implementations do not ship all of the data that exists in ICU or the CLDR. Even if new data is added into ICU there is no guarantee that it will be included. (So why should it not be optional for the developer to re-supply that data?)

As a JavaScript developer my best option becomes to manually implement support since the API is not reliable for enough languages. In that scenario, shipping this specification provides extremely limited value to me as a developer: it specifies an API shape to target, but that could be aped from ICU anyway.

nathanhammond commented 3 years ago

As to social considerations, there are also a number.

Shipping this API as specified will lead many developers who are unfamiliar with languages to believe that they have a valid segmenter. The API over-promises and (for any foreseeable future) under-delivers. A developer will have no means to know that they're completely failing to segment yue when passing that in as a tag to the API.

I believe that there is a serious equity concern that should be addressed during the consideration of i18n APIs. I do not believe that this one meets the bar. That this may have been unconsidered in previous standardization efforts does not absolve this API from needing to consider it. It can be argued that this API doesn't write inequity into it, but the actual outcomes from standardizing this API will instead simply demonstrate the existing structural disadvantages for minority languages. Standardization without having a clear path to support any arbitrary language communicates to the community of speakers around the world that their language is less deserving of support.

And finally, concretely, the Intl.Segmenter API as delegated to ICU effectively standardizes on Mandarin, a language that has been wielded by those in positions of power explicitly to supplant languages such as Shanghainese, Taiwanese, Cantonese, and more. That this API disproportionately affects languages in this category should give us reason for pause.

nathanhammond commented 3 years ago

there's nothing specific to segmentation in your critique

I proposed a concrete solution (custom dictionaries) to a known limitation that exists in BreakIterator, which many implementations will delegate to. Without that method, any implementations that delegate to ICU will further disadvantage non-Mandarin Sinitic languages (Mandarin is already advantaged by default inclusion, providing no method of parity is not one step worse, it's two). My objection is primarily to barrelling forward while ignoring that very real limitation.

Even anyone that uses some future ML model approach will also need to be able to pass information in (see #134).

This goes back to my pragmatism vs. idealism opinions. I believe that we should specify an API for now, not for the future. Or, we can delay the specification until the future arrives.

sffc commented 3 years ago

First of all, thank you for your advocacy for i18n equity. I think it's a problem that's too often overlooked and doesn't receive the attention it deserves on an organizational level.

I want to make clear that I'm on your side, and Zibi is as well: we both want to see the Web platform support more locales and close the gap between majority and minority languages.

I see now that you're saying Intl.Segmenter is different from other Intl APIs in that it structurally disadvantages Cantonese and other Chinese-script languages that aren't Mandarin, which is an i18n equity problem of unique importance because of the historical, political, and cultural context. This is a good point that I will raise at the next TC39-TG2 meeting.

To answer your specific queries and assertions:

Lobby for the specification to provide agency to the developer in a way that would allow them to inform implementations of special behaviors needed for a language (or context). ... This is what I'm doing right now. Y'all seem opposed to coming up with an API that would encourage a path to supporting Cantonese at the specification level.

All Intl APIs, including Intl.Segmenter, have the following escape hatch:

Intl.Segmenter.supportedLocalesOf(["yue", "zh"])

The expected behavior is that if the developer wants to support "yue", they should call that function, and download a polyfill for "yue" (such as the one you've been working on) if it's not supported by the implementation.
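A minimal sketch of that check-then-polyfill pattern follows. `getSegmenter` and `createFallback` are hypothetical names: `createFallback` stands in for whatever polyfill or custom segmenter the page would load, and is assumed to return an object with the same `segment(text)` method.

```javascript
// Use the built-in segmenter when the engine reports support for `locale`;
// otherwise defer to a caller-supplied fallback (hypothetical polyfill).
function getSegmenter(locale, options, createFallback) {
  const supported = Intl.Segmenter.supportedLocalesOf([locale]);
  if (supported.length > 0) {
    return new Intl.Segmenter(locale, options);
  }
  return createFallback(locale, options);
}

// Example: prefer native "yue" support, fall back to custom code otherwise.
const yueSegmenter = getSegmenter("yue", { granularity: "word" }, () => {
  throw new Error("no fallback segmenter has been loaded in this sketch");
});
```

Note that a non-empty result from `supportedLocalesOf` only says the engine claims the locale; it says nothing about how well that locale is segmented.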

If not at a specification level, lobby each implementer individually to choose to include support for every language you can specify a BCP tag for (the only API surface area). ... a. I haven't convinced you two of the value of trying to find a spec method for supporting Cantonese, why would I expect to have better success at an implementation level? Y'all both are employed by companies that are implementers.

I do think there is value in "trying to find a spec method for supporting Cantonese". I'm saying that we already have it: the locale argument (to hint the implementation) and supportedLocalesOf (to enable polyfillability).

b. Inconsistency in implementation and language coverage can leave functionality unused since breadth of support is a significant factor in API adoption.

Intl doesn't throw exceptions for unsupported locales in large part because we see ourselves as "best-effort i18n" and want to allow implementations to differ in their breadth of support. It's already the case that different browsers support different sets of locales. The developer should pass "yue" into Intl APIs, and the browser will do its best to support your request.
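The "best-effort" behavior can be observed through `resolvedOptions()`. In this sketch, `resolvedSegmenterLocale` is an illustrative helper, and "tlh" is used only as an example of a well-formed tag that an engine may not ship data for; the locale that comes back depends on the engine, so no output is shown.

```javascript
// Report which locale the engine actually resolved for a requested tag.
// A well-formed but unsupported tag does not throw; the engine falls back.
function resolvedSegmenterLocale(requested) {
  const segmenter = new Intl.Segmenter(requested, { granularity: "word" });
  return segmenter.resolvedOptions().locale;
}

console.log(resolvedSegmenterLocale("yue"));
console.log(resolvedSegmenterLocale("tlh"));
```

Only a structurally invalid tag causes a `RangeError`; a valid tag without data silently resolves to something else.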

c. It makes no sense to ship bundled support for every possible language, so at some point the answer will be "no" for some language (even if Cantonese is above the bar, something will land below the cutoff). What does this API do for those languages?

The most likely future is one with "language packs", where we can scale the browser to support hundreds of locales. (The exact delivery mechanism for those language packs is a continuing discussion that hasn't been resolved yet.) I've also been an advocate for introducing async APIs into Intl to allow implementations and polyfills to download new locale data on demand.

As a JavaScript developer my best option becomes to manually implement support since the API is not reliable for enough languages. In that scenario, shipping this specification provides extremely limited value to me as a developer: it specifies an API shape to target, but that could be aped from ICU anyway.

As a JavaScript developer, you should use Intl.Segmenter when it supports the current user's locale, and download an Intl.Segmenter polyfill when it doesn't.

Shipping this API as specified will lead many developers who are unfamiliar with languages to believe that they have a valid segmenter.

You're correct on this point. It's a problem of Intl.Segmenter, and it's also a problem shared by all of Intl. It's the cost-benefit tradeoff of the principle of "best-effort i18n" discussed above.

The API over-promises and (for any foreseeable future) under-delivers.

I hope that the "foreseeable future" turns out to be a fairly short period of time, like several quarters as opposed to several years.

A developer will have no means to know that they're completely failing to segment yue when passing that in as a tag to the API.

The astute developer can use supportedLocalesOf.

I believe that there is a serious equity concern that should be addressed during the consideration of i18n APIs. I do not believe that this one meets the bar. That this may have been unconsidered in previous standardization efforts does not absolve this API from needing to consider it. It can be argued that this API doesn't write inequity into it, but the actual outcomes from standardizing this API will instead simply demonstrate the existing structural disadvantages for minority languages. Standardization without having a clear path to support any arbitrary language communicates to the community of speakers around the world that their language is less deserving of support.

I strongly agree that i18n equity needs more focus and attention.

Intl makes it easy for developers to add i18n support to their web page instead of having the web page be English-only. So, in effect, without Intl, we have "English dominance", and with Intl, we have "Tier 1 language dominance". I argue that's a step in the right direction.

I want to see minority languages be just as well supported as Tier 1 languages, and there are many people in this space who share this desire (attend the Internationalization and Unicode Conference to meet some of them). I just firmly see this as a problem on the implementation side, not on the spec side.

As a side note, I'm personally inspired that you applied the word "equity" to this situation. "i18n equity" is a much more pointed and timely term than other terms used to describe this problem space, such as "long-tail language support" or "next billion users". I intend to start using "i18n equity" when advocating for solutions to this problem with others in my organization.

And finally, concretely, the Intl.Segmenter API as delegated to ICU effectively standardizes on Mandarin, a language that has been wielded by those in positions of power explicitly to supplant languages such as Shanghainese, Taiwanese, Cantonese, and more. That this API disproportionately affects languages in this category should give us reason for pause.

Point taken.

I proposed a concrete solution (custom dictionaries) to a known limitation that exists in BreakIterator, which many implementations will delegate to. Without that method, any implementations that delegate to ICU will further disadvantage non-Mandarin Sinitic languages (Mandarin is already advantaged by default inclusion, providing no method of parity is not one step worse, it's two). My objection is primarily to barrelling forward while ignoring that very real limitation.

Does supportedLocalesOf solve the problem?

Even anyone that uses some future ML model approach will also need to be able to pass information in (see #134).

Can you elaborate?

Note that UTS 35 already defines some Unicode extension subtags that allow for tailoring the segmentation engine, such as "dx", "lb", "lw", and "ss", and more such subtags can be added in the future.

This goes back to my pragmatism vs. idealism opinions. I believe that we should specify an API for now, not for the future. Or, we can delay the specification until the future arrives.

I will raise this point to TC39-TG2.

nathanhammond commented 3 years ago

Informational response, with a further response later, to demonstrate why I consider this impossible to get right with just language tags.

Intl.Segmenter.supportedLocalesOf(["yue", "zh"]), even if it just delegates to ICU with the existing dictionary, could be considered to validly return ["yue", "zh"]. Cantonese is a diglossic language with separate written and spoken forms. It has a written form whose similarity to Mandarin can be remarkably high. They're both connected via "Standard Written Chinese", though many Cantonese (and other Sinitic language) speakers object to the use of "standard", as the written forms can also be distinct enough that they're mutually unintelligible. Those that are mutually unintelligible would not be segmented correctly using the existing dictionary. "Mutually intelligible" is likely best measured by "distance from Mandarin."

Concretely, these are both valid written Cantonese, meaning "We don't want to eat.":

  1. 我哋唔想食飯。(Informal spoken form, directly serialized, which would be used in chat situations, social media, some subtitles, recently written literature, and more. In wide use on the Internet.) Manual segmentation: ["我哋", "唔想", "食飯", "。"]
  2. 我們不想吃飯。(Formal written form, would almost never be spoken outside of an educational setting, valid to use in almost any written scenario, "Standard Written Chinese.") Manual segmentation: ["我們", "不想", "吃飯", "。"]

ICU BreakIterator with the existing dictionary will completely fail to segment the first and will segment the second one correctly because of the relationship to "Standard Written Chinese."
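One way to check this claim on a given engine is to compare its output with the manual segmentations for the two example sentences. This is only a sketch: `compareWithManual` is an illustrative helper, and the result depends on the engine and its bundled data, so no expected output is listed.

```javascript
// Compare the engine's word segmentation of `text` against a hand-made one.
function compareWithManual(text, locale, manual) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const actual = Array.from(segmenter.segment(text), (s) => s.segment);
  const matches =
    actual.length === manual.length &&
    actual.every((seg, i) => seg === manual[i]);
  return { actual, matches };
}

const samples = [
  { text: "我哋唔想食飯。", manual: ["我哋", "唔想", "食飯", "。"] },
  { text: "我們不想吃飯。", manual: ["我們", "不想", "吃飯", "。"] },
];

for (const { text, manual } of samples) {
  console.log(text, compareWithManual(text, "yue", manual));
}
```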

And then it gets harder:

Character Set

Language Family

"Yue", or "粵語", isn't really a language so much as it is a language family of which Cantonese is the best-known member. "广东话" (simplified; "廣東話" in traditional characters) is translated as "Cantonese," but its literal meaning is "spoken language of the Canton region." An approximate analogy for "Yue" would be calling English a "Germanic" language. "Yue" as a classifier would also include many historical languages and other still-in-use languages under its umbrella such as Taishanese (台山話). Taishanese is not only in wide use in Taishan, it is heard on the streets of Chinatown in San Francisco and throughout the world because of emigration patterns.


Cantonese as spoken in Guangdong is easily differentiable from Cantonese as spoken in Hong Kong due to English influence and both past and present colonial history. The lines drawn on a map have resulted in dividing the language development into distinct paths on the opposite sides of the border. Macau adds Portuguese into the mix, again for colonial history reasons.

Historic Encodings

Some digital Chinese uses known-incorrect but homophonic characters because of previous inequity in available character encodings. Addition of characters to Unicode necessarily occurs after a new character has come into use. The latency between creation of a character and insertion into Unicode character tables outside of private use areas of individual fonts requires this workaround. The larger that latency, the more content is produced with these workarounds—sometimes entirely subsuming the "true" character. This can easily mask the true intended word for any segmenter, even though the word itself may appear to be nonsense.

Censorship Circumvention

Many Chinese characters are composed of multiple components. Those components can be exploded into a series of individual characters which, when read by a person who understands the "code," will carry a separate meaning. A non-tailored segmenter shouldn't be expected to understand this, but it should be possible to create a segmenter which can identify many of these occasions (as demonstrated by the success of detecting this method by censors).

Use of homophones is also a pattern used for censorship circumvention.


Because that isn't enough:

So, in order to properly tailor segmentation, at least some portion of this needs to be accounted for. Many of these things can be specified in a language tag, but eventually the degree of specificity required in the language tag becomes extreme, verging on impossible.

I fully believe that segmentation for character-based languages requires the equivalent of a focused Hunspell-like project, a project that I've heard had to be created in order to support spellchecking for Hungarian. It is currently doctoral-thesis level work for linguists.

A dictionary approach is a decent first approximation, and is why a tailored BreakIterator, though imperfect, can still perform remarkably well.

Until I'm done helping my Cantonese instructor with his doctoral thesis (perhaps a Hunspell-like project for Cantonese) Jieba (which also uses a dictionary approach for all but unknown words) will likely be the best option.

All that to say: I do not believe that we can expect to be successful in providing an out-of-the-box solution without providing a tailoring API, whether provision of a dictionary or other methodology. That the state of the art for written character-based languages also uses a dictionary means that we shouldn't assume that we will be able to improve on that without evidence in hand in advance.

FrankYFTang commented 3 years ago

There is a lot of discussion in this thread and, to be honest, I have not read all of the details. I just want to point out several things:

  1. The ECMA402 API itself does not mandate a particular dictionary. In particular, v8 / Chrome will ship a different Khmer dictionary than the standard ICU one. We have done this for years already, and it does not violate the ECMA402 standard.
  2. Extensibility does not always need to be provided by passing a function; it can be done on top of the API. There is no reason a JS developer cannot implement a JS class that mimics the same API, delegates part of the text to the default implementation, and performs different segmentation ON TOP of Intl.Segmenter. For example, a JS developer can add a layer on top that handles only ASCII + Han script by regex, performs its own segmentation for these two scripts, and delegates the processing of other text (such as Thai, Khmer, Lao, or Arabic) to Intl.Segmenter.
  3. The ECMA402 API is not bound to a dictionary approach. For example, I am working on a possible approach that uses an LSTM instead of a dictionary to replace the implementation. I will oppose any API proposal that defines or mandates a particular dictionary format and thereby rejects implementations that choose not to use any dictionary at all.
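The layering idea in point 2 could look roughly like the sketch below. Everything here is an assumption for illustration: the class name `LayeredSegmenter`, the caller-supplied word list, and the greedy longest-match strategy are not part of Intl.Segmenter or ICU, and the wrapper returns a plain array of strings rather than mimicking the full `Segments` interface.

```javascript
// Hypothetical wrapper: Han-script runs are segmented by greedy longest-match
// against a caller-supplied word list; all other text goes to Intl.Segmenter.
class LayeredSegmenter {
  constructor(locale, customWords) {
    this.base = new Intl.Segmenter(locale, { granularity: "word" });
    this.words = new Set(customWords);
    this.maxLen = Math.max(1, ...customWords.map((w) => [...w].length));
  }

  // Unknown characters fall through as single-character segments.
  segmentHan(run) {
    const chars = [...run];
    const out = [];
    let i = 0;
    while (i < chars.length) {
      let matched = chars[i];
      for (let len = Math.min(this.maxLen, chars.length - i); len > 1; len--) {
        const candidate = chars.slice(i, i + len).join("");
        if (this.words.has(candidate)) {
          matched = candidate;
          break;
        }
      }
      out.push(matched);
      i += [...matched].length;
    }
    return out;
  }

  segment(text) {
    const out = [];
    // Split the input into alternating Han / non-Han runs.
    const runs = text.match(/\p{Script=Han}+|[^\p{Script=Han}]+/gu) ?? [];
    for (const run of runs) {
      if (/^\p{Script=Han}/u.test(run)) {
        out.push(...this.segmentHan(run));
      } else {
        for (const { segment } of this.base.segment(run)) out.push(segment);
      }
    }
    return out;
  }
}

const layered = new LayeredSegmenter("yue", ["我哋", "唔想", "食飯"]);
console.log(layered.segment("我哋唔想食飯。"));
```

The cost of this design is that the page has to ship and maintain the word list itself, which is the data-size concern raised earlier in the thread.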

All that to say: I do not believe that we can expect to be successful in providing an out-of-the-box solution without providing a tailoring API, whether provision of a dictionary or other methodology.

Could you be explicit about what you mean by TAILORING? Tailoring where? 1) In JS code, by web developers? 2) In the implementation, by the JS engine developers?

You can do BOTH 1) and 2) now. 1) can be done by adding code on top of Intl.Segmenter; 2) can be done by the engine developer, using ICU or not using ICU.

macchiati commented 3 years ago

I agree with Shane's comments.

The model is to provide the service for the best match to what is available, AND provide a way for developers to query if what they get is what they expect. So any developer can check whether a service is available for yue, or en-AU, etc.

OT: I also agree that 'i18n equity' needs more focus and attention, although I'm not sure if that is the best term. There are a large number of languages (~7,000) with a long tail. And I don't see people ever spending the same amount of work to support, say,, as to support a sizable language like 'yue'. An achievable goal for the 'digitally disadvantaged languages' would be to enable at least display, input, and locale selection on major platforms.

zbraniecki commented 3 years ago

OT: I also agree that 'i18n equity' needs more focus and attention, although I'm not sure if that is the best term. There are a large number of languages (~7,000) with a long tail. And I don't see people ever spending the same amount of work to support, say,, as to support a sizable language like 'yue'. An achievable goal for the 'digitally disadvantaged languages' would be to enable at least display, input, and locale selection on major platforms.

My understanding is that equity is exactly that (compared to equality, which would indicate equal treatment of all locales). The concept of i18n equity would therefore be to apply proportional resources to ensure sufficient coverage and experience for users of all locales, while recognizing that not every locale will get the same amount of attention.

macchiati commented 3 years ago

I think the term equity could be understood in many different ways, and that's why it's a slippery term. Your 'proportional i18n equity' is clearer (though not as pithy).

FrankYFTang commented 3 years ago

  • In a spoken form for Cantonese that would be serialized to "我哋", which you will not find present in the dictionary. In fact, you won't even find that second character (哋) in the dictionary at all.

Nothing in ECMA402 mandates that this cannot happen, or guarantees that it won't happen in the future. Nor is there any reason, based on the text of ECMA402, to prevent ICU, anyone using ICU, or anyone not using ICU from adding such entries to the ICU (or other) dictionary.

FrankYFTang commented 3 years ago

BTW, the reason ICU's dictionary doesn't handle Cantonese is very simple: nobody has tried to, and there has been no strong reason to attempt it yet. Many dialects in China are in a very similar situation to Cantonese, for example Wu (or Shanghainese) and Min Nan (or Taiwanese); none of them has enough written text online to serve as training material. From my point of view, none of them needs a "tailoring" approach; they can simply be addressed by appending to the cj dictionary in ICU if someone bothers to support them. Tailoring is only needed if there is a conflict between two different dialects, but so far the issue is not conflict, just lack of support.

nathanhammond commented 3 years ago

There is a lot of discussion in this thread and be honest I have not read all of the details.

Please, y'all, do take the time to review when you have time. I've spent the time to try and explain my concerns for the committee's consideration. I have a full time job that is not in tech at this point and will respond as I have opportunities. For example, this response will not address all comments since my last, just @FrankYFTang's first one.

The ECMA402 API itself does not mandate a particular dictionary. In particular, v8 / Chrome will ship a different Khmer dictionary than the standard ICU one.

I agree that this is fine, but it is tangential to my concerns. My concern is that we don't have a path for convincing implementations to include Cantonese, or a path for convincing ICU to include Cantonese, or a planned strategy for "language packs" or other just-in-time loading. Stated most generally: "how does this API actually support other languages?" There are already numerous complaints about the size of the data increase just for the current dictionaries (approaching 4 megabytes). Any further extension of ICU to support Taiwanese, Shanghainese, Cantonese, or any other language that should be addressable within the bounds of the existing BreakIterator API would result in further significant increases.

Extensibility not always need to be perform by passing a function, it could be done on top of the API, there are no reason a JS developer cannot implement a JS class to mimic the same API and delegate part of the text to the default implementation and perform different segmentation ON TOP of the Intl.Segmenter.

I would argue that taking this approach makes this not an API, but instead an Interface. All Promises are thenables but not all thenables are Promises. If we're defining an API we should ensure that the API can meet the needs of every language we wish to support. If we're instead defining an Interface, that feels to me like library code—not a language API specification to include in Ecma 402. I'm happy with the Interface as specified, having already implemented it once myself to support the language I speak at home. I'm unhappy with this as an API, because I can't foresee how it would support the language I speak at home.
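
For readers who want to see what "on top of Intl.Segmenter" might mean in practice, here is a minimal sketch. It is not anyone's actual implementation from this thread. The class name `DelegatingSegmenter`, the `customWords` parameter, and the greedy longest-match strategy are all assumptions made for illustration. The idea is to carve out known words first and hand the remaining runs of text to the built-in segmenter.

```javascript
// Hypothetical wrapper: the class name, the `customWords` parameter, and the
// greedy matching strategy are illustrative and not part of ECMA-402.
class DelegatingSegmenter {
  constructor(locale, customWords = []) {
    this.base = new Intl.Segmenter(locale, { granularity: "word" });
    // Drop empty entries and try longer words first so the greedy scan
    // prefers the longest custom match at each position.
    this.words = customWords
      .filter((word) => word.length > 0)
      .sort((a, b) => b.length - a.length);
  }

  // Yields objects shaped like Intl.Segmenter's segment data.
  *segment(input) {
    let runStart = 0; // start of the text not yet handed to the engine
    let pos = 0;
    while (pos < input.length) {
      const word = this.words.find((w) => input.startsWith(w, pos));
      if (word === undefined) {
        pos += 1;
        continue;
      }
      // Delegate everything before the custom word to the built-in segmenter.
      yield* this.delegate(input, runStart, pos);
      yield { segment: word, index: pos, input, isWordLike: true };
      pos += word.length;
      runStart = pos;
    }
    yield* this.delegate(input, runStart, input.length);
  }

  // Segments input[start, end) with the engine and shifts the indices back
  // into the coordinates of the full input string.
  *delegate(input, start, end) {
    if (start === end) return;
    for (const part of this.base.segment(input.slice(start, end))) {
      yield {
        segment: part.segment,
        index: start + part.index,
        input,
        isWordLike: part.isWordLike,
      };
    }
  }
}

const cantoneseAware = new DelegatingSegmenter("zh-HK", ["我哋"]);
const segments = [...cantoneseAware.segment("我哋 hello")];
console.log(segments.map((s) => s.segment));
```

Two caveats apply to this sketch. It matches custom words anywhere in the input, including inside text the engine would otherwise keep as a single word. It also returns a generator rather than a real Segments object, so there is no `containing()` method.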

The ECMA402 API is not bind to a dictionary approach- For example, I am working on a possible approach to use no dictionary but LSTM (see and unicode-org/icu#1529 ) to replace the implementation.

I agree that this specification, as proposed, is not bound to a dictionary approach. But I don't believe that we can ignore how implementations will implement it and the limitations that would impose on consumers of the API.

Given the constraints, we can say pretty concretely that V8 would (at least in V1) provide no method to support Cantonese (as a concrete example).

Further, it sure seems like a lot of people in this thread are looking at alternative implementations that may provide better results across a large number of languages. Given that all of those are open research projects at this point it might be prudent to wait for results from those first so that we're not specifying the next ApplicationCache.

I will oppose any API proposal which define or mandate a particular dictionary format and reject the implementation which chose approaches which not to use any dictionary at all.

My approach has been very focused on attempting to solve for problems that I can identify. I started with a very explicit proposal (custom dictionaries) but am excited to explore any proposed solution that would elegantly support minority languages. At this point I'm not looking to make broad sweeping statements on what we should or should not specify as that unnecessarily constrains our available solution space.

Could you be explicit when you mention TAILORING, TAILORING in where?

  1. in JS code by web developers?
  2. in implementation , by the JS engine developers?

You can do BOTH 1) and 2) now.

  1. can be done by adding code on top of the Intl.Segmenter
  2. can be done by the developer using ICU or not using ICU.

  1. Tailoring within JavaScript is distinctly inequitable. Some languages will be privileged and work out of the box with no additional data to load while others will require a completely separate, community-supported, library-code implementation. I feel like this approach is no longer providing an API for segmentation; it is specifying an Interface for a Segmenter class. If this is the proposed solution, I propose that this go into library code instead of being encoded forever based upon what we can identify in advance.
  2. I'm wary of an approach that would require tailoring by implementers as convincing each implementer individually or targeting upstream library code (ICU) that those implementations delegate to is going to be very hard. (See the feedback on this issue to see exactly how difficult it would likely be to convince everybody.) ICU exposes the ability to provide custom dictionaries in a way that specifically sidesteps my concerns about equity. Though it does not bundle Cantonese in cjdict, it also does not prevent a consumer of ICU from being able to support Cantonese. Consumers of ICU as library code, however, would need to either create something that allows JS developers to punch down to the ICU API for providing custom dictionaries, or provide additional data themselves (which circles back to the data size problem).
  3. I mentioned attempting to handle this at the spec level so that we could (in advance of implementers) consider how we can support arbitrary languages, and enable this proposal to function as an API. Some future non-ICU approach might use a transfer learning model or could have tuning parameters or supply a small training set of additional hyper-specific vocabulary in order to get good segmentation for some languages or niches. We don't have any idea what that might look like at this time. I'm comfortable delaying until we know what to do.

Worth noting, the last meaningful update to cjdict happened in 2012: the removal of CC-CEDICT data. That happened very shortly after the original implementation landed in August 2012.

Also worth noting, the initial goal of supporting character-based languages appears to have begun in August 2002, and landed 10 years later in August 2012. This informs my opinions about designing an API for now, or delaying for research projects to be complete. It also demonstrates the historic inequity that even Mandarin has faced. As such, China had to mandate GB18030 character set support in order to sell software into China—in 2001.

nathanhammond commented 3 years ago

The model is to provide the service for the best match to what is available, AND provide a way for developers to query if what they get is what they expect. So any developer can check whether a service is available for yue, or en-AU, etc.

I tried to explain why I don't believe that scales here.

An achievable goal for the 'digitally disadvantaged languages' would be to enable at least display, input, and locale selection on major platforms. - @macchiati

Cantonese is somewhere around the 15-20th most-spoken language in the world. I'm fully capable of building out a complete set of Cantonese support and at least attempting to land the code in every single possible place. But even if I were to do that, I have no guarantees that any 402 implementation is going to ship the data I need to make that work, or the code to make it happen. And that is for a top-20 language.

Literally the first comment on the Firefox implementation of this proposal: "increases icudt by 3.57MB which makes it kind of unlikely to be approved by release drivers". Not promising for being able to get my implementations released. (At some point the answer is "no," even if the answer is "yes" for Cantonese. What is our design when we have to tell the language ranked 5,002nd by usage, "no, you're not going to be included by default"?)

nathanhammond commented 3 years ago

Since I may have accidentally helped coin a term, my intent behind using the word equity was to encourage setting our target in a way that the following statement is true: "we believe it is possible to achieve high-quality support for your language within this API." I don't mean that just hypothetically, but also taking into account externalities such as binary size limitations, existing library APIs that would be delegated to, BCP-47 expressiveness limitations, and more.

I'm very much not saying that we must have everything addressed for every language, but more that if a language has an individual champion who is willing to take it upon themselves (or as a part of a team, or in a government sponsored effort), the equitable ideal would be that it is possible to provide support equivalent in quality to that which English (as technology's lingua franca) receives. It should not require market-barrier-enforced mandates to achieve equity.

And sure, we will have blind spots, we will get things wrong. But when somebody points out the places where we have failed to consider something and thus failed to meet that bar, we can work with them to figure out how to address those shortcomings.

That each additional language supported comes with a cost to non-users is something I'm well aware of. We should definitely be paying attention to design in that space. (ICU4X has a solid premise.)

FrankYFTang commented 3 years ago

Cantonese is somewhere around the 15-20th most-spoken language in the world.

Not as in "written form", right? For example, there are only 107,213 articles in and that is probably one of the biggest sites where you can find Cantonese web pages. (I would love to know if some other site has more.)

BTW, what do you think the developer will use this for?

sffc commented 3 years ago

Since I may have accidentally helped coin a term, my intent behind using the word equity was to encourage setting our target in a way that the following statement is true: "we believe it is possible to achieve high-quality support for your language within this API." I don't mean that just hypothetically, but also taking into account externalities such as binary size limitations, existing library APIs that would be delegated to, BCP-47 expressiveness limitations, and more.

The payload problem is a major point of discussion in ECMA-402; it's how we arrived at the revised Stage 2 and Stage 3 entrance requirements (see our presentation on this subject from last month here). Intl.Segmenter has gotten this far because browser vendors agreed that the improvement to the web ecosystem was worth the payload. In other words, the cost of adding this feature has already been taken into account. Browser implementors, like Frank and Zibi, are fully aware of the tradeoffs. I'm also confident that they will adopt the community's recommendations on how to improve Cantonese segmentation and pull them into their respective browser engines when they become available.

As far as BCP-47 expressiveness, Unicode locale extensions are constantly evolving. New Unicode extension keywords are frequently added to UTS 35. Segmentation and collation are two key use cases for UTS 35, so if there is something that UTS 35 doesn't support, I'm confident that a proposal for that addition would be taken seriously.

To revisit the original suggestion, which was to add an option to tailor the dictionary: I understand the mental model behind this suggestion, but I haven't heard an answer to the question of why adding such a tailoring option is better than the status quo of checking Intl.Segmenter.supportedLocalesOf and downloading a polyfill if the browser doesn't support the desired language. The existence of the dictionary tailoring option won't cause any substantial number of developers to choose to support Cantonese any more than Intl.Segmenter.supportedLocalesOf would. In other words, adding a dictionary tailoring option doesn't solve the i18n equity problem.
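
A minimal sketch of that status-quo pattern follows, for concreteness. The helper names `needsFallback` and `getWordSegmenter` and the `loadFallback` callback are hypothetical. The callback stands in for whatever polyfill or community library an application would choose to download, and nothing here names a real package.

```javascript
// True when the engine has no Intl.Segmenter or lacks data for `locale`.
function needsFallback(locale) {
  if (typeof Intl !== "object" || typeof Intl.Segmenter !== "function") {
    return true;
  }
  return Intl.Segmenter.supportedLocalesOf([locale]).length === 0;
}

// `loadFallback` is a hypothetical, application-supplied async function,
// for example one that dynamically imports a polyfill or community library.
async function getWordSegmenter(locale, loadFallback) {
  if (needsFallback(locale)) {
    return loadFallback(locale);
  }
  return new Intl.Segmenter(locale, { granularity: "word" });
}
```

With this shape, the cost of the extra download is paid only by users whose engine lacks the locale. The decision to ship a fallback at all still rests with each application developer.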

nathanhammond commented 3 years ago

Cantonese is somewhere around the 15-20th most-spoken language in the world.

Not as in "written form", right? For example, there are only 107,213 articles in and that is probably one of the biggest sites where you can find Cantonese web pages. (I would love to know if some other site has more.)

BTW, what do you think the developer will use this for? - @FrankYFTang

The public websites where I suspect you'll find the most written Cantonese are these two, they're both effectively Reddit clones:

However, the primary use for this is very clearly for private audiences that won't show up in web metrics like page counts:

  1. Facebook and other social networks, where you would communicate with your friends as if you're having a conversation. Using 口語 (literally, "mouth language," spoken language) as a way of reducing formality.
  2. Chat applications. The vast majority of written communication in this world is one-to-one or one-to-few. On my computer, running right now, I have four Electron applications: Discord, Signal, WhatsApp, & (Facebook) Messenger.
  3. Email. Much email composition is done inside a web environment.

Those three environments combine for a tremendous portion of web use by time—maybe even a majority. Every single one of those would be improved by having the ability to accurately segment more languages, Cantonese included. (Facebook literally implemented custom cursor behavior for theirs—it sure would be nice to have reasonable word jumps when authoring.)

Some use cases, enumerated: cursor navigation, spell check, grammar check, teaching tools (one of the big reasons I care), better linebreaking (it's currently awful), improved voice-to-text (a naive version could get much farther), and more. Let me know if I haven't listed enough, I can come back with a longer list to make sure everybody is satisfied—I mean that seriously, not sarcastically or passive aggressively; I'll build the inventory of use cases. Further, simply making some of the pain go away also goes a long way toward increasing the use of a language.

But I also wonder why I have to define these needs for Cantonese where the existence of a use case for any other majority language is assumed. This again demonstrates the bias in favor of more-dominant languages that I've been pointing out in this thread. Beyond that, 402 explicitly didn't constrain itself to Web or even JS environments. So by focusing exclusively on those we are already putting on blinders.

(Aside: @FrankYFTang I am guessing that you might can read characters? 簡體字定係繁體字啊?And yes, I chose to ask in Cantonese intentionally. 😜)

nathanhammond commented 3 years ago

Let me also put some additional color into this thread since today especially I'm emotionally exhausted.

I live in Hong Kong. Today, 47 people who participated in an election where I worked at a polling station were jailed for having a dissenting opinion. On election day (July 11, 2020) I shook the hands or had conversations with many of these people. Now all of them are in jail for the indefinite future. The headline, from Washington Post: With new mass detentions, every prominent Hong Kong activist is either in jail or exile.

Or, more personally, after a little personal protest that I alone staged, these are three of the security guards who paid me a visit in the classroom where I study Cantonese:

(Check out the background of the second picture to see teaching materials pinned to the wall.)

And yet, I'm just an immigrant here, privileged by my skin color, passport country, and status in Hong Kong such that I'm somewhat insulated from the worst consequences that many of my friends and family will suffer.

When you're small and feel helpless against a giant machine, all you want to do is find some way where you may be able to improve the situation, some place where you think you can maybe effect change.

So I'm here, writing in this thread. I'm writing code that can be used to support usage of Cantonese: And I'm trying to inventory the technical things that I could do, participate in, here:

Now it's 2am and I need to sleep. But even tonight I had to do something to be able to feel not-so-helpless. Tomorrow morning I return to my Cantonese coursework at 9:30am. In a month I'll have a degree, but that's not why I'm studying. At about that same time I'll have a daughter, and will begin teaching her Cantonese so that she isn't cut off from the world she comes from. And you can be damn sure that any barrier I can remove to her accessing the culture of her family, friends, and history is something that I will fight to remove. That too is another reason I am here.

So, when I say that I feel like this API is wanting, these are the things that are on my mind. When I say that this API is inequitable, this is why I care.

Y'all are approaching this from a technical perspective which I can respect, but we write code for humans, in service of humans. From my perspective, this API can only serve some humans, and then everybody else is at the mercy of how much the end-developer cares (or knows). The counterproposal repeatedly offered in this thread is to "load a polyfill"—which doesn't use this API so much as replace it, and offload the problem to the end-developer. This makes it self-evident to me that the API doesn't solve the problem, and that this API needs additional consideration.

We need to find a solution to this problem as deep into the priority of constituencies as possible, in order to make it available to as many people as possible.

And that's what I want us to address.

nathanhammond commented 3 years ago

My proposal in this particular thread isn't one I'm particularly in love with, but was concrete enough to serve as a straw man. Don't ask me to defend it too strongly; I'm not willing to do so. But it serves to demonstrate the inequity.

@sffc I will reply to your note later.

sffc commented 3 years ago

Discussion from 2021-03-11 TC39-TG2 meeting:

Procedurally: Seeing that Intl.Segmenter is approaching Stage 4, and is already shipped in Chrome and Safari, we should move discussions like this one to the main ECMA-402 project to serve as a basis for a future proposal. I would be happy to entertain concrete proposals from @nathanhammond on this subject.

nathanhammond commented 3 years ago

HK urged to consider simplified Chinese and Mandarin

Beijing's Ministry of Education on Wednesday suggested Hong Kong clarify the status of simplified Chinese and Mandarin in law, and for students here to learn Mandarin under a system in which the language is incorporated into the local exam system.

FrankYFTang commented 3 years ago

Nathan- Feel free to comment in If you have the legal right to contribute the list of Cantonese words I will work with you to make a prototype for that.