unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.32k stars 170 forks source link

Should the Segmenter types accept a locale? #3284

Open sffc opened 1 year ago

sffc commented 1 year ago

In the API review, @markusicu pointed out that ICU takes a locale in the segmenter, and the locale affects the behavior in certain cases, such as those in the data files below:

Why don't we support these in ICU4X Segmenter, and should we add them?

For 1.2 purposes, we have a few choices:

  1. Add the locale parameter now and don't use it for anything yet
  2. Don't add the locale parameter but add something like _invariant to the constructor names, so that in the future try_new_auto_invariant() creates the locale-invariant segmenter and try_new_auto(locale!("el")) creates the locale-specific segmenter
  3. Keep things the way they are and add locale constructors later, possibly adopting the style above in 2.0
  4. Add the parameter to Word and Sentence, but not Line or Grapheme

Thoughts?

@aethanyc @makotokato @Manishearth

Manishearth commented 1 year ago

General preference for #3

Do we plan to provide these locale-ish APIs in the near term? I actually think future try_new_auto_with_locale() would be fine

zbraniecki commented 1 year ago

General preference for #2

I'd prefer the default constructor names to be consistent in behavior as much as makes sense. If we believe Segmenter constructors will want to take locale just like all others, lets keep the names for those constructors. If, in the future, we decide to not add those names, we can always alias the default constructor names to the _invariant ones.

makotokato commented 1 year ago

Why this ICU4C rule isn't merged/requested to UAX#29? Does ICU4C have a plan to file/merge an issue to UAX#29? After merging this change to UAX#29, then #3.

Manishearth commented 1 year ago

UAX #29 in general doesn't really want to include locale-specific stuff because it wants to leave that up to CLDR.

sffc commented 1 year ago

Suffix suggestions:

macchiati commented 1 year ago

Since we will want to take locales as parameters (even for segmenters where that isn't implemented yet), IMO we should make that the "normal" case.

sffc commented 1 year ago

From discussion with @aethanyc @makotokato @Manishearth @nordzilla: It is an enhancement to consume CLDR root.xml tailorings, but not necessarily a bug. We would like to see it done in a timely fashion.

Conclusion: use _invariant

robertbastian commented 1 year ago

For the record we didn't use _invariant in #3294

hsivonen commented 11 months ago

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

fi_sv override for word break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/word_fi_sv.txt

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

el override for sentence break: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/rules/sent_el.txt

This seems to be about ASCII semicolon having sentence-ending question mark semantics. Could this be accommodated in the root by triggering the rule on the most recent letter being from the Greek script?

robertbastian commented 7 months ago

It's a bit sad that treating letter, colon, letter as having a word break opportunity after the colon is a case of giving computer syntax needs precedence over natural-language needs. If accommodating computer syntaxes wasn't given priority, the Finnish/Swedish requirement of not treating letter, colon, letter as containing a word break opportunity could be hoisted to root.

I think German needs this tailoring as well. I don't know why Finnish and Swedish do, but in German a colon is commonly used to form gender-neutral nouns, like Lehrer:in, which should not contain any word breaks.

What's the current process for updating the tailorings? ICU or CLDR?

Manishearth commented 7 months ago

@robertbastian this was recently discussed in the CLDR design meeting, CLDR issue: https://unicode-org.atlassian.net/browse/CLDR-15910 / PAG issue: https://github.com/unicode-org/properties/issues/187 (internal)

There's thought that this should actually be made to apply to all languages, since colons without spaces on either side are not really a thing in regular text anyway, and if the space has been removed there's a good chance it's on purpose.

srl295 commented 7 months ago

Data bundles that contain some language-specific data for sentence segmentation: https://github.com/unicode-org/icu/tree/main/icu4c/source/data/brkitr

These seem to be lists of abbreviations that contain a period that doesn't end a sentence. How bad would it be to merge the lists and use the merged lists across languages?

That's exactly what they are. https://www.unicode.org/reports/tr35/tr35-general.html#Segmentation_Exceptions for more details.

The lists for one language may not be applicable for others. But you could probably calculate a list that's likely to be generally useful, it might be less useful for any particular language.

hsivonen commented 7 months ago

I think German needs this tailoring as well.

I think https://unicode-org.atlassian.net/browse/CLDR-15910 should be reverted on the root level so that we don't need tailorings to accommodate natural languages.

I don't know why Finnish and Swedish do

For Finnish, the use case is marking where a sufficiently unusual word body (e.g. acronym) ends and the case suffix starts. For example, English Henri’s would be Henrin in Finnish but English ICU4X’s would be ICU4X:n in Finnish. The use case for Swedish seems to be also about applying suffixes (though not case suffixes) to sufficiently unusual word bodies. (Consider an analog English Londoner but with with the suffix applied to e.g. a sports team acronym.)

sffc commented 7 months ago

Okay, for 2.0 purposes, which of the four segmenters requires a locale parameter?

  1. Grapheme: Are there locale-specific CLDR tailorings for graphemes?
  2. Word: It sounds like people want to move the fi_sv tailorings to the root, which would obviate the need for RBBI tailoring. However, locale info could still help with complex language segmentation, although we need to know the language of the text, not of the user.
  3. Sentence: Seems like this is the biggest use case, although it is still about the language of the text and not of the user.
  4. Line: I think this one is invariant.

The "language of the text" would be more appropriate to provide in the terminal segment function since it is an attribute of the text, but since that requires data loading, it might be more appropriate to specify it in the constructor. Alternatively, we could stuff all sentence language tailorings into a single data key which is always loaded when making a sentence segmenter, as we do for the word break segmenter.

srl295 commented 6 months ago

Grapheme: Are there locale-specific CLDR tailorings for graphemes?

Not yet. Please put it into the API. I was doing planning on a work item to move this forward. This is for example languages that want to keep "ch" together etc.

hsivonen commented 6 months ago

Grapheme

Please put it into the API.

On the flip side, putting this in the API really requires making ECMA-402 have a way to explicitly ask for root and to default to root.

Some users getting a different definition of extended grapheme clusters based on the browser UI locale would likely be bad, after developers having assumed for years that extended grapheme clusters are a Unicode-level concept and not a locale-level concept. Also, it would be bad to have to assume that English is always going to be the untailored language and to teach every developer to ask for a grapheme segmenter for English in order to get behavior on a similar level of stability that one would expect of e.g Swift strings.

This is for example languages that want to keep "ch" together etc.

What languages do you mean and why do they want to keep "ch" together for the kind of purposes that extended grapheme clusters are used for, such as denying the selection of only "c" or only "h"? Czech treats "ch" as a collation unit, but do users of the language expect not to be able to select "c" and "h" individually?

zbraniecki commented 6 months ago

We (poles and chechs) expect to be able to select each letter separately but would really prefer not to break lines between those two letters when they're next to each other.

srl295 commented 6 months ago

also generally speaking not having a locale option is likely to disproportionately cause challenges for minority users of the script. Yes, you might seem to have a common solution for majority users, but then there isn't a customization option if something goes against the majority use. sss-Thai line breaking is an example.

sffc commented 6 months ago

Discuss with:

Optional:

sffc commented 5 months ago

https://developer.mozilla.org/en-US/docs/Web/HTML/Global_attributes/lang

https://github.com/unicode-org/cldr/blob/main/common/segments/ja.xml

Initial recommendations:

  1. No language parameter for grapheme cluster segmenter

LGTM: @sffc @eggrobin @hsivonen @Manishearth

  1. Language parameter for the other three segmenters should be optional parameter, potentially on a separate function, well-documented and geared toward power users who can ensure the quality of the hint

LGTM: @sffc @hsivonen @eggrobin @Manishearth

  1. Plan to put the parameter on the .segment() function, dependent on investigation of the call sites in order to optimize data slicing.

LGTM: @sffc @hsivonen @Manishearth

Manishearth commented 5 months ago

colloquial Cantonese

Henri later provided me context: https://github.com/tc39/proposal-intl-segmenter/issues/133#issuecomment-779777282

Since "colloquial cantonese" is ambiguous, I'll expand on this a bit: the distinction being talked about is "standard written Chinese" (which can be used to write spoken Cantonese but uses a lot more Mandarin vocabulary — it's complicated), vs "written Cantonese", which is more of a 1-1 mapping to spoken Cantonese and may use Cantonese-specific words and characters.

It's also sometimes called "written vernacular Cantonese" however this can be somewhat ambiguous since Written Vernacular Chinese is something else.

markusicu commented 5 months ago

CLDR has decided that they're putting the : thing back into root.

Yes, but the Apple rep in the meeting, who is originally from Sweden, insisted that CLDR keep the fi/sv word break tailorings, because he thinks that even the future keyword selecting technical usage should keep fi/sv words together across colon.

No language parameter for grapheme cluster segmenter

+1

Language parameter for the other three segmenters

+1

Plan to put the parameter on the .segment() function

That seems weird both from looking at usage and thinking about data loading.

sffc commented 4 months ago

The conclusions from the discussion of this issue with the CLDR design group:

My suggested path forward for this issue, then, is to add an options bag to the WordSegmenter, LineSegmenter, and SentenceSegmenter constructors with an optional content_locale field of type &LanguageIdentifier.

sffc commented 4 months ago

I'm moving this back into 1.5 because the constructor can be drafted and bikeshed ahead of time, and then in 2.0 we can do the minimal change of making the new constructor the default one.

srl295 commented 4 months ago

Grapheme clusters should not be language-specific; baked into much low-level processing (e.g., Swift, font mappings) which we don’t want to be language-specific

This makes no sense and contradicts the long standing requests. ( https://unicode-org.atlassian.net/browse/CLDR-2992 which I am working on scheduling ) I would have joined, did not realize this was coming up today.

Perusing the notes it's not clear that the previous requirements and recent discussion from the segmentation summary last year were included here.

sffc commented 3 months ago

Based on additional discussion in the email thread, I would like to move forward with the recommendation in https://github.com/unicode-org/icu4x/issues/3284#issuecomment-2030731133, with the additional understanding that we may add support for locale-based grapheme segmentation in the future if CLDR adds data for this, but it might take the form of another (fifth) segmenter type.

Concretely:

makotokato commented 1 month ago

When looking ICU4C brkiter rule files for word and sentence, UAX#29's property of this isn't same each locale. But rules seem to be same. So if we modify datagen (with a few changes of toml data file), we can generate rules data per locale.