unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org

Stabilize the icu_casemap component #3234

Closed sffc closed 1 year ago

sffc commented 1 year ago

This issue tracks the work to release icu_casemap as a stable component.

Checklist (not exhaustive)

Manishearth commented 1 year ago

Added a checklist, please append more items to it

Manishearth commented 1 year ago

[ ] Move data struct validation into deserialize to allow validation-free databake

@robertbastian I'm not convinced we should be doing the heavy validation in serde. Right now validation runs under debug assertions only, and the struct has GIGO (garbage in, garbage out) behavior if you give it bad data. I think this is the right call, since properly validating this data would otherwise be a lot of work.

There are a couple of places that currently rely on validate and that I need to convert to GIGO.

robertbastian commented 1 year ago

I don't care, as long as we don't do any validation in databake and I can make the constructor const

Manishearth commented 1 year ago

There shouldn't be any right now; if there is, you are welcome to remove it.
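A minimal sketch of the trade-off discussed above, assuming a made-up `ExceptionTable` type (it is not the real icu_casemap data struct): the constructor used by baked data is `const` and does no validation, a separate checked constructor exists for untrusted input, and lookups are GIGO with the invariant checked only under debug assertions.

```rust
/// Hypothetical stand-in for a case mapping data struct (illustration only).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ExceptionTable<'a> {
    /// Must be strictly sorted by key so that lookups can binary search.
    entries: &'a [(u32, u32)],
}

impl<'a> ExceptionTable<'a> {
    /// Validation-free and `const`, so baked data can call it in a `static`.
    pub const fn from_parts_unchecked(entries: &'a [(u32, u32)]) -> Self {
        Self { entries }
    }

    /// Checked path for untrusted input, e.g. called from a deserializer.
    pub fn try_from_parts(entries: &'a [(u32, u32)]) -> Result<Self, &'static str> {
        if entries.windows(2).all(|w| w[0].0 < w[1].0) {
            Ok(Self { entries })
        } else {
            Err("entries must be strictly sorted by key")
        }
    }

    /// GIGO lookup: with unsorted data a release build returns a wrong
    /// answer or `None`; only debug builds check the invariant.
    pub fn get(&self, key: u32) -> Option<u32> {
        debug_assert!(self.entries.windows(2).all(|w| w[0].0 < w[1].0));
        self.entries
            .binary_search_by_key(&key, |e| e.0)
            .ok()
            .map(|i| self.entries[i].1)
    }
}

/// What generated (baked) code could look like under this design.
static BAKED: ExceptionTable<'static> =
    ExceptionTable::from_parts_unchecked(&[(0x41, 0x61), (0x42, 0x62)]);

fn main() {
    assert_eq!(BAKED.get(0x41), Some(0x61));
    assert_eq!(BAKED.get(0x43), None);
    assert!(ExceptionTable::try_from_parts(&[(2, 0), (1, 0)]).is_err());
}
```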

Manishearth commented 1 year ago

Useful note for later: the unfold data currently in use in icu4x

```rust
[("aʾ", "ẚ"), ("ff", "ﬀ"), ("ffi", "ﬃ"), ("ffl", "ﬄ"), ("fi", "ﬁ"), ("fl", "ﬂ"),
 ("h\u{331}", "ẖ"), ("i\u{307}", "İ"), ("j\u{30c}", "ǰ"), ("ss", "ßẞ"), ("st", "ﬅﬆ"),
 ("t\u{308}", "ẗ"), ("w\u{30a}", "ẘ"), ("y\u{30a}", "ẙ"), ("ʼn", "ŉ"),
 ("άι", "ᾴ"), ("ήι", "ῄ"), ("α\u{342}", "ᾶ"), ("α\u{342}ι", "ᾷ"), ("αι", "ᾳᾼ"),
 ("η\u{342}", "ῆ"), ("η\u{342}ι", "ῇ"), ("ηι", "ῃῌ"),
 ("ι\u{308}\u{300}", "ῒ"), ("ι\u{308}\u{301}", "ΐΐ"), ("ι\u{308}\u{342}", "ῗ"), ("ι\u{342}", "ῖ"),
 ("ρ\u{313}", "ῤ"),
 ("υ\u{308}\u{300}", "ῢ"), ("υ\u{308}\u{301}", "ΰΰ"), ("υ\u{308}\u{342}", "ῧ"),
 ("υ\u{313}", "ὐ"), ("υ\u{313}\u{300}", "ὒ"), ("υ\u{313}\u{301}", "ὔ"), ("υ\u{313}\u{342}", "ὖ"), ("υ\u{342}", "ῦ"),
 ("ω\u{342}", "ῶ"), ("ω\u{342}ι", "ῷ"), ("ωι", "ῳῼ"), ("ώι", "ῴ"),
 ("եւ", "և"), ("մե", "ﬔ"), ("մի", "ﬕ"), ("մխ", "ﬗ"), ("մն", "ﬓ"), ("վն", "ﬖ"),
 ("ἀι", "ᾀᾈ"), ("ἁι", "ᾁᾉ"), ("ἂι", "ᾂᾊ"), ("ἃι", "ᾃᾋ"), ("ἄι", "ᾄᾌ"), ("ἅι", "ᾅᾍ"), ("ἆι", "ᾆᾎ"), ("ἇι", "ᾇᾏ"),
 ("ἠι", "ᾐᾘ"), ("ἡι", "ᾑᾙ"), ("ἢι", "ᾒᾚ"), ("ἣι", "ᾓᾛ"), ("ἤι", "ᾔᾜ"), ("ἥι", "ᾕᾝ"), ("ἦι", "ᾖᾞ"), ("ἧι", "ᾗᾟ"),
 ("ὠι", "ᾠᾨ"), ("ὡι", "ᾡᾩ"), ("ὢι", "ᾢᾪ"), ("ὣι", "ᾣᾫ"), ("ὤι", "ᾤᾬ"), ("ὥι", "ᾥᾭ"), ("ὦι", "ᾦᾮ"), ("ὧι", "ᾧᾯ"),
 ("ὰι", "ᾲ"), ("ὴι", "ῂ"), ("ὼι", "ῲ")]
```

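For readers unfamiliar with unfold data: each key is a case-folded string and each value lists the characters whose full case folding equals that key. Below is a rough sketch of how such a table could be queried, for example when computing a case-insensitive closure. The `UNFOLD` constant holds only a handful of entries from the dump above (the ligature values are written as escapes), and the `unfold` helper is hypothetical, not an icu_casemap API.

```rust
/// A few entries from the unfold dump, sorted by key so we can binary search.
const UNFOLD: &[(&str, &str)] = &[
    ("ff", "\u{fb00}"),          // LATIN SMALL LIGATURE FF
    ("ffi", "\u{fb03}"),         // LATIN SMALL LIGATURE FFI
    ("fi", "\u{fb01}"),          // LATIN SMALL LIGATURE FI
    ("ss", "\u{df}\u{1e9e}"),    // sharp s, capital sharp s
    ("st", "\u{fb05}\u{fb06}"),  // long s t ligature, st ligature
];

/// Hypothetical helper: yields every character whose full case folding is
/// exactly `folded`, or nothing when the string is not a key in the table.
fn unfold(folded: &str) -> std::str::Chars<'static> {
    let hit = UNFOLD.binary_search_by(|&(key, _)| key.cmp(folded)).ok();
    hit.map(|i| UNFOLD[i].1).unwrap_or("").chars()
}

fn main() {
    let sharp: Vec<char> = unfold("ss").collect();
    assert_eq!(sharp, vec!['\u{df}', '\u{1e9e}']);
    assert_eq!(unfold("xyz").count(), 0);
}
```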
Manishearth commented 1 year ago

Should https://github.com/unicode-org/icu4x/issues/3552 be a stabilization blocker? Feels like we can count it as a known bug.

Manishearth commented 1 year ago

Also, ICU4C supports str.toTitle(locale, options). Should we as well? Currently we do not support options; we can add them as an API later (to_upper_with_options() or something).

There is also str.caseCompare(str2), which also takes options.

Manishearth commented 1 year ago

The toTitle() functions need a break iterator: we need to add APIs which:

I also think this can be designed in a later release.

Manishearth commented 1 year ago

question (@sffc , @robertbastian ): Do the current function names look good to you?

I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

Manishearth commented 1 year ago

A change we should make is to move the locale from the constructor to the methods. We can make them all take CaseMappingLocale and have people call .into() or Default::default(). We could also give them _with_locale() versions, but that might lead to a large proliferation of methods.

And how should it look over FFI, where locales are not free to instantiate? I guess we can expose that enum.

Thoughts?

eggrobin commented 1 year ago

Do the current function names look good to you?

  • to_uppercase(char) -> char
  • to_full_uppercase(&str) -> Writeable
  • to_full_uppercase_string(&str) -> String

I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

The simple mappings and foldings (which you should practically never use unless you have specific compatibility requirements) having shorter and more default-looking names than the default ones seems like a bad idea. I would favour renaming all of the char-to-char functions to have simple in the name, this is a well-established term in this context.
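To illustrate why the char-to-char form cannot be the default: full uppercasing may expand one character into several. The sketch below uses only the Rust standard library (not icu_casemap), whose `char::to_uppercase` returns an iterator for exactly this reason; `full_uppercase` is a helper defined here for the example.

```rust
/// Collects the full (possibly multi-character) uppercase mapping of `c`,
/// using the standard library's locale-independent mappings.
fn full_uppercase(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // One input character, two output characters: no `char -> char`
    // signature can express this result.
    assert_eq!(full_uppercase('ß'), "SS");
    // The "fi" ligature (U+FB01) also expands.
    assert_eq!(full_uppercase('\u{fb01}'), "FI");
    // Most characters map one-to-one.
    assert_eq!(full_uppercase('a'), "A");
    // String-level uppercasing applies the full mappings.
    assert_eq!("straße".to_uppercase(), "STRASSE");
}
```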

A change we should make is move the locale from the constructor to the methods.

Something like that would be a good idea. It is very weird that the case foldings look like they depend on the locale, which they do not and must not.
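As a sketch of what a locale-free folding signature could look like: Unicode case folding has a single tailoring, the Turkic treatment of dotted and dotless I, so a boolean (or two-variant enum) is enough. The `fold_char` function below is hypothetical, and it uses the standard library's lowercasing as a rough stand-in for real folding data, which differs from lowercasing for some characters.

```rust
/// Hypothetical folding helper. `turkic` selects the Turkic mappings for
/// 'I' and 'İ'; every other character ignores it.
fn fold_char(c: char, turkic: bool) -> String {
    match (c, turkic) {
        // Turkic: capital I folds to dotless ı (U+0131).
        ('I', true) => "\u{131}".to_string(),
        // Turkic: capital dotted İ (U+0130) folds to plain i.
        ('\u{130}', true) => "i".to_string(),
        // Stand-in for the default folding data (assumption: lowercasing
        // is close enough for this illustration).
        _ => c.to_lowercase().collect(),
    }
}

fn main() {
    assert_eq!(fold_char('I', false), "i");
    assert_eq!(fold_char('I', true), "\u{131}");
    assert_eq!(fold_char('\u{130}', false), "i\u{307}");
    assert_eq!(fold_char('\u{130}', true), "i");
}
```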

sffc commented 1 year ago

Maybe to_uppercase(&str) -> Writeable, to_uppercase_string(&str) -> String, to_uppercase_char(char) -> char ?

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

robertbastian commented 1 year ago

We might actually want to accept W: Writeable like list formatting does already.

Manishearth commented 1 year ago

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

It's what's already used internally, it's a simple enum:

pub enum CaseMapLocale {
    Root,
    Turkish,
    Lithuanian,
    Greek,
    Dutch,
    Armenian,
}

and it basically covers the different casemapping special-case modes. ICU4C's API consumes a Locale; but it does feel potentially faster to not require a conversion each time, and exposing something that is From<Locale> seems fine.

This way we only require the actual subpart of the locale the algorithm cares about, instead of having clients treat it opaquely.
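A rough sketch of the "locale on the method, converted via `.into()`" shape discussed here. To stay self-contained it converts from a bare language code (`&str`) rather than from ICU4X's `Locale`; that conversion, the `to_uppercase` free function, and the tiny Turkish special case are all assumptions made for illustration, not the actual icu_casemap API.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum CaseMapLocale {
    #[default]
    Root,
    Turkish,
    Lithuanian,
    Greek,
    Dutch,
    Armenian,
}

/// Hypothetical conversion from a language code; a real implementation
/// would convert from a locale type instead.
impl From<&str> for CaseMapLocale {
    fn from(lang: &str) -> Self {
        match lang {
            // Azerbaijani shares the Turkic dotted/dotless I rules.
            "tr" | "az" => Self::Turkish,
            "lt" => Self::Lithuanian,
            "el" => Self::Greek,
            "nl" => Self::Dutch,
            "hy" => Self::Armenian,
            _ => Self::Root,
        }
    }
}

/// The locale is a per-call argument, so one mapper can serve all locales.
/// Only the Turkic i -> İ rule is implemented in this sketch.
pub fn to_uppercase(src: &str, locale: impl Into<CaseMapLocale>) -> String {
    match locale.into() {
        CaseMapLocale::Turkish => src
            .chars()
            .map(|c| {
                if c == 'i' {
                    "\u{130}".to_string()
                } else {
                    c.to_uppercase().collect::<String>()
                }
            })
            .collect(),
        _ => src.to_uppercase(),
    }
}

fn main() {
    assert_eq!(to_uppercase("izmir", "tr"), "\u{130}ZM\u{130}R");
    assert_eq!(to_uppercase("izmir", CaseMapLocale::default()), "IZMIR");
}
```

Callers who already hold the enum pay no conversion cost, and over FFI the enum itself could be exposed in place of a locale object.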

Manishearth commented 1 year ago

Copying over Markus' feedback from the ICU4X team meeting notes

  • What's the difference between try_new and try_new_with_locale?
  • Give examples for all the functions, especially to_titlecase(char)->char (dz -> Dz)
  • Do we support SpecialCasing.txt?
  • There are some things that are only in ICU and not in data. Example: Greek uppercasing, which drops most but not all accents, but with certain exceptions. It's not representable in data, so it is coded manually.
  • Long term we want the Edits API. Clients like Chrome need it.
  • People will want full titlecase, but it requires segmentation. Maybe make it pluggable with a trait. A default implementation could be to titlecase only the first character. When titlecasing, sometimes you want to leave interior characters alone, and sometimes you want to convert them to lowercase.
  • Take a look at the ICU4J CaseMap API!
  • In ICU, the locale-special data are stored in the same trie as root data, so we can specify the locale later.
  • For case folding, the locale doesn't matter, only turkic and non-turkic. Right now it's confusing because it looks like the locale could influence case folding.

I do think I'm going to go along the path of doing titlecase with a segmentation trait.
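One possible shape for titlecasing behind a segmentation trait, as a sketch only. The trait, the two segmenters, and the `Trailing` option are all hypothetical names. The default `WholeString` segmenter titlecases only the first character, as suggested in the feedback above. The standard library has no titlecase mapping, so uppercase is used as a stand-in; the two differ for digraphs such as dz, where titlecase gives Dz and uppercase gives DZ.

```rust
/// Hypothetical pluggable segmentation: splits text into the pieces whose
/// first character should be titlecased.
pub trait TitlecaseSegmenter {
    fn segments<'a>(&self, text: &'a str) -> Vec<&'a str>;
}

/// Default: the whole string is one segment (only the first char changes).
pub struct WholeString;
impl TitlecaseSegmenter for WholeString {
    fn segments<'a>(&self, text: &'a str) -> Vec<&'a str> {
        vec![text]
    }
}

/// Naive word segmentation on whitespace; a real one would use a
/// break iterator.
pub struct WhitespaceSplit;
impl TitlecaseSegmenter for WhitespaceSplit {
    fn segments<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_inclusive(char::is_whitespace).collect()
    }
}

/// What to do with the characters after the first one in each segment.
#[derive(Clone, Copy)]
pub enum Trailing {
    Unchanged,
    Lowercase,
}

pub fn to_titlecase(text: &str, seg: &impl TitlecaseSegmenter, trailing: Trailing) -> String {
    let mut out = String::with_capacity(text.len());
    for segment in seg.segments(text) {
        let mut chars = segment.chars();
        if let Some(first) = chars.next() {
            // Uppercase stands in for a true titlecase mapping.
            out.extend(first.to_uppercase());
            let rest = chars.as_str();
            match trailing {
                Trailing::Unchanged => out.push_str(rest),
                Trailing::Lowercase => out.push_str(&rest.to_lowercase()),
            }
        }
    }
    out
}

fn main() {
    let s = "hello wORLD";
    assert_eq!(to_titlecase(s, &WholeString, Trailing::Lowercase), "Hello world");
    assert_eq!(to_titlecase(s, &WhitespaceSplit, Trailing::Lowercase), "Hello World");
    assert_eq!(to_titlecase(s, &WhitespaceSplit, Trailing::Unchanged), "Hello WORLD");
}
```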

sffc commented 1 year ago

If I had to choose between full-string titlecase and Greek uppercase, I think Greek uppercase is more important since it is about i18n correctness.

The Locale thing reminds me a lot of what happens with Collation and Segmentation tailorings.

Manishearth commented 1 year ago

I've copied the actionable bits of Markus' feedback, as well as stuff discussed here, into the issue above.

I'm not sure if it's either-or: full-string titlecase isn't that tricky, whereas Greek uppercase seems to involve reimplementing half of the uppercasing algorithm, without any spec for reference. It's a lot more work.

Manishearth commented 1 year ago

Discussions to have:

sffc commented 1 year ago

Discuss with:

Manishearth commented 1 year ago

Discussion:

CaseMapLocale:

Titlecasing:

Simple case mapping:

Greek uppercasing:

Agreed: @Manishearth @sffc @robertbastian @eggrobin

Manishearth commented 1 year ago

Graduation checklist

Manishearth commented 1 year ago

A bunch of the ticked-off items above are fixed in https://github.com/unicode-org/icu4x/pull/3689

@sffc I think there are a couple entries in the checklist that may benefit from a 15 minute pair discussion where we verify it together. Specifically: the style guide naming part, i18n correctness, and data struct design

Manishearth commented 1 year ago

The only remaining stabilization blocker is that Shane (or someone else) and I should go through everything together.

(and then move the folder)

Manishearth commented 1 year ago
sffc commented 1 year ago

I added the new checkboxes from #3693 to the comment above.

A few things I notice:

Manishearth commented 1 year ago

This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"

Yeah I was planning to do that later. But I'll just roll it into #3803

Manishearth commented 1 year ago

I think I have handled every box on this page except for https://github.com/unicode-org/icu4x/issues/3801 in https://github.com/unicode-org/icu4x/pull/3803

(even if not, I would prefer to land that rather than require that PR fix everything)

Manishearth commented 1 year ago

Final checkbox checked by https://github.com/unicode-org/icu4x/pull/3843

We're done!