Stabilize the icu_casemap component

sffc commented 1 year ago

This issue tracks the work to release icu_casemap as a stable component.

Checklist (not exhaustive)

[x] Fix data model (https://github.com/unicode-org/icu4x/issues/3481)
[x] Code organization (https://github.com/unicode-org/icu4x/pull/3528)
[x] Expose case closure methods (https://github.com/unicode-org/icu4x/pull/3544)
[x] full_fold() functions should work with Writeable (https://github.com/unicode-org/icu4x/pull/3544)
[x] Add tests. Ideally upstream conformance tests (https://github.com/unicode-org/icu4x/pull/3553)
[x] Add docs (https://github.com/unicode-org/icu4x/pull/3553)
[x] Enable boilerplate lints (https://github.com/unicode-org/icu4x/pull/3544)
[x] Make no_std (https://github.com/unicode-org/icu4x/pull/3544)
[x] FFI
- [x] basic FFI https://github.com/unicode-org/icu4x/pull/3663
- [x] Add CodePointSetBuilder rather than CPILBuilder over FFI https://github.com/unicode-org/icu4x/pull/3679
- [x] FFI tests https://github.com/unicode-org/icu4x/pull/3679
[x] Move data struct validation into deserialize to allow validation-free databake
[x] Review error types, most variants seem to be datagen only
[x] Check if we support SpecialCasing.txt (https://github.com/unicode-org/icu4x/pull/3583)
[x] Make string APIs more prominent than char APIs (https://github.com/unicode-org/icu4x/pull/3584)
[x] Move locale to methods (except for fold/fold_turkic, https://github.com/unicode-org/icu4x/pull/3584)
[x] Add titlecase_segment() (https://github.com/unicode-org/icu4x/pull/3593)
[x] Handle modifiers when performing contextual operations
[x] https://github.com/unicode-org/icu4x/pull/3616
[x] Figure out Greek accent uppercasing (https://github.com/unicode-org/icu4x/issues/3552)
[x] Also go through graduation.md (checklist in https://github.com/unicode-org/icu4x/issues/3234#issuecomment-1636767460)
- [x] https://github.com/unicode-org/icu4x/pull/3689
[x] Maybe use more UnvalidatedStr in the data model (https://github.com/unicode-org/icu4x/pull/3690)
[x] Points from Markus' second review
- [x] Decoding should debug_assert on garbage (https://github.com/unicode-org/icu4x/pull/3759)
- [x] split out separate TitlecaseMapper<impl AsRef<CaseMapper>> type, ctors: new(), new_legacy(), and then the same ones _with_mapper https://github.com/unicode-org/icu4x/pull/3779
- [x] titlecase head adjustment support, with TitleCaseOptions, fields HeadAdjustment { #[default] Adjust, NoAdjust }, TailCasing { #[default] Lowercase, PreserveCase } https://github.com/unicode-org/icu4x/pull/3779
- [x] titlecase_segment_legacy() on CaseMapper if wanted https://github.com/unicode-org/icu4x/pull/3779
- [x] CaseCloser wrapper for case closure things, similar ctors (https://github.com/unicode-org/icu4x/pull/3759)
[x] https://github.com/unicode-org/icu4x/issues/3801

Manishearth commented 1 year ago

Added a checklist, please append more items to it

Manishearth commented 1 year ago

[ ] Move data struct validation into deserialize to allow validation-free databake

@robertbastian i'm not convinced we should be doing the heavy validation in serde: it's debug-assertions only, and the struct has GIGO behavior if you give it bad data. I think this is the right call, since there's a lot of work in properly validating this data otherwise.

There are a couple places that are currently relying on validate that I need to GIGO

robertbastian commented 1 year ago

I don't care, as long as we don't do any validation in databake and I can make the constructor const

Manishearth commented 1 year ago

There shouldn't be right now, if there is you are welcome to remove it

Manishearth commented 1 year ago

Useful note for later: the unfold data currently in use in icu4x

```rust [("aʾ", "ẚ"), ("ff", "ﬀ"), ("ffi", "ﬃ"), ("ffl", "ﬄ"), ("fi", "ﬁ"), ("fl", "ﬂ"), ("h\u{331}", "ẖ"), ("i\u{307}", "İ"), ("j\u{30c}", "ǰ"), ("ss", "ßẞ"), ("st", "ﬅﬆ"), ("t\u{308}", "ẗ"), ("w\u{30a}", "ẘ"), ("y\u{30a}", "ẙ"), ("ʼn", "ŉ"), ("άι", "ᾴ"), ("ήι", "ῄ"), ("α\u{342}", "ᾶ"), ("α\u{342}ι", "ᾷ"), ("αι", "ᾳᾼ"), ("η\u{342}", "ῆ"), ("η\u{342}ι", "ῇ"), ("ηι", "ῃῌ"), ("ι\u{308}\u{300}", "ῒ"), ("ι\u{308}\u{301}", "ΐΐ"), ("ι\u{308}\u{342}", "ῗ"), ("ι\u{342}", "ῖ"), ("ρ\u{313}", "ῤ"), ("υ\u{308}\u{300}", "ῢ"), ("υ\u{308}\u{301}", "ΰΰ"), ("υ\u{308}\u{342}", "ῧ"), ("υ\u{313}", "ὐ"), ("υ\u{313}\u{300}", "ὒ"), ("υ\u{313}\u{301}", "ὔ"), ("υ\u{313}\u{342}", "ὖ"), ("υ\u{342}", "ῦ"), ("ω\u{342}", "ῶ"), ("ω\u{342}ι", "ῷ"), ("ωι", "ῳῼ"), ("ώι", "ῴ"), ("եւ", "և"), ("մե", "ﬔ"), ("մի", "ﬕ"), ("մխ", "ﬗ"), ("մն", "ﬓ"), ("վն", "ﬖ"), ("ἀι", "ᾀᾈ"), ("ἁι", "ᾁᾉ"), ("ἂι", "ᾂᾊ"), ("ἃι", "ᾃᾋ"), ("ἄι", "ᾄᾌ"), ("ἅι", "ᾅᾍ"), ("ἆι", "ᾆᾎ"), ("ἇι", "ᾇᾏ"), ("ἠι", "ᾐᾘ"), ("ἡι", "ᾑᾙ"), ("ἢι", "ᾒᾚ"), ("ἣι", "ᾓᾛ"), ("ἤι", "ᾔᾜ"), ("ἥι", "ᾕᾝ"), ("ἦι", "ᾖᾞ"), ("ἧι", "ᾗᾟ"), ("ὠι", "ᾠᾨ"), ("ὡι", "ᾡᾩ"), ("ὢι", "ᾢᾪ"), ("ὣι", "ᾣᾫ"), ("ὤι", "ᾤᾬ"), ("ὥι", "ᾥᾭ"), ("ὦι", "ᾦᾮ"), ("ὧι", "ᾧᾯ"), ("ὰι", "ᾲ"), ("ὴι", "ῂ"), ("ὼι", "ῲ")] ```

Manishearth commented 1 year ago

Should https://github.com/unicode-org/icu4x/issues/3552 be a stabilization blocker? Feels like we can count it as a known bug.

Manishearth commented 1 year ago

Also, ICU4C supports str.toTitle(locale, options). Should we as well? Currently we do not support options, we can add them as an API later (to_upper_with_options() or something)

Also str.caseCompare(str2), whcih also takes options

Manishearth commented 1 year ago

The toTitle() functions need a break iterator: we need to add APIs which:

Instantiate a default word break iterator
Allow passing in a word break iterator

I also think this can be designed in a later release.

Manishearth commented 1 year ago

question (@sffc , @robertbastian ): Do the current function names look good to you?

to_uppercase(char) -> char
to_full_uppercase(&str) -> Writeable
to_full_uppercase_string(&str) -> String

I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

Manishearth commented 1 year ago

A change we should make is move the locale from the constructor to the methods. We can make them all take CaseMappingLocale and have people call .into() or Default::default(). We could also give them _with_locale() versions but that might lead to a large proliferation.

And how should it look over FFI (where locales are not free to instantiate). I guess we can expose that enum.

Thoughts?

eggrobin commented 1 year ago

Do the current function names look good to you?

to_uppercase(char) -> char to_full_uppercase(&str) -> Writeable to_full_uppercase_string(&str) -> String I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()

The simple mappings and foldings (which you should practically never use unless you have specific compatibility requirements) having shorter and more default-looking names than the default ones seems like a bad idea. I would favour renaming all of the char-to-char functions to have simple in the name, this is a well-established term in this context.

A change we should make is move the locale from the constructor to the methods.

Something like that would be a good idea. It is very weird that the case foldings look like they depend on the locale, which they do not and must not.

sffc commented 1 year ago

Maybe to_uppercase(&str) -> Writeable, to_uppercase_string(&str) -> String, to_uppercase_char(char) -> char ?

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

robertbastian commented 1 year ago

We might actually want to accept W: Writeable like list formatting does already.

Manishearth commented 1 year ago

What is your idea for CaseMappingLocale and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.

It's what's already used internally, it's a simple enum:

pub enum CaseMapLocale {
    Root,
    Turkish,
    Lithuanian,
    Greek,
    Dutch,
    Armenian,
}

and it basically covers the different casemapping special-case modes. ICU4C's API consumes a Locale; but it does feel potentially faster to not require a conversion each time, and exposing something that is From<Locale> seems fine.

This way we only require the actual subpart of the locale the algorithm cares about, instead of having clients treat it opaquely.

Manishearth commented 1 year ago

Copying over Markus' feedback from the ICU4X team meeting notes

What's the difference between try_new and try_new_with_locale?

Give examples for all the functions, especially to_titlecase(char)->char (dz -> Dz)

Do we support specialcasing.txt?

There are some things that are only in ICU and not in data. Example: Greek uppercasing, which drops most but not all accents, but with certain exceptions. It's not representable in data, so it is coded manually.

Long term we want the Edits API. Clients need it like Chrome.

People will want full titlecase, but it requires segmentation. Maybe make it pluggable with a trait. A default implementation could be to titlecase only the first character. When titlecasing, sometimes you want to leave interior characters along, and sometimes you want to convert them to lowercase.

Take a look at the ICU4J CaseMap API!

In ICU, the locale-special data are stored in the same trie as root data, so we can specify the locale later.

For case folding, the locale doesn't matter, only turkic and non-turkic. Right now it's confusing because it looks like the locale could influence case folding.

I do think I'm going to go along the path of doing titlecase with a segmentation trait.

sffc commented 1 year ago

If I had to choose between full-string titlecase and Greek uppercase, I think Greek uppercase is more important since it is about i18n correctness.

The Locale thing reminds me a lot of what happens with Collation and Segmentation tailorings.

Manishearth commented 1 year ago

I've copied the actionable bits of Markus' feedback, as well as stuff discussed here, into the issue above.

I'm not sure if it's either-or: full-string titlecase isn't that tricky, whereas Greek uppercase seems to involve reimplementing half of the uppercasing algorithm, without any spec for reference. It's a lot more work.

Manishearth commented 1 year ago

Discussions to have:

General API shape (Rust and FFI) for locale specific stuff
Greek uppercasing thing
Titlecasing design

sffc commented 1 year ago

Discuss with:

@sffc
@robertbastian
@eggrobin
@Manishearth

Manishearth commented 1 year ago

Discussion:

CaseMapLocale:

@robertbastian Probably should take a locale on the methods (as opposed to a custom enum)
@sffc Agree, I consider it a language hint
@eggrobin Note that folding should not be treated as locale-dependent, "turkic casefolding" should be considered to be a different operation
@eggrobin pointing out that the spec has four casefolding operations: toCaseFold, NFD ∘ toCaseFold ∘ NFD, NFKD ∘ toCaseFold ∘ NFD, toNFKC_Casefold ∘ NFD
Should we accept Language or Locale
@Manishearth: note that in the future someone may decide they want de-u-cm-ß
If we do an enum (or opaque struct wrapping an enum) it shuld be called CaseMapLanguageHint
Decision: LanguageIdentifier, Locale over FFI
Decision: No to_uppercase_root()

Titlecasing:

Don't expose full segmentation based API. Expose titlecase_segment(&str, &langid) -> String / Writeable that does not handle any segmentation
Add more if someone asks for it

Simple case mapping:

Methods called to_simple_lowercase(char) -> char.

Greek uppercasing:

@sffc Should not treat ICU4C parity as a goal
@Manishearth and @eggrobin: "i18n correct" is not feasible without tons of research
@sffc iff this is going to be coming to CLDR/some other body, we should wait for it. If not, we should block our release.
@Manishearth and @robertbastian do not disagree
- @eggrobin We could take the position that we do best-effort ICU4C parity
- Decision: Defer until PAG meeting, based on response decide to block release or not. Potentially implement anyway.

Agreed: @Manishearth @sffc @robertbastian @eggrobin

Manishearth commented 1 year ago

Graduation checklist

[x] The crate should fully conform with the ICU4X Style Guide:
- [x] The crate should have a complete library header as shown in boilerplate.md with Clippy passing
- [x] The names of exported types should conform to the recommendations in the Style Guide
- [x] All Error types should be Copy
- [x] All constructors should take options structs by value
- [x] All constructors should take arguments in the following order: Provider, Locale, Options
- [x] All constructors should have the standard set of overloads for provider types
- [x] Runtime dependencies should be minimal and able to be disabled with Cargo features if possible
- [ ] Any TODO, FIXME, todo!, unimplemented!, or other placeholders should either be resolved or link to an issue number. (It is okay to ship a small amount of code with tech debt comments, but anything having to do with code correctness should be resolved)
[x] The crate should have a conventional Cargo.toml file:
- [x] Cargo.toml should use license, not license-file
- [x] The description should be useful
- [x] Correct repo link, authors, categories, include
- [x] Correct docs.rs and cargo-all-features metadata settings
- [x] A rust-version field with the MSRV of this crate in accordance with the current ICU4X policies on MSRV.
- [x] The crate should have an std feature if (and only if) it contains code that depends on std, such as file I/O or implementing the Error trait
- [x] Audit all features so that any foo/bar in a feature is foo?/bar when foo is an optional dep; if you intend to enable foo, add two entries
- [x] Use dep: for enabling dependencies
[x] The crate should be fully documented
- [x] Every exported function should have docs coverage
- [x] There should be a crate-level example that illustrates a common use case for the component with the heading # Examples
- [x] All options and conditional code paths should have a corresponding docs test with the heading # Examples
- [x] All functions that are conditional on a Cargo feature should say so (last line before # Examples): ✨ *Enabled with the `alloc` Cargo feature.*
- [x] Compiled data constructors should say "with compiled data" in the first sentence and should have a Cargo feature alert following the above syntax.
[x] The data structs should fully follow ZeroVec style
- [x] Deserialization should not have a "net violation" in the verify-zero-copy test
- [x] Constructors should avoid allocating memory in the common case
- [x] Opaque blobs of data should be avoided if possible (instead use VarZeroVec, ZeroMap, etc.)
- [x] Data structs should not be panicky to load/deserialize and conform to data_safety.md
[x] The component should be fully integrated with ICU4X tooling
- [x] There should be an overview Criterion benchmark
- [x] There should be individual Criterion benchmarks for interesting or performance critical code paths
- [x] There should be at least one example plumbed with the icu_benchmark_macros
- [x] The component should be fully covered by FFI with no unjustified suppressions nor entries in missing_apis.txt
[x] The component should encourage i18n best practices
- [x] The component should produce correct results in all locales as generated by icu4x-datagen
- [x] Where applicable, the component should be consistent with ECMA-402 and UTS #35
- [x] Any gaps in i18n quality should be fixed, or, if that is not possible, they should have tracking issues and a concrete, resourced path forward. The intent is to not ship components with known i18n correctness problems and no plan to fix them in an upcoming release
- One known gap is irish casefolding, there is an upstream issue to figure that out CLDR-13985
- [x] The API design should receive sign-off from a non-ICU4X i18n expert such as Markus Scherer

Manishearth commented 1 year ago

A bunch of the ticked-off items above are fixed in https://github.com/unicode-org/icu4x/pull/3689

@sffc I think there are a couple entries in the checklist that may benefit from a 15 minute pair discussion where we verify it together. Specifically: the style guide naming part, i18n correctness, and data struct design

Manishearth commented 1 year ago

Only remaining stabilization blocker is Shane (or someone else) and I should go through everything.

(and then move the folder)

Manishearth commented 1 year ago

[x] debug_assert in Exceptions datagen when doing GIGO

sffc commented 1 year ago

I added the new checkboxes from #3693 to the comment above.

A few things I notice:

You have # Example instead of # Examples (I know, minor silly thing that I don't necessarily agree with, but this is the time to fix it)
I don't see examples for TailCasing or HeadAdjustment; it looks like you cover them in the docs for TitlecaseMapper itself, which is fine, but please add a link.
This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"

Manishearth commented 1 year ago

This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"

Yeah I was planning to do that later. But I'll just roll it into #3803

Manishearth commented 1 year ago

I think I have handled every box on this page except for https://github.com/unicode-org/icu4x/issues/3801 in https://github.com/unicode-org/icu4x/pull/3803

(even if not, I would prefer to land that rather than require that PR fix everything)

Manishearth commented 1 year ago

Final checkbox checked by https://github.com/unicode-org/icu4x/pull/3843

We're done!

unicode-org / icu4x

Stabilize the icu_casemap component #3234