Closed sffc closed 1 year ago
Added a checklist, please append more items to it
[ ] Move data struct validation into
deserialize
to allow validation-free databake
@robertbastian i'm not convinced we should be doing the heavy validation in serde: it's debug-assertions only, and the struct has GIGO behavior if you give it bad data. I think this is the right call, since there's a lot of work in properly validating this data otherwise.
There are a couple places that are currently relying on validate that I need to GIGO
I don't care, as long as we don't do any validation in databake and I can make the constructor const
There shouldn't be right now, if there is you are welcome to remove it
Useful note for later: the unfold data currently in use in icu4x
Should https://github.com/unicode-org/icu4x/issues/3552 be a stabilization blocker? Feels like we can count it as a known bug.
Also, ICU4C supports str.toTitle(locale, options)
. Should we as well? Currently we do not support options
, we can add them as an API later (to_upper_with_options()
or something)
Also str.caseCompare(str2)
, whcih also takes options
The toTitle()
functions need a break iterator: we need to add APIs which:
I also think this can be designed in a later release.
question (@sffc , @robertbastian ): Do the current function names look good to you?
to_uppercase(char) -> char
to_full_uppercase(&str) -> Writeable
to_full_uppercase_string(&str) -> String
I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()
A change we should make is move the locale from the constructor to the methods. We can make them all take CaseMappingLocale
and have people call .into()
or Default::default()
. We could also give them _with_locale()
versions but that might lead to a large proliferation.
And how should it look over FFI (where locales are not free to instantiate). I guess we can expose that enum.
Thoughts?
Do the current function names look good to you?
to_uppercase(char) -> char to_full_uppercase(&str) -> Writeable to_full_uppercase_string(&str) -> String I somewhat feel like the stringy one should be the one with the shorter name, and the char one should be to_uppercase_char()
The simple mappings and foldings (which you should practically never use unless you have specific compatibility requirements) having shorter and more default-looking names than the default ones seems like a bad idea. I would favour renaming all of the char-to-char functions to have simple
in the name, this is a well-established term in this context.
A change we should make is move the locale from the constructor to the methods.
Something like that would be a good idea. It is very weird that the case foldings look like they depend on the locale, which they do not and must not.
Maybe to_uppercase(&str) -> Writeable
, to_uppercase_string(&str) -> String
, to_uppercase_char(char) -> char
?
What is your idea for CaseMappingLocale
and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.
We might actually want to accept W: Writeable
like list formatting does already.
What is your idea for
CaseMappingLocale
and how does it differ from Locale or DataLocale? Keep in mind that we're still pending the rearchitecting of ICU4X's Locale/Preferences handling so I might not want to deviate too far from the existing types given that they might change again in the near future.
It's what's already used internally, it's a simple enum:
pub enum CaseMapLocale {
Root,
Turkish,
Lithuanian,
Greek,
Dutch,
Armenian,
}
and it basically covers the different casemapping special-case modes. ICU4C's API consumes a Locale; but it does feel potentially faster to not require a conversion each time, and exposing something that is From<Locale>
seems fine.
This way we only require the actual subpart of the locale the algorithm cares about, instead of having clients treat it opaquely.
Copying over Markus' feedback from the ICU4X team meeting notes
- What's the difference between try_new and try_new_with_locale?
- Give examples for all the functions, especially to_titlecase(char)->char (dz -> Dz)
- Do we support specialcasing.txt?
- There are some things that are only in ICU and not in data. Example: Greek uppercasing, which drops most but not all accents, but with certain exceptions. It's not representable in data, so it is coded manually.
- Long term we want the Edits API. Clients need it like Chrome.
- People will want full titlecase, but it requires segmentation. Maybe make it pluggable with a trait. A default implementation could be to titlecase only the first character. When titlecasing, sometimes you want to leave interior characters along, and sometimes you want to convert them to lowercase.
- Take a look at the ICU4J CaseMap API!
- In ICU, the locale-special data are stored in the same trie as root data, so we can specify the locale later.
- For case folding, the locale doesn't matter, only turkic and non-turkic. Right now it's confusing because it looks like the locale could influence case folding.
I do think I'm going to go along the path of doing titlecase with a segmentation trait.
If I had to choose between full-string titlecase and Greek uppercase, I think Greek uppercase is more important since it is about i18n correctness.
The Locale thing reminds me a lot of what happens with Collation and Segmentation tailorings.
I've copied the actionable bits of Markus' feedback, as well as stuff discussed here, into the issue above.
I'm not sure if it's either-or: full-string titlecase isn't that tricky, whereas Greek uppercase seems to involve reimplementing half of the uppercasing algorithm, without any spec for reference. It's a lot more work.
Discussions to have:
Discuss with:
Discussion:
CaseMapLocale:
de-u-cm-ß
CaseMapLanguageHint
to_uppercase_root()
Titlecasing:
titlecase_segment(&str, &langid) -> String / Writeable
that does not handle any segmentationSimple case mapping:
to_simple_lowercase(char) -> char
. Greek uppercasing:
Agreed: @Manishearth @sffc @robertbastian @eggrobin
Graduation checklist
Copy
TODO
, FIXME
, todo!
, unimplemented!
, or other placeholders should either be resolved or link to an issue number. (It is okay to ship a small amount of code with tech debt comments, but anything having to do with code correctness should be resolved)rust-version
field with the MSRV of this crate in accordance with the current ICU4X policies on MSRV.std
feature if (and only if) it contains code that depends on std
, such as file I/O or implementing the Error traitfoo/bar
in a feature is foo?/bar
when foo
is an optional dep; if you intend to enable foo
, add two entriesdep:
for enabling dependencies# Examples
# Examples
# Examples
): ✨ *Enabled with the `alloc` Cargo feature.*
A bunch of the ticked-off items above are fixed in https://github.com/unicode-org/icu4x/pull/3689
@sffc I think there are a couple entries in the checklist that may benefit from a 15 minute pair discussion where we verify it together. Specifically: the style guide naming part, i18n correctness, and data struct design
Only remaining stabilization blocker is Shane (or someone else) and I should go through everything.
(and then move the folder)
I added the new checkboxes from #3693 to the comment above.
A few things I notice:
# Example
instead of # Examples
(I know, minor silly thing that I don't necessarily agree with, but this is the time to fix it)TitlecaseMapper
itself, which is fine, but please add a link.This check box is not checked yet: "There should be at least one example plumbed with the icu_benchmark_macros"
Yeah I was planning to do that later. But I'll just roll it into #3803
I think I have handled every box on this page except for https://github.com/unicode-org/icu4x/issues/3801 in https://github.com/unicode-org/icu4x/pull/3803
(even if not, I would prefer to land that rather than require that PR fix everything)
Final checkbox checked by https://github.com/unicode-org/icu4x/pull/3843
We're done!
This issue tracks the work to release icu_casemap as a stable component.
Checklist (not exhaustive)
full_fold()
functions should work with Writeable (https://github.com/unicode-org/icu4x/pull/3544)no_std
(https://github.com/unicode-org/icu4x/pull/3544)deserialize
to allow validation-free databaketitlecase_segment()
(https://github.com/unicode-org/icu4x/pull/3593)TitlecaseMapper<impl AsRef<CaseMapper>>
type, ctors:new()
,new_legacy()
, and then the same ones_with_mapper
https://github.com/unicode-org/icu4x/pull/3779TitleCaseOptions
, fieldsHeadAdjustment { #[default] Adjust, NoAdjust }
,TailCasing { #[default] Lowercase, PreserveCase }
https://github.com/unicode-org/icu4x/pull/3779titlecase_segment_legacy()
on CaseMapper if wanted https://github.com/unicode-org/icu4x/pull/3779CaseCloser
wrapper for case closure things, similar ctors (https://github.com/unicode-org/icu4x/pull/3759)