unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.39k stars 179 forks source link

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413

Open sffc opened 2 years ago

sffc commented 2 years ago

We inconsistently name methods in the various properties and collections classes that deal with char vs u32. Examples: contains(char), contains_32(u32), get(char), and get_u32(u32), but sometimes it is get(u32). And the get_u32 name sounds like it is returning a u32, similar to get_ule, when in fact it is an overload of the get method.

Feedback from @markusicu.

Thoughts?

markusicu commented 2 years ago

I think the contains overloads work well. Consider changing get_u32 to get_for_u32 or get_from_u32. If a class/trait only ever deals with u32 and not char, then get(u32) should be fine.

Manishearth commented 2 years ago

I think get_from_u32 might be good yeah

Though I'm skeptical we should have these in the first place, I guess. it's easy enough to as u32 the char.

sffc commented 2 years ago

Concretely, the classes and functions in question are

CodePointTrie::get is the only non-suffixed function to take a u32 argument. In CodePointTrie, get_u32 returns a u32. In all other places, we are consistent in taking a char.

I'm not sure what my preference is. I'm okay leaving things the way they are, and considering CodePointTrie a special case since it is a low-level collection type. If we start renaming things, what about:

markusicu commented 2 years ago

Note: The data structures are designed to map from code points to values. In Rust, supporting all code points requires u32 because char forbids surrogate code points.

Therefore, one could argue that the primary input should be a u32. Lookup via char would use a cast, or an "override".


get_u32 taking a u32 instead of returning a u32 seems misleading. getu would be better.

sffc commented 2 years ago

Discussion:

Proposal:

OK: @sffc @Manishearth @robertbastian @nordzilla

markusicu commented 2 years ago

The proposal wfm.

robertbastian commented 1 month ago

Still needs docs work

robertbastian commented 1 month ago

Given that we have decided to use try_from_utf8 for unvalidated string constructors, I'd like to reopen this discussion. I think a more consistent naming for the 32 methods would now be contains_utf32. Is this worth changing?

sffc commented 1 month ago

The problem was with get according to the discussion above. If you say get_utf32, the thinking was, then it looks like you are getting a UTF-32 code unit, when in reality you are passing one in as a parameter. (I don't know how I personally feel)

sffc commented 1 month ago

No conclusion yet.

markusicu commented 1 month ago

utf32 is a string encoding. u32 is one possible type for a code point.

sffc commented 1 month ago

Conclusion:

LGTM: @sffc @robertbastian

Manishearth commented 1 month ago

LGTM