Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties

sffc commented 2 years ago

We inconsistently name methods in the various properties and collections classes that deal with char vs u32. Examples: contains(char), contains_u32(u32), get(char), and get_u32(u32), but sometimes it is get(u32). Also, the get_u32 name sounds like it returns a u32, similar to get_ule, when in fact it is an overload of the get method.

Feedback from @markusicu.

Thoughts?

[ ] @echeran
[ ] @Manishearth

markusicu commented 2 years ago

I think the contains overloads work well. Consider changing get_u32 to get_for_u32 or get_from_u32. If a class/trait only ever deals with u32 and not char, then get(u32) should be fine.

Manishearth commented 2 years ago

I think get_from_u32 might be good yeah

Though I'm skeptical we should have these in the first place, I guess. It's easy enough to cast the char using as u32.

sffc commented 2 years ago

Concretely, the classes and functions in question are

CodePointInversionList: contains, contains_u32
CodePointSetDataBorrowed: contains, contains_u32
CodePointTrie: get, get_u32, get_ule
CanonicalCombiningClassMap: get, get_u32
CodePointMapDataBorrowed: get, get_u32
PropertyCodePointMapV1 internal type: get, get_u32
Char16TrieIterator: next, next_u16, next_u32

CodePointTrie::get is the only non-suffixed function that takes a u32 argument; everywhere else, the non-suffixed function consistently takes a char. In CodePointTrie, get_u32 returns a u32.

I'm not sure what my preference is. I'm okay leaving things the way they are, and considering CodePointTrie a special case since it is a low-level collection type. If we start renaming things, what about:

get32 (more concise and doesn't as strongly suggest that we are getting a u32)
geti ("get by integer")
getu ("get by unsigned integer")

markusicu commented 2 years ago

Note: The data structures are designed to map from code points to values. In Rust, supporting all code points requires u32 because char forbids surrogate code points.

Therefore, one could argue that the primary input should be a u32. Lookup via char would use a cast, or an "overload".

get_u32 taking a u32 instead of returning a u32 seems misleading. getu would be better.
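
To illustrate the point about surrogates, here is a minimal sketch using only the Rust standard library. The helper functions is_code_point and is_surrogate are hypothetical names for this example and are not ICU4X APIs.

```rust
/// Returns true if `cp` is a Unicode code point (0..=0x10FFFF), surrogates included.
/// Hypothetical helper for illustration only.
fn is_code_point(cp: u32) -> bool {
    cp <= 0x10FFFF
}

/// Returns true if `cp` is a surrogate code point, which `char` cannot represent.
/// Hypothetical helper for illustration only.
fn is_surrogate(cp: u32) -> bool {
    (0xD800..=0xDFFF).contains(&cp)
}

fn main() {
    // Every char widens losslessly to a u32.
    assert_eq!('é' as u32, 0xE9);
    assert_eq!(u32::from('é'), 0xE9);

    // U+D800 is a code point but not a scalar value, so char rejects it.
    assert!(is_code_point(0xD800) && is_surrogate(0xD800));
    assert_eq!(char::from_u32(0xD800), None);
    assert!(char::try_from(0xDFFF_u32).is_err());

    // Within the code point range, conversion fails exactly for surrogates.
    for cp in [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF] {
        assert_eq!(char::from_u32(cp).is_none(), is_surrogate(cp));
    }

    // Values above U+10FFFF are not code points at all.
    assert!(!is_code_point(0x110000));
    assert_eq!(char::from_u32(0x110000), None);
}
```

So a lookup that takes only char cannot be asked about U+D800..=U+DFFF, while a lookup that takes u32 can.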

sffc commented 2 years ago

Discussion:

@robertbastian - u32_get()
@zbraniecki - Rust generally recommends not to have a get_ prefix
@zbraniecki - get_with_u32, get_from_u32
@sffc - These are basically overloads; should we consider shorter method names? like get32
@Manishearth - I like get32 because it doesn't tell my brain that I am getting a u32 return value
@zbraniecki - Do we need both versions with the overloads?
@Manishearth - The use case is that if you have a u32 character, but you don't know if it's valid, you can query the ICU4X collection and get back a valid value, which could be a default/error value.
@zbraniecki - Seems like the Rusty thing is to use TryInto for char. The version using u32 is more specific.
@sffc - There are valid use cases for the u32. If we have only one getter, it should be the u32 version, not the char version.
@robertbastian - Should we implement the Index trait?
@sffc - The Index trait returns by reference and we return by value
@Manishearth - That's a long-standing issue with the Index trait
@robertbastian - TryInto is not the Rusty way to access a collection
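
The Index point above can be sketched as follows. Both types are hypothetical toys and do not reflect how ICU4X lays out its data; the sketch only assumes that std::ops::Index::index must return a reference, so it fits a type that stores each value as-is but not one that decodes values on lookup.

```rust
use std::ops::Index;

/// Hypothetical table that stores each value directly, so it can lend a reference.
struct StoredTable {
    values: Vec<u8>,
}

impl Index<u32> for StoredTable {
    type Output = u8;

    // Index::index must return a reference borrowed from `self`.
    fn index(&self, cp: u32) -> &u8 {
        &self.values[cp as usize]
    }
}

/// Hypothetical table that packs two 4-bit values per byte. An unpacked value
/// exists only as a temporary, so there is nothing in `self` to borrow.
struct PackedTable {
    nibbles: Vec<u8>,
}

impl PackedTable {
    /// Returns the value by value; out-of-range input yields the default 0.
    fn get32(&self, cp: u32) -> u8 {
        match self.nibbles.get((cp / 2) as usize) {
            Some(&byte) if cp % 2 == 0 => byte & 0x0F, // even index: low nibble
            Some(&byte) => byte >> 4,                  // odd index: high nibble
            None => 0,
        }
    }
}

fn main() {
    let stored = StoredTable { values: vec![7, 9] };
    assert_eq!(stored[1], 9);

    let packed = PackedTable { nibbles: vec![0x97] };
    assert_eq!(packed.get32(0), 7);
    assert_eq!(packed.get32(1), 9);
    assert_eq!(packed.get32(0x10FFFF), 0);
}
```

Note also that indexing panics on a missing entry in StoredTable, whereas the by-value getter can return a default.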

Proposal:

get(char)
get32(u32)
contains(char)
contains32(u32)
next(char)
next16(u16) // code unit (Char16Trie only)
next32(u32)

OK: @sffc @Manishearth @robertbastian @nordzilla
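
A minimal sketch of how the proposed names could look on a type, assuming a hypothetical ToyMap that is not an ICU4X type. Following the discussion, the u32 method is the primary lookup and the char method is a thin overload; treating "has a non-default value" as membership is purely for illustration.

```rust
/// Hypothetical toy map from code points to a small value; not an ICU4X type.
struct ToyMap {
    /// Non-overlapping (start, end_inclusive, value) ranges.
    ranges: Vec<(u32, u32, u8)>,
    /// Returned for any input not covered by a range, including non-code-points.
    default: u8,
}

impl ToyMap {
    /// Primary lookup: accepts any u32, including surrogates and out-of-range values.
    fn get32(&self, cp: u32) -> u8 {
        self.ranges
            .iter()
            .find(|&&(start, end, _)| start <= cp && cp <= end)
            .map_or(self.default, |&(_, _, value)| value)
    }

    /// Convenience overload for callers that already hold a char.
    fn get(&self, ch: char) -> u8 {
        self.get32(ch as u32)
    }

    /// Illustration only: membership means "maps to a non-default value".
    fn contains32(&self, cp: u32) -> bool {
        self.get32(cp) != self.default
    }

    /// Convenience overload for callers that already hold a char.
    fn contains(&self, ch: char) -> bool {
        self.contains32(ch as u32)
    }
}

/// Builds a toy map: 'A'..='Z' map to 1, 'a'..='z' map to 2, everything else to 0.
fn toy_map() -> ToyMap {
    ToyMap {
        ranges: vec![(0x41, 0x5A, 1), (0x61, 0x7A, 2)],
        default: 0,
    }
}

fn main() {
    let map = toy_map();
    assert_eq!(map.get('A'), 1);
    assert_eq!(map.get32(0x7A), 2);
    // A surrogate is not a valid char, but get32 still answers with the default.
    assert_eq!(map.get32(0xD800), 0);
    assert!(map.contains('q'));
    assert!(!map.contains32(0x110000));
}
```

With these names, nothing in get32 suggests that the return type is a u32, which was the concern with get_u32.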

markusicu commented 2 years ago

The proposal wfm.

robertbastian commented 1 month ago

Still needs docs work

robertbastian commented 1 month ago

Given that we have decided to use try_from_utf8 for unvalidated string constructors, I'd like to reopen this discussion. I think a more consistent naming for the 32 methods would now be contains_utf32. Is this worth changing?

sffc commented 1 month ago

According to the discussion above, the problem was with get. The thinking was that if you say get_utf32, it looks like you are getting a UTF-32 code unit, when in reality you are passing one in as a parameter. (I don't know how I personally feel.)

sffc commented 1 month ago

@Manishearth - I'm fine with get and contains being in different namespaces
@sffc - We have things like get32_u32
@sffc - A difference between these and try_from_utf8 is that one works on code points and the other works on strings
@robertbastian - I think it's almost the same to take a sequence of unvalidated UTF-8 code units, or a single unvalidated UTF-32 code unit
@robertbastian - There's also next16, whose documentation isn't great: https://unicode-org.github.io/icu4x/rustdoc/icu/collections/char16trie/struct.Char16TrieIterator.html#method.next16
@sffc - Why do we even return unvalidated things?
@robertbastian - Seems like we should return the upgraded type and let people convert to a downgraded type.
@sffc - There's also get32_ule which returns a reference, which we actually want.
@echeran - It seems fine to me if get and contains have different signatures. contains always returns a bool.
@robertbastian - We also have next and other methods. I would like it if we didn't need to make this decision for each API. If we get rid of get32_u32, we can adopt my proposed naming scheme.
@sffc - How would we name get32_ule?
@robertbastian - Probably get_ule_utf32
@robertbastian - We say "utf32" which is short for "potential_utf32"
@echeran - It seems redundant to say "potential_utf32" for a single code unit; may as well just be "u32". If you want to communicate any further, you have docs.
@robertbastian - u32 is already in the signature, it's redundant to put it twice. Also, utf32 highlights that it's meant to be an unvalidated char, whereas a u32 could be a number or something
@sffc - I don't really think that the decision we made in 2.0 for string functions should invalidate the decision we made in 1.0 for code point functions. A lot of the arguments being brought up here are the same ones that we had previously discussed.
@robertbastian - a utf32 suffix was not discussed back then and I think it's way better. Otherwise I'll need to go and write that the u32 is a potential UTF-32 code unit in all the docs

No conclusion yet.

markusicu commented 1 month ago

utf32 is a string encoding. u32 is one possible type for a code point.

sffc commented 1 month ago

@sffc - We previously discussed the naming with regard to underscores. I don't see a great way to address that.
@robertbastian - I think utf32 is clearer, but it's just an improvement, and if we don't have consensus, keeping the current naming is less work.

Conclusion:

Stick with get32-type naming in ICU4X 2.0
Discuss with @markusicu (not blocking 2.0) what to say in the docs:
- "a potentially ill-formed Unicode code point"
- "a UTF-32 code unit"
- "a potentially ill-formed UTF-32 code unit"

LGTM: @sffc @robertbastian

Manishearth commented 1 month ago

LGTM

unicode-org / icu4x

Evaluate consistency and naming of char vs u32 methods in icu_collections and icu_properties #2413