unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.
https://icu4x.unicode.org
Other
1.29k stars 165 forks source link

Finalize name and location of UnvalidatedX #3546

Open sffc opened 1 year ago

sffc commented 1 year ago

Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.

There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.

I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.

How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)

Discuss with:

Optional:

ajtribick commented 1 year ago

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

robertbastian commented 1 year ago

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

Thanks for pointing this out, TIL 🤢

robertbastian commented 1 year ago

I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr, however serde-maybe is already taken unfortunately.

skius commented 1 year ago

Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?

robertbastian commented 1 year ago

Previous discussion and background why we want the extra layer: https://github.com/unicode-org/icu4x/issues/2489

sffc commented 1 year ago

Two main reasons for the abstraction:

  1. We can use serialize_bytes
  2. We can serialize as a string for readability in JSON
sffc commented 1 year ago

The current semantics of UnvalidatedStr are:

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

sffc commented 1 year ago

A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.

robertbastian commented 1 year ago

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

skius commented 1 year ago

I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?

Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?

robertbastian commented 1 year ago

Haha I love ShadyStr. SketchyStr could be another option.

skius commented 1 year ago

My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr - to me this sounds like the case where the value is not a Str is supported, like the Haskell Maybe (Rust's Option), or is there some precedent for using Maybe to mean Unvalidated?

Manishearth commented 1 year ago

I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.

I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.

skius commented 1 year ago

I agree with @Manishearth about the naming, UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.

sffc commented 1 year ago

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

This was a progression from my suggestion that it may be useful to rethink the semantics:

The current semantics of UnvalidatedStr are:

  • Serialization: bytes in binary, string or error in human-readable
  • Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible.

robertbastian commented 4 months ago

The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char

Manishearth commented 4 months ago

Not opposed.

sffc commented 4 months ago

I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

sffc commented 4 months ago

Should we consider a name like PotentiallyIllFormedUtf8?

Is there a semantic difference?

robertbastian commented 4 months ago

Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

I really really dislike all the "clever" names.

Manishearth commented 4 months ago

In this case I can't immediately think of anything and unvalidated is available and accurate, we should just use that imo.

sffc commented 4 months ago

I am not very happy with that proposal because:

  1. In the OP, I said I would prefer to avoid negated terms like un or dis or non. I prefer to start with a positive term.
  2. unvalidated::Str violates our style guide naming. We don't want to import unvalidated::Str and use Str at call sites; that is super confusing. We could need to either import it as unvalidated::Str as UnvalidatedStr or export unvalidated::UnvalidatedStr.
sffc commented 4 months ago

Bikeshed, with some help from gemini:

If we don't like any of those, how about choosing a letter of the alphabet, like

Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.

Manishearth commented 4 months ago

I'm fine with unvalidated::UnvalidatedStr.

I don't really think any of the other names work well here. I think the un is warranted here.

sffc commented 4 months ago

What did you all think of PotentiallyIllFormedUtf8

"Potentially ill-formed UTF-8" is an established concept in the industry. Actually we already use it in icu_segmenter.

The type name is a bit long, so we could do (bikeshed):

I also observe that the crate name pill is available. How about pill::PillStr? 😃

"Pill" is aligned with our hospital theme of "ICU"

sffc commented 4 months ago

A potential advantage of this approach is that we could start having APIs such as

pub fn process(s: &str)
pub fn process_utf8(s: &PillStr)
pub fn process_utf16(s: &PillUtf16)
Manishearth commented 4 months ago

What did you all think of PotentiallyIllFormedUtf8

Not strongly opposed, feels long. Still prefer the current Unvalidated naming.

Not a fan of pill. I think it's okayish.

Manishearth commented 3 months ago

Discussion Conclusions:

  1. This lives in a crate named after the type (probably), that depends on serde and zerovec, with feature flags.
  2. Fine with using these for _utf8 and _utf16 APIs that accept potentially invalid stuff

LGTM: @manishearth, @sffc, @robertbastian

Name conclusion: bikeshed later (Manish and Rob prefer Unvalidated), also need to bikeshed _utf8 suffix

sffc commented 2 months ago

The crate should also have an optional dependency on writeable as discussed in https://github.com/unicode-org/icu4x/pull/4786#discussion_r1559627673

sffc commented 2 months ago

I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.

  1. potentially_ill_formed
    • 1a potentially_ill_formed::PotentiallyIllFormedStr, potentially_ill_formed::PotentiallyIllFormedUtf16
    • 1b potentially_ill_formed::PotentiallyIllFormedUtf8, potentially_ill_formed::PotentiallyIllFormedUtf16
  2. pill
    • 2a pill::PillStr, pill::Pill16
    • 2b pill::PillUtf8, pill::PillUtf16
    • 2c pill::Pill8, pill::Pill16
  3. unvalidated
    • 3a unvalidated::UnvalidatedStr, unvalidated::UnvalidatedUtf16
    • 3b unvalidated::UnvalidatedUtf8, unvalidated::UnvalidatedUtf16
  4. quasistr
    • 4a quasistr::QuasiStr, quasistr::Quasi16
    • 4b quasistr::QuasiUtf8, quasistr::QuasiUtf16
    • 4c quasistr::Quasi8, quasistr::Quasi16
  5. quasi_str
    • 5a quasi_str::QuasiStr, quasi_str::Quasi16
    • 5b quasi_str::QuasiUtf8, quasi_str::QuasiUtf16
    • 5b quasi_str::Quasi8, quasi_str::Quasi16

Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.

robertbastian commented 2 months ago

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

robertbastian commented 2 months ago
  1. maybe_utf::MaybeUtf8, maybe_utf::MaybeUtf16
  2. potential_utf::PotentialUtf8, potential_utf::PotentialUtf16
sffc commented 2 months ago

Proposed ballot language:

The ICU4X Working Group would like to introduce a crate containing a type that can be infallibly converted from [u8], fallibly converted to a str, and has impls for serde, zerovec, and writeable that assume the content is UTF-8 but may write replacement characters if the bytes are ill-formed. There will be a sibling type that works on [u16]. These types are intended to be used widely within data structs, FFI, and API as necessary. The names of fields, function parameters, and function names are out of scope of this bikeshed.

sffc commented 2 months ago

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

I added the UTF-16 naming to the options. I don't necessarily feel the (a)s are automatically ruled out but you can vote them down in the ballot if you feel this way. There are some options I plan to be voting down.

Manishearth commented 2 months ago

I'm fine with the options so far.

echeran commented 2 months ago

When sitting down to vote and take a closer look at the options, I didn't like the choice of word / prefix in those options to indicate the uncertainty of the well-formedness of the UTF-{8, 16} encoding adhered to by the string.

Comments per option:

  1. PotentiallyIllFormedUtf{8,16} - the substring prefix PotentiallyIllFormed is 100% accurate but long. Since there are options that are shorter and imply the same meaning well enough, I think the length is too long.
  2. Pill - this doesn't make sense without knowing the phrase PotentiallyIllFormed and realizing that this is a shortening. Given that ICU4X has a lot of constituent crates that interact with each other, I think it's good to avoid codenames / puns / etc. if there are common words (or combinations thereof) that can describe the type/crate so that we don't create yet another bespoke language that makes it harder for new users to grok. Especially if it's for a "lower-level" type in the sense that it might occur in multiple places throughout the codebase.
  3. Unvalidated - I do not think that this is a good descriptor because it might imply that the string needs to be validated before it can be used, but this is not the case. The "validity" is in relation to UTF-{8,16} well-formedness, and the lack of preciseness causes enough ambiguity to be undesriable.
  4. Quasi has the same problem as in Option 3, but it's even less clear that "quasi" is in regards to the UTF well-formedness, as opposed to the validity of the string itself not being a true string (character sequence).
  5. MaybeUtf{8,16} is not bad because it is fairly short and it gets across fairly clearly what we're talking about (the uncertainty pertains to UTF)
  6. Potential is similar to Option 5 but a slightly longer and more unfamiliar term than Maybe when it comes to naming an identifier for a type/variable/function in code, AFAICT. I think Option 5 is strictly better than this option for that reason.

I want to offer a few more options to try to capture different angles. For example, when thinking about the type of element that we have a sequence of, we know more about it than u8 but we also can't assert that it has all of the aspects/constraints of char.

hsivonen commented 2 months ago

@sffc asked me to comment on the naming here:

I think "potentially ill-formed" is the most correct term, but it makes for very long identifiers, so as a matter of type naming, I prefer Unvalidated and put PotentiallyIllFormed second.

CodeUnits seem technically correct, but relative to Unvalidated doesn't emphasize that the point is the potential ill-formedness.

I think we shouldn't use Pill: it's not clear as an abbreviation, and I don't want the naming to be evocative of any slang or memes involving "pill".

Maybe is suggestive of type-wise either-or in the enum sense.

Quasi isn't clear on quasi in what sense.

Raw is suggestive of RawVec and the like. I think Raw in Rust libraries doesn't suggest that the type is allowed to violate invariants but that the implementation itself does not take care of upholding the invariants that are required to be upheld.

StrBytes has the problem that it is not, in fact, necessarily the bytes of str but the whole point is potentially holding bytes that aren't OK as str.

sffc commented 2 months ago

What I hear from the discussions here is that everyone is in general agreement that PotentiallyIllFormedUtf8 correctly conveys the semantics, but people prefer something shorter. We have 15 shorter names on the table (including variants), and we haven't been able to converge on one yet.

If I may try to summarize the pros and cons:


I'm still searching for the best compromise with the fewest cons. Doing this writeup makes me think that Potential might be a good compromise. The only downside of "potential" that has been expressed in this thread was Elango's comment that "potential" was strictly worse than "maybe"; however, since that comment was posted, Henri noted that "maybe" implies an enum-like semantic in Rust. I also like that "potential" is a shortening of the phrase "potentially ill-formed", which we all agree on.

Can anyone verbalize any more downsides to the Potential prefix?

Also, if I failed to note any pros or cons of the options, please reply to this issue.

Manishearth commented 2 months ago

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

sffc commented 2 months ago

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

The proposed name is PotentialUtf8; does that change your position?

Do you feel both Potential and Maybe imply the Option semantics equally, or does one of them imply it more strongly than the other?

Manishearth commented 2 months ago

Maybe is stronger (EDIT @sffc: Maybe more strongly suggests the Option semantics)

PotentialUtf8 seems ok to me

sffc commented 2 months ago

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

I added this to the "pros" of unvalidated.

sffc commented 2 months ago

As stated above in https://github.com/unicode-org/icu4x/issues/3546#issuecomment-2065122439, I'd like to propose the following as a compromise.

Crate Name: potential_utf

Primary types:

Potential (🤣) expansion types to be added if needed:

Although this was not anyone's first choice, it is the only option for which a fundamental flaw was not verbalized in this thread (except for perhaps quasi, whose objections I would characterize as more a distaste for the less-precise / clever naming convention). To read up on the flaws in every other option, see my post above.

Please check the box if you have no objection to the proposal.

echeran commented 2 months ago

I still prefer MaybeUtf{8,16} over PotentialUtf{8,16}. The word "maybe" says that it "could be X, it could also not be X", where here, X = valid UTF-{8,16} encoding form. The word "potential" has a similar meaning but also implies a time component: "in the future, X has the potential to become Y". Plus, the word "potential" is longer and less frequently used in programming identifiers in my experience.

Manishearth commented 2 months ago

Maybe has a strong connotation of being an Option in Rust.

sffc commented 2 months ago

Ok, we'll schedule a follow-up meeting. Alternatively, we could cover this in tomorrow's ICU4X-WG call assuming all the attendees are able to make it (@hsivonen @sffc @Manishearth @echeran).

sffc commented 2 months ago

I think there are basically two options:

  1. We agree as a group that the fully spelled out PotentiallyInvalidUtf8 is fine
  2. We go over the shorter options and choose the one whose downsides are least bad to us as a group
hsivonen commented 2 months ago

How will this relate to https://github.com/BurntSushi/bstr ?

Manishearth commented 2 months ago

@echeran is your comment a strong objection to Potential or a weak preference for Maybe?

I'm open to trying bstr instead. However it does have a non optional dependency on memchr.

sffc commented 2 months ago

My take: