Finalize name and location of UnvalidatedX

sffc commented 1 year ago

Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.

There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.

I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.

How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)

Discuss with:

@Manishearth
@robertbastian
@sffc

Optional:

@echeran
@skius

ajtribick commented 1 year ago

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

robertbastian commented 1 year ago

I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.

Thanks for pointing this out, TIL 🤢

robertbastian commented 1 year ago

I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr, however serde-maybe is already taken unfortunately.

skius commented 1 year ago

Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?

robertbastian commented 1 year ago

Previous discussion and background why we want the extra layer: https://github.com/unicode-org/icu4x/issues/2489

sffc commented 1 year ago

Two main reasons for the abstraction:

We can use serialize_bytes
We can serialize as a string for readability in JSON

sffc commented 1 year ago

The current semantics of UnvalidatedStr are:

Serialization: bytes in binary, string or error in human-readable
Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

sffc commented 1 year ago

A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.

robertbastian commented 1 year ago

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

skius commented 1 year ago

I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?

Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?

robertbastian commented 1 year ago

Haha I love ShadyStr. SketchyStr could be another option.

skius commented 1 year ago

My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr - to me this sounds like the case where the value is not a Str is supported, like the Haskell Maybe (Rust's Option), or is there some precedent for using Maybe to mean Unvalidated?

Manishearth commented 1 year ago

I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.

I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.

skius commented 1 year ago

I agree with @Manishearth about the naming, UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.

sffc commented 1 year ago

BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.

This was a progression from my suggestion that it may be useful to rethink the semantics:

The current semantics of UnvalidatedStr are:

Serialization: bytes in binary, string or error in human-readable

Deserialization: bytes in binary, string in human-readable

This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.

In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other model that is more flexible.

robertbastian commented 4 months ago

The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char

Manishearth commented 4 months ago

Not opposed.

sffc commented 4 months ago

I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

sffc commented 4 months ago

Should we consider a name like PotentiallyIllFormedUtf8?

Is there a semantic difference?

robertbastian commented 4 months ago

Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?

I really really dislike all the "clever" names.

Manishearth commented 4 months ago

In this case I can't immediately think of anything and unvalidated is available and accurate, we should just use that imo.

sffc commented 4 months ago

I am not very happy with that proposal because:

In the OP, I said I would prefer to avoid negated terms like un or dis or non. I prefer to start with a positive term.
unvalidated::Str violates our style guide naming. We don't want to import unvalidated::Str and use Str at call sites; that is super confusing. We could need to either import it as unvalidated::Str as UnvalidatedStr or export unvalidated::UnvalidatedStr.

sffc commented 4 months ago

Bikeshed, with some help from gemini:

RawStr
DeferredStr
QuasiStr
ChameleonStr
ProspectStr
OptimisticStr
PotentialStr

If we don't like any of those, how about choosing a letter of the alphabet, like

WStr
XStr
YStr
ZStr

Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.

Manishearth commented 4 months ago

I'm fine with unvalidated::UnvalidatedStr.

I don't really think any of the other names work well here. I think the un is warranted here.

sffc commented 4 months ago

What did you all think of PotentiallyIllFormedUtf8

"Potentially ill-formed UTF-8" is an established concept in the industry. Actually we already use it in icu_segmenter.

The type name is a bit long, so we could do (bikeshed):

potentially_ill_formed::PifStr
potentially_ill_formed::PillStr
potentially_ill_formed::PIllStr

I also observe that the crate name pill is available. How about pill::PillStr? 😃

"Pill" is aligned with our hospital theme of "ICU"

sffc commented 4 months ago

A potential advantage of this approach is that we could start having APIs such as

pub fn process(s: &str)
pub fn process_utf8(s: &PillStr)
pub fn process_utf16(s: &PillUtf16)

Manishearth commented 4 months ago

What did you all think of PotentiallyIllFormedUtf8

Not strongly opposed, feels long. Still prefer the current Unvalidated naming.

Not a fan of pill. I think it's okayish.

Manishearth commented 3 months ago

Discussion Conclusions:

This lives in a crate named after the type (probably), that depends on serde and zerovec, with feature flags.
Fine with using these for _utf8 and _utf16 APIs that accept potentially invalid stuff

LGTM: @manishearth, @sffc, @robertbastian

Name conclusion: bikeshed later (Manish and Rob prefer Unvalidated), also need to bikeshed _utf8 suffix

sffc commented 2 months ago

The crate should also have an optional dependency on writeable as discussed in https://github.com/unicode-org/icu4x/pull/4786#discussion_r1559627673

sffc commented 2 months ago

I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.

potentially_ill_formed
- 1a potentially_ill_formed::PotentiallyIllFormedStr, potentially_ill_formed::PotentiallyIllFormedUtf16
- 1b potentially_ill_formed::PotentiallyIllFormedUtf8, potentially_ill_formed::PotentiallyIllFormedUtf16
pill
- 2a pill::PillStr, pill::Pill16
- 2b pill::PillUtf8, pill::PillUtf16
- 2c pill::Pill8, pill::Pill16
unvalidated
- 3a unvalidated::UnvalidatedStr, unvalidated::UnvalidatedUtf16
- 3b unvalidated::UnvalidatedUtf8, unvalidated::UnvalidatedUtf16
quasistr
- 4a quasistr::QuasiStr, quasistr::Quasi16
- 4b quasistr::QuasiUtf8, quasistr::QuasiUtf16
- 4c quasistr::Quasi8, quasistr::Quasi16
quasi_str
- 5a quasi_str::QuasiStr, quasi_str::Quasi16
- 5b quasi_str::QuasiUtf8, quasi_str::QuasiUtf16
- 5b quasi_str::Quasi8, quasi_str::Quasi16

Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.

robertbastian commented 2 months ago

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

robertbastian commented 2 months ago

maybe_utf::MaybeUtf8, maybe_utf::MaybeUtf16
potential_utf::PotentialUtf8, potential_utf::PotentialUtf16

sffc commented 2 months ago

Proposed ballot language:

The ICU4X Working Group would like to introduce a crate containing a type that can be infallibly converted from [u8], fallibly converted to a str, and has impls for serde, zerovec, and writeable that assume the content is UTF-8 but may write replacement characters if the bytes are ill-formed. There will be a sibling type that works on [u16]. These types are intended to be used widely within data structs, FFI, and API as necessary. The names of fields, function parameters, and function names are out of scope of this bikeshed.

sffc commented 2 months ago

We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.

I added the UTF-16 naming to the options. I don't necessarily feel the (a)s are automatically ruled out but you can vote them down in the ballot if you feel this way. There are some options I plan to be voting down.

Manishearth commented 2 months ago

I'm fine with the options so far.

echeran commented 2 months ago

When sitting down to vote and take a closer look at the options, I didn't like the choice of word / prefix in those options to indicate the uncertainty of the well-formedness of the UTF-{8, 16} encoding adhered to by the string.

Comments per option:

PotentiallyIllFormedUtf{8,16} - the substring prefix PotentiallyIllFormed is 100% accurate but long. Since there are options that are shorter and imply the same meaning well enough, I think the length is too long.
Pill - this doesn't make sense without knowing the phrase PotentiallyIllFormed and realizing that this is a shortening. Given that ICU4X has a lot of constituent crates that interact with each other, I think it's good to avoid codenames / puns / etc. if there are common words (or combinations thereof) that can describe the type/crate so that we don't create yet another bespoke language that makes it harder for new users to grok. Especially if it's for a "lower-level" type in the sense that it might occur in multiple places throughout the codebase.
Unvalidated - I do not think that this is a good descriptor because it might imply that the string needs to be validated before it can be used, but this is not the case. The "validity" is in relation to UTF-{8,16} well-formedness, and the lack of preciseness causes enough ambiguity to be undesriable.
Quasi has the same problem as in Option 3, but it's even less clear that "quasi" is in regards to the UTF well-formedness, as opposed to the validity of the string itself not being a true string (character sequence).
MaybeUtf{8,16} is not bad because it is fairly short and it gets across fairly clearly what we're talking about (the uncertainty pertains to UTF)
Potential is similar to Option 5 but a slightly longer and more unfamiliar term than Maybe when it comes to naming an identifier for a type/variable/function in code, AFAICT. I think Option 5 is strictly better than this option for that reason.

I want to offer a few more options to try to capture different angles. For example, when thinking about the type of element that we have a sequence of, we know more about it than u8 but we also can't assert that it has all of the aspects/constraints of char.

StrBytes, StrBytes16 - it concisely says "these are bytes for a string". Its existence would imply some extra criterion that couldn't be satisfied by just [u8] or String.
Raw - I've seen the adjective "raw" used as a prefix for identifiers, especially types, to indicate some initial form prior to validation or processing. Very concise.
CodeUnits - this is the Unicode terminology for u8 and u16 for a character sequence that is attempting an encoding of UTF-8 or UTF-16. The term "code unit" doesn't imply that they're all part of a well-formed UTF-{8,16} string, and the term isn't too long. The prefix Raw could be added to make that latter point especially clear.

hsivonen commented 2 months ago

@sffc asked me to comment on the naming here:

I think "potentially ill-formed" is the most correct term, but it makes for very long identifiers, so as a matter of type naming, I prefer Unvalidated and put PotentiallyIllFormed second.

CodeUnits seem technically correct, but relative to Unvalidated doesn't emphasize that the point is the potential ill-formedness.

I think we shouldn't use Pill: it's not clear as an abbreviation, and I don't want the naming to be evocative of any slang or memes involving "pill".

Maybe is suggestive of type-wise either-or in the enum sense.

Quasi isn't clear on quasi in what sense.

Raw is suggestive of RawVec and the like. I think Raw in Rust libraries doesn't suggest that the type is allowed to violate invariants but that the implementation itself does not take care of upholding the invariants that are required to be upheld.

StrBytes has the problem that it is not, in fact, necessarily the bytes of str but the whole point is potentially holding bytes that aren't OK as str.

sffc commented 2 months ago

What I hear from the discussions here is that everyone is in general agreement that PotentiallyIllFormedUtf8 correctly conveys the semantics, but people prefer something shorter. We have 15 shorter names on the table (including variants), and we haven't been able to converge on one yet.

If I may try to summarize the pros and cons:

Pill*
- Pros: Short; derived from "Potentially ill-formed"; thematic
- Cons: Too bespoke; evocative of slang terms
Unvalidated*
- Pros: Clear; suggests that the string is in a specific state; suggests it is some type of string and we have not yet validated whether it is the type of string we are expecting
- Cons: Implies that the string must first be validated, precluding algorithms designed to operate on strings in this state; primarily based on Serde's way of thinking; focuses on what the string is not rather than what it is
Quasi*
- Pros: Short; nice-sounding; dictionary definition aligns with semantic
- Cons: Unclear meaning; in science, usually means "fake, almost but not": quasicrystalline
Maybe*
- Pros: Clear; might imply a union type
- Cons: Might imply an either-or enum like an Option; there is an existing crate maybe_utf8 with different usage semantics
Potential*
- Pros: Clear; might imply statefulness
- Cons: Slightly long and unfamiliar term; might imply an either-or enum like Option, but not strongly; might imply a time component: "in the future, X has the potential to become Y"
StrBytes*
- Pros: Clear; corresponds to return value of str::as_bytes()
- Cons: Name does not suggest that the bytes might be ill-formed; UTF-16 is not composed of "bytes"
Raw*
- Pros: Short; suggests a state, similar to "unvalidated"; less opinionated
- Cons: The term "raw" is overloaded in contexts such as RawVec; might imply the wrong semantics
CodeUnits*
- Pros: Aligned with Unicode terminology
- Cons: Doesn't emphasize potential ill-formedness; in environments such as JavaScript, "code units" most commonly refers to UTF-16 code units as opposed to UTF-8 code units

I'm still searching for the best compromise with the fewest cons. Doing this writeup makes me think that Potential might be a good compromise. The only downside of "potential" that has been expressed in this thread was Elango's comment that "potential" was strictly worse than "maybe"; however, since that comment was posted, Henri noted that "maybe" implies an enum-like semantic in Rust. I also like that "potential" is a shortening of the phrase "potentially ill-formed", which we all agree on.

Can anyone verbalize any more downsides to the Potential prefix?

Also, if I failed to note any pros or cons of the options, please reply to this issue.

Manishearth commented 2 months ago

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

sffc commented 2 months ago

I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.

The proposed name is PotentialUtf8; does that change your position?

Do you feel both Potential and Maybe imply the Option semantics equally, or does one of them imply it more strongly than the other?

Manishearth commented 2 months ago

Maybe is stronger (EDIT @sffc: Maybe more strongly suggests the Option semantics)

PotentialUtf8 seems ok to me

sffc commented 2 months ago

Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.

I added this to the "pros" of unvalidated.

sffc commented 2 months ago

As stated above in https://github.com/unicode-org/icu4x/issues/3546#issuecomment-2065122439, I'd like to propose the following as a compromise.

Crate Name: potential_utf

Primary types:

pub struct PotentialUtf8(pub [u8])
pub struct PotentialUtf16(pub [u16])

Potential (🤣) expansion types to be added if needed:

pub struct PotentialCodePoint(pub [u8; 3]) // a 24-bit type
pub struct PotentialUtf16Bytes(pub ZeroSlice<u16>) // a ULE version of PotentialUtf16

Although this was not anyone's first choice, it is the only option for which a fundamental flaw was not verbalized in this thread (except for perhaps quasi, whose objections I would characterize as more a distaste for the less-precise / clever naming convention). To read up on the flaws in every other option, see my post above.

Please check the box if you have no objection to the proposal.

[ ] @hsivonen
[x] @sffc
[x] @Manishearth
[x] @echeran
[x] (@robertbastian)

echeran commented 2 months ago

I still prefer MaybeUtf{8,16} over PotentialUtf{8,16}. The word "maybe" says that it "could be X, it could also not be X", where here, X = valid UTF-{8,16} encoding form. The word "potential" has a similar meaning but also implies a time component: "in the future, X has the potential to become Y". Plus, the word "potential" is longer and less frequently used in programming identifiers in my experience.

Manishearth commented 2 months ago

Maybe has a strong connotation of being an Option in Rust.

sffc commented 2 months ago

Ok, we'll schedule a follow-up meeting. Alternatively, we could cover this in tomorrow's ICU4X-WG call assuming all the attendees are able to make it (@hsivonen @sffc @Manishearth @echeran).

sffc commented 2 months ago

I think there are basically two options:

We agree as a group that the fully spelled out PotentiallyInvalidUtf8 is fine
We go over the shorter options and choose the one whose downsides are least bad to us as a group

hsivonen commented 2 months ago

How will this relate to https://github.com/BurntSushi/bstr ?

Manishearth commented 2 months ago

@echeran is your comment a strong objection to Potential or a weak preference for Maybe?

I'm open to trying bstr instead. However it does have a non optional dependency on memchr.

sffc commented 2 months ago

My take:

@echeran's comment raises an additional downside to potential, the temporal component of it. I have added this to the list of downsides in https://github.com/unicode-org/icu4x/issues/3546#issuecomment-2065122439.
In general, in the context of the current discussion, arguments of the form "X is better than Y so we should choose X and not Y" does not help us reach a conclusion, but verbalizing pros and cons does help.
I think a factor we haven't explicitly considered is whether this is a Rust-specific abstraction or something we also use over FFI. If we name the type in FFI bindings, concerns about general intelligibility should be given a greater amount of weight relative to concerns about Rust conventions. If it is a Rust-specific abstraction, then the Rust conventions should take greater weight.
In this case, what's being proposed is a Rust crate. I'm not proposing any change to the FFI surface. Over FFI, we have abstractions that allow targets to declare whether their string types are guaranteed valid UTF-8 or not, and then the appropriate Rust API is chosen for them.
Regarding bstr: thanks for raising this! I know I had spent a fair bit of time searching crates.io for a crate that did this, but I never found bstr. I've worked with BurntSushi before. If we could collaborate to make sure all of our needs are met (making memchr an optional dependency, adding impls for Writeable, ...), then I think bstr is a really solid option.

unicode-org / icu4x

Finalize name and location of UnvalidatedX #3546