Open sffc opened 1 year ago
I would suggest reading through the "Personal life" section of the Wikipedia article and the references therein before deciding whether to name more stuff after Erwin Schrödinger.
Thanks for pointing this out, TIL 🤢
I would have also been opposed to the Schrödinger name on the grounds that it's too clever and requires thinking around three corners or knowing what the crate does already. I'd rather use something straightforward like serde_maybe::MaybeStr; however, serde-maybe is already taken, unfortunately.
Has removing the layer of abstraction and using the raw bytes underneath directly already been discussed? AIUI, we use UnvalidatedX in cases where we just need something comparable to use as a key in a map, or where the validation cost only needs to be paid conditionally - map.get("my_key".to_raw()) instead of .to_unvalidated() seems fine for the former, and the latter could follow Rust APIs like https://doc.rust-lang.org/stable/std/str/fn.from_utf8.html# ?
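A minimal sketch of the raw-bytes alternative raised in the comment above, using only the standard library. The `lookup` and `validate` helpers and the map's value type are hypothetical, and the comment's `to_raw()` is rendered here as `str::as_bytes()`, which is an assumption about what was meant.

```rust
use std::collections::BTreeMap;
use std::str::{from_utf8, Utf8Error};

/// Hypothetical helper: look a key up by its raw bytes; no UTF-8
/// validation is needed just to compare keys.
pub fn lookup(map: &BTreeMap<&[u8], u32>, key: &str) -> Option<u32> {
    map.get(key.as_bytes()).copied()
}

/// Hypothetical helper: pay the validation cost only when a `&str`
/// is actually required, in the style of `std::str::from_utf8`.
pub fn validate(bytes: &[u8]) -> Result<&str, Utf8Error> {
    from_utf8(bytes)
}
```

What this loses relative to a dedicated type is a place to hang serde, zerovec, and similar impls, which is the background given in the linked issue.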
Previous discussion and background why we want the extra layer: https://github.com/unicode-org/icu4x/issues/2489
Two main reasons for the abstraction:
The current semantics of UnvalidatedStr are:
- Serialization: bytes in binary, string or error in human-readable
- Deserialization: bytes in binary, string in human-readable
This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.
A name I've been using on my branch is BytesOrStr (could also be StrOrBytes). In that pattern, just prepend "BytesOr" or append "OrBytes" to the validated type name.
BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.
I agree with @robertbastian that the "Or" is too inclusive here. But that does make me think, StrAsBytes/BytesStr?
Otherwise maybe some non-negated antonyms of "trusted": QuestionableStr, ShadyStr?
Haha I love ShadyStr. SketchyStr could be another option.
My problem with these alternatives is that (imo) they either require more thinking than UnvalidatedStr, e.g., SketchyStr, or they are somewhat ambiguous, e.g., MaybeStr: to me this sounds like a type that supports the value not being a Str, like the Haskell Maybe (Rust's Option). Or is there some precedent for using Maybe to mean Unvalidated?
I kinda like the Unvalidated prefix and don't feel bad about the negation. Could be DeferredStr instead (serdefer makes for a cute crate name in that case). Or LaterStr if you want to use a simpler word.
I do like the zerovec crate being nice and clean, but I do admit that I'm also not too happy about proliferating more util crates. No strong opinion there.
I agree with @Manishearth about the naming: UnvalidatedStr perfectly describes what the type is. If we have to avoid the negation, I like LaterStr the most out of all alternatives discussed in this thread.
BytesOrStr sounds like it's an enum and that having bytes is actually fine, when in fact it's an error/gigo case.
This was a progression from my suggestion that it may be useful to rethink the semantics:
The current semantics of UnvalidatedStr are:
- Serialization: bytes in binary, string or error in human-readable
- Deserialization: bytes in binary, string in human-readable
This is somewhat handy because it means we can rely on an UnvalidatedStr to be the key of a map (or else hit a serializer error). But, another useful semantic for data that is expected to be either string or bytes would be to serialize as a byte sequence in human-readable if the bytes are not ASCII.
In practice, there's very little difference between the functionality of UnvalidatedStr and BytesOrStr except for some edge cases around serialization. So why not just embrace the other, more flexible model?
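To make the edge case concrete, here is a rough sketch of the two serialization models described above. It deliberately does not use serde: the `Encoded` enum is a hypothetical stand-in for whatever a serializer would emit, and both function names are invented for illustration.

```rust
use std::str::from_utf8;

/// Hypothetical stand-in for a serializer's output.
#[derive(Debug, PartialEq)]
pub enum Encoded<'a> {
    Bytes(&'a [u8]),
    Str(&'a str),
}

/// Current UnvalidatedStr model: bytes in binary; string or error
/// in human-readable formats.
pub fn serialize_unvalidated(bytes: &[u8], human_readable: bool) -> Result<Encoded<'_>, String> {
    if !human_readable {
        return Ok(Encoded::Bytes(bytes));
    }
    from_utf8(bytes)
        .map(Encoded::Str)
        .map_err(|e| format!("not valid UTF-8: {}", e))
}

/// Alternative BytesOrStr model: in human-readable formats, fall back
/// to a byte sequence when the bytes are not ASCII instead of erroring.
pub fn serialize_bytes_or_str(bytes: &[u8], human_readable: bool) -> Encoded<'_> {
    match from_utf8(bytes) {
        Ok(s) if human_readable && bytes.is_ascii() => Encoded::Str(s),
        _ => Encoded::Bytes(bytes),
    }
}
```

The two functions differ only on ill-formed or non-ASCII input in human-readable mode: the first can fail, while the second always succeeds but may emit bytes where a map key was expected to be a string.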
The unvalidated name is available. We could use unvalidated::Str and unvalidated::Char
Not opposed.
I still like SchrödingerStr modulo the issues with the namesake historical figure. Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?
Should we consider a name like PotentiallyIllFormedUtf8?
Is there a semantic difference?
Can't we come up with some other "clever" term a la elsa::FrozenMap, yoke::Yokeable, ... ?
I really really dislike all the "clever" names.
In this case I can't immediately think of anything, and unvalidated is available and accurate; we should just use that imo.
I am not very happy with that proposal because:
- It starts with a negating prefix like un or dis or non. I prefer to start with a positive term.
- unvalidated::Str violates our style guide naming. We don't want to import unvalidated::Str and use Str at call sites; that is super confusing. We would need to either import it as unvalidated::Str as UnvalidatedStr or export unvalidated::UnvalidatedStr.
Bikeshed, with some help from gemini:
RawStr
DeferredStr
QuasiStr
ChameleonStr
ProspectStr
OptimisticStr
PotentialStr
If we don't like any of those, how about choosing a letter of the alphabet, like
WStr
XStr
YStr
ZStr
Most developers don't need to deeply understand why this type exists. They will most often see it as the key of a map. ZeroMap<XStr, PatternULE> seems fairly clear: the key is a string that is potentially a bit weird, and the value is a ULE pattern type.
I'm fine with unvalidated::UnvalidatedStr.
I don't really think any of the other names work well here. I think the un is warranted here.
What did you all think of PotentiallyIllFormedUtf8?
"Potentially ill-formed UTF-8" is an established concept in the industry. Actually we already use it in icu_segmenter.
The type name is a bit long, so we could do (bikeshed):
potentially_ill_formed::PifStr
potentially_ill_formed::PillStr
potentially_ill_formed::PIllStr
I also observe that the crate name pill is available. How about pill::PillStr? 😃
"Pill" is aligned with our hospital theme of "ICU"
A potential advantage of this approach is that we could start having APIs such as
pub fn process(s: &str)
pub fn process_utf8(s: &PillStr)
pub fn process_utf16(s: &PillUtf16)
What did you all think of PotentiallyIllFormedUtf8
Not strongly opposed, feels long. Still prefer the current Unvalidated naming.
Not a fan of pill. I think it's okayish.
Discussion Conclusions:
- serde and zerovec, with feature flags.
- _utf8 and _utf16 APIs that accept potentially invalid stuff
LGTM: @manishearth, @sffc, @robertbastian
Name conclusion: bikeshed later (Manish and Rob prefer Unvalidated), also need to bikeshed _utf8 suffix
The crate should also have an optional dependency on writeable as discussed in https://github.com/unicode-org/icu4x/pull/4786#discussion_r1559627673
I really want to start the bikeshed for this. Let's focus just on the dynamically sized type that can be infallibly converted from a byte sequence, fallibly converted to a str, and includes impls for things like serde, zerovec, and writeable.
potentially_ill_formed
- (a) potentially_ill_formed::PotentiallyIllFormedStr, potentially_ill_formed::PotentiallyIllFormedUtf16
- (b) potentially_ill_formed::PotentiallyIllFormedUtf8, potentially_ill_formed::PotentiallyIllFormedUtf16
pill
- (a) pill::PillStr, pill::Pill16
- (b) pill::PillUtf8, pill::PillUtf16
- (c) pill::Pill8, pill::Pill16
unvalidated
- (a) unvalidated::UnvalidatedStr, unvalidated::UnvalidatedUtf16
- (b) unvalidated::UnvalidatedUtf8, unvalidated::UnvalidatedUtf16
quasistr
- (a) quasistr::QuasiStr, quasistr::Quasi16
- (b) quasistr::QuasiUtf8, quasistr::QuasiUtf16
- (c) quasistr::Quasi8, quasistr::Quasi16
quasi_str
- (a) quasi_str::QuasiStr, quasi_str::Quasi16
- (b) quasi_str::QuasiUtf8, quasi_str::QuasiUtf16
- (c) quasi_str::Quasi8, quasi_str::Quasi16
Any more suggestions before I send out a ballot? I will distribute it after this week's ICU4X-WG call.
We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.
maybe_utf::MaybeUtf8, maybe_utf::MaybeUtf16
potential_utf::PotentialUtf8, potential_utf::PotentialUtf16
Proposed ballot language:
The ICU4X Working Group would like to introduce a crate containing a type that can be infallibly converted from [u8], fallibly converted to a str, and has impls for serde, zerovec, and writeable that assume the content is UTF-8 but may write replacement characters if the bytes are ill-formed. There will be a sibling type that works on [u16]. These types are intended to be used widely within data structs, FFI, and API as necessary. The names of fields, function parameters, and function names are out of scope of this bikeshed.
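The "may write replacement characters" behavior in the ballot language can be illustrated with the standard library alone. This is only a sketch of the idea, not the crate's writeable impl; the two function names are hypothetical.

```rust
use std::borrow::Cow;
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical helper: treat the bytes as UTF-8, replacing each
/// ill-formed sequence with U+FFFD.
pub fn write_utf8_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// Hypothetical helper for the [u16] sibling: unpaired surrogates
/// become U+FFFD.
pub fn write_utf16_lossy(units: &[u16]) -> String {
    decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
        .collect()
}
```

Both functions are infallible, which matches the "assume the content is UTF-8" wording: ill-formed input degrades to replacement characters rather than producing an error.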
We also need to find a name for the UTF-16 type, which I'd say rules out all the (a)s.
I added the UTF-16 naming to the options. I don't necessarily feel the (a)s are automatically ruled out but you can vote them down in the ballot if you feel this way. There are some options I plan to be voting down.
I'm fine with the options so far.
When sitting down to vote and take a closer look at the options, I didn't like the choice of word / prefix in those options to indicate the uncertainty of the well-formedness of the UTF-{8, 16} encoding adhered to by the string.
Comments per option:
1. PotentiallyIllFormedUtf{8,16} - the prefix PotentiallyIllFormed is 100% accurate but long. Since there are options that are shorter and imply the same meaning well enough, I think it is too long.
2. Pill - this doesn't make sense without knowing the phrase PotentiallyIllFormed and realizing that this is a shortening. Given that ICU4X has a lot of constituent crates that interact with each other, I think it's good to avoid codenames / puns / etc. if there are common words (or combinations thereof) that can describe the type/crate so that we don't create yet another bespoke language that makes it harder for new users to grok. Especially if it's for a "lower-level" type in the sense that it might occur in multiple places throughout the codebase.
3. Unvalidated - I do not think that this is a good descriptor because it might imply that the string needs to be validated before it can be used, but this is not the case. The "validity" is in relation to UTF-{8,16} well-formedness, and the lack of precision causes enough ambiguity to be undesirable.
4. Quasi has the same problem as in Option 3, but it's even less clear that "quasi" is in regards to the UTF well-formedness, as opposed to the validity of the string itself not being a true string (character sequence).
5. MaybeUtf{8,16} is not bad because it is fairly short and it gets across fairly clearly what we're talking about (the uncertainty pertains to UTF).
6. Potential is similar to Option 5 but a slightly longer and more unfamiliar term than Maybe when it comes to naming an identifier for a type/variable/function in code, AFAICT. I think Option 5 is strictly better than this option for that reason.
I want to offer a few more options to try to capture different angles. For example, when thinking about the type of element that we have a sequence of, we know more about it than u8 but we also can't assert that it has all of the aspects/constraints of char.
- StrBytes, StrBytes16 - it concisely says "these are bytes for a string". Its existence would imply some extra criterion that couldn't be satisfied by just [u8] or String.
- Raw - I've seen the adjective "raw" used as a prefix for identifiers, especially types, to indicate some initial form prior to validation or processing. Very concise.
- CodeUnits - this is the Unicode terminology for u8 and u16 for a character sequence that is attempting an encoding of UTF-8 or UTF-16. The term "code unit" doesn't imply that they're all part of a well-formed UTF-{8,16} string, and the term isn't too long. The prefix Raw could be added to make that latter point especially clear.
@sffc asked me to comment on the naming here:
I think "potentially ill-formed" is the most correct term, but it makes for very long identifiers, so as a matter of type naming, I prefer Unvalidated and put PotentiallyIllFormed second.
CodeUnits seems technically correct, but relative to Unvalidated it doesn't emphasize that the point is the potential ill-formedness.
I think we shouldn't use Pill: it's not clear as an abbreviation, and I don't want the naming to be evocative of any slang or memes involving "pill".
Maybe is suggestive of type-wise either-or in the enum sense.
Quasi isn't clear on quasi in what sense.
Raw is suggestive of RawVec and the like. I think Raw in Rust libraries doesn't suggest that the type is allowed to violate invariants but that the implementation itself does not take care of upholding the invariants that are required to be upheld.
StrBytes has the problem that it is not, in fact, necessarily the bytes of str; the whole point is potentially holding bytes that aren't OK as str.
What I hear from the discussions here is that everyone is in general agreement that PotentiallyIllFormedUtf8 correctly conveys the semantics, but people prefer something shorter. We have 15 shorter names on the table (including variants), and we haven't been able to converge on one yet.
If I may try to summarize the pros and cons:
- Pill*
- Unvalidated*
- Quasi*
- Maybe*: maybe_utf8 with different usage semantics
- Potential*
- StrBytes*: str::as_bytes()
- Raw*: RawVec; might imply the wrong semantics
- CodeUnits*
I'm still searching for the best compromise with the fewest cons. Doing this writeup makes me think that Potential might be a good compromise. The only downside of "potential" that has been expressed in this thread was Elango's comment that "potential" was strictly worse than "maybe"; however, since that comment was posted, Henri noted that "maybe" implies an enum-like semantic in Rust. I also like that "potential" is a shortening of the phrase "potentially ill-formed", which we all agree on.
Can anyone verbalize any more downsides to the Potential prefix?
Also, if I failed to note any pros or cons of the options, please reply to this issue.
I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.
Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.
I think the thing is, it still is a string, it's not "potentially not a string". PotentialStr makes me think it's an Option or something.
The proposed name is PotentialUtf8; does that change your position?
Do you feel both Potential and Maybe imply the Option semantics equally, or does one of them imply it more strongly than the other?
Maybe is stronger (EDIT @sffc: Maybe more strongly suggests the Option semantics)
PotentialUtf8 seems ok to me
Unvalidated matches my mental model well because it is some type of string and we have not yet validated whether it is the type of string we like.
I added this to the "pros" of unvalidated.
As stated above in https://github.com/unicode-org/icu4x/issues/3546#issuecomment-2065122439, I'd like to propose the following as a compromise.
Crate Name: potential_utf
Primary types:
pub struct PotentialUtf8(pub [u8])
pub struct PotentialUtf16(pub [u16])
Potential (🤣) expansion types to be added if needed:
pub struct PotentialCodePoint(pub [u8; 3]) // a 24-bit type
pub struct PotentialUtf16Bytes(pub ZeroSlice<u16>) // a ULE version of PotentialUtf16
Although this was not anyone's first choice, it is the only option for which a fundamental flaw was not verbalized in this thread (except for perhaps quasi, whose objections I would characterize as more a distaste for the less-precise / clever naming convention). To read up on the flaws in every other option, see my post above.
Please check the box if you have no objection to the proposal.
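For readers unfamiliar with dynamically sized newtypes, here is a minimal sketch of how the proposed primary type could support the infallible-from-bytes, fallible-to-str conversions described earlier in the thread. The struct shape is taken from the proposal; the `#[repr(transparent)]` attribute and the method names `from_bytes` and `try_as_str` are assumptions made for illustration, not the crate's confirmed API.

```rust
use std::str::{from_utf8, Utf8Error};

/// Shape taken from the proposal; `repr(transparent)` is assumed so
/// that the pointer cast below is sound.
#[repr(transparent)]
pub struct PotentialUtf8(pub [u8]);

impl PotentialUtf8 {
    /// Infallible: every byte slice is a valid `PotentialUtf8`.
    /// (Hypothetical method name.)
    pub fn from_bytes(bytes: &[u8]) -> &Self {
        // SAFETY: `PotentialUtf8` is `repr(transparent)` over `[u8]`,
        // so both reference types have the same layout and metadata.
        unsafe { &*(bytes as *const [u8] as *const Self) }
    }

    /// Fallible: validation is deferred until a `&str` is requested.
    /// (Hypothetical method name.)
    pub fn try_as_str(&self) -> Result<&str, Utf8Error> {
        from_utf8(&self.0)
    }
}
```

Because the wrapper is unsized, it is only ever used behind a reference or other pointer, much like `str` itself.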
I still prefer MaybeUtf{8,16} over PotentialUtf{8,16}. The word "maybe" says that it "could be X, it could also not be X", where here, X = valid UTF-{8,16} encoding form. The word "potential" has a similar meaning but also implies a time component: "in the future, X has the potential to become Y". Plus, the word "potential" is longer and less frequently used in programming identifiers in my experience.
Maybe has a strong connotation of being an Option in Rust.
Ok, we'll schedule a follow-up meeting. Alternatively, we could cover this in tomorrow's ICU4X-WG call assuming all the attendees are able to make it (@hsivonen @sffc @Manishearth @echeran).
I think there are basically two options:
- PotentiallyInvalidUtf8 is fine
How will this relate to https://github.com/BurntSushi/bstr ?
@echeran is your comment a strong objection to Potential or a weak preference for Maybe?
I'm open to trying bstr instead. However, it does have a non-optional dependency on memchr.
My take:
- potential, the temporal component of it. I have added this to the list of downsides in https://github.com/unicode-org/icu4x/issues/3546#issuecomment-2065122439.
- bstr: thanks for raising this! I know I had spent a fair bit of time searching crates.io for a crate that did this, but I never found bstr. I've worked with BurntSushi before. If we could collaborate to make sure all of our needs are met (making memchr an optional dependency, adding impls for Writeable, ...), then I think bstr is a really solid option.
Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.
There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.
I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.
How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)
Discuss with:
Optional: