Finalize name and location of UnvalidatedX

unicode-org / icu4x

Solving i18n for client-side and resource-constrained environments.

https://icu4x.unicode.org


Finalize name and location of UnvalidatedX #3546

Open sffc opened 1 year ago

sffc commented 1 year ago

Currently we have zerovec::ule::UnvalidatedStr and zerovec::ule::UnvalidatedChar. For a while, we've been meaning to discuss a more final home and name for these types.

There's nothing really zerovec-specific about these types other than zerovec putting their use case more front and center. They are almost as useful for serde as they are for zerovec.

I'm not a huge fan of the "unvalidated" prefix; I would rather we avoid negations.

How about schrodinger::SchrödingerStr? (also re-exported without the diacritic)

Discuss with:

@Manishearth
@robertbastian
@sffc

Optional:

@echeran
@skius

Manishearth commented 2 months ago

In general, in the context of the current discussion, arguments of the form "X is better than Y so we should choose X and not Y"

Yes, strong agree. I'd like us to be talking about pros and cons, not comparing at this stage, because a comparison is not in and of itself a blocking argument.

I think a factor we haven't explicitly considered is whether this is a Rust-specific abstraction or something we also use over FFI.

You address this later, but to explicitly talk about this: We already use DiplomatStr over FFI, and that translates to something natively meaningful on the other side. Which means it's unlikely this abstraction will ever "escape" over FFI.

hsivonen commented 2 months ago

I think there are basically two options:
1. We agree as a group that the fully spelled out `PotentiallyInvalidUtf8` is fine

After thinking about this more, I'm OK with this. We have IDE autocomplete, etc., to deal with the identifier length. PotentialUtf8 works, too.

After a look at the bstr issues, it's probably better to keep this in ICU4X and not try to change bstr to fit ICU4X.

sffc commented 2 months ago

Summary of discussion with @Manishearth @echeran @sffc:

- This is a Rust type, so Rust concerns should carry additional weight
- The three adjectives "Maybe", "Potential", and "Quasi" have various dictionary definitions and usages in science, and those definitions/usages align to varying degrees with the semantics we're aiming for:
  - The statefulness of "Potential" is a pro more than a con
  - "Quasi" implies that something is fake (almost but not), which is not the correct semantic
- "Maybe" has a Rust ecosystem usage as either a safe enum (MaybeOwned) or an unsafe union (MaybeUninit) type; the union usage is roughly the semantic we're going for, but the enum usage has a greater chance of being misleading
- "Maybe" means "Option" in other ecosystems like Haskell
- "Potential" is a short version of "PotentiallyIllFormed"

sffc commented 2 months ago

Just to post this somewhere:

I think it would not be completely unreasonable or inconsistent with Rust style to introduce the following type

pub struct MaybeUtf8(pub [u8]);

impl MaybeUtf8 {
    pub unsafe fn assume_utf8(&self) -> &str { ... }
    pub fn try_to_utf8(&self) -> Result<&str, Utf8Error> { ... }
}

The parallelism of MaybeUninit::assume_init to MaybeUtf8::assume_utf8 is just not something I could close this thread without addressing first.
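
For anyone who wants to try the shape above, here is one way the elided bodies could be filled in using only the standard library. This is a sketch under assumptions, not a proposed ICU4X API: the `#[repr(transparent)]` attribute and the `from_bytes` constructor are additions that do not appear in the snippet above, and they exist only so that a reference can be built from a plain byte slice.

```rust
use std::str::{self, Utf8Error};

/// Bytes that may or may not be well-formed UTF-8.
/// `#[repr(transparent)]` is an assumption added for this sketch; it makes
/// the reference cast in `from_bytes` sound.
#[repr(transparent)]
pub struct MaybeUtf8(pub [u8]);

impl MaybeUtf8 {
    /// Hypothetical constructor (not in the snippet above): wraps a byte
    /// slice without inspecting its contents.
    pub fn from_bytes(bytes: &[u8]) -> &MaybeUtf8 {
        // SAFETY: MaybeUtf8 is a transparent wrapper around [u8], so both
        // reference types have the same layout and slice metadata.
        unsafe { &*(bytes as *const [u8] as *const MaybeUtf8) }
    }

    /// Reinterprets the bytes as `&str` without checking them.
    ///
    /// # Safety
    /// The caller must guarantee that the bytes are valid UTF-8.
    pub unsafe fn assume_utf8(&self) -> &str {
        unsafe { str::from_utf8_unchecked(&self.0) }
    }

    /// Validates the bytes and returns them as `&str` if they are UTF-8.
    pub fn try_to_utf8(&self) -> Result<&str, Utf8Error> {
        str::from_utf8(&self.0)
    }
}
```

With this sketch, `MaybeUtf8::from_bytes(b"abc").try_to_utf8()` succeeds, while wrapping the bytes `[0xff, 0xfe]` yields an `Err(Utf8Error)`.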

Manishearth commented 2 months ago

I think my main gripe with that is still the usage pattern: the point of this type is actually that you can often use it without ever having to deal with validating or assuming UTF-8; it's quite useful without those two. Because of that, it feels very different from the stdlib unsafe helpers, which are more enum-like and whose usage pattern is extremely stateful.
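
To make the "useful without validating" point concrete, here is a rough sketch of a sorted lookup table keyed by possibly-invalid bytes. The type name `PotentialUtf8`, the `from_bytes` constructor, and the `lookup` helper are all hypothetical and chosen only for illustration. Keys are compared bytewise, so neither the stored keys nor the probe ever go through a validate or assume step.

```rust
/// Hypothetical wrapper over bytes that are expected, but not guaranteed,
/// to be UTF-8. The derived comparisons operate on the raw bytes.
#[repr(transparent)]
#[derive(PartialEq, Eq, PartialOrd, Ord)]
pub struct PotentialUtf8([u8]);

impl PotentialUtf8 {
    /// Wraps a byte slice without inspecting its contents.
    pub fn from_bytes(bytes: &[u8]) -> &PotentialUtf8 {
        // SAFETY: PotentialUtf8 is a transparent wrapper around [u8], so
        // both reference types have the same layout and slice metadata.
        unsafe { &*(bytes as *const [u8] as *const PotentialUtf8) }
    }
}

/// Binary-searches a table sorted by key bytes. Nothing here validates
/// UTF-8: the `&str` probe is viewed as bytes and compared directly.
pub fn lookup<'a>(table: &[(&PotentialUtf8, &'a str)], key: &str) -> Option<&'a str> {
    let key = PotentialUtf8::from_bytes(key.as_bytes());
    table
        .binary_search_by(|(k, _)| (*k).cmp(key))
        .ok()
        .map(|i| table[i].1)
}
```

A table in this style can even contain a key that is not valid UTF-8 (for example the single byte `0xff`): it simply never matches any `&str` probe, and lookups for the other keys are unaffected.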