rust-lang / libs-team

The home of the library team
Apache License 2.0
115 stars 18 forks source link

ACP: Add `FromByteStr` trait with blanket impl `FromStr` #287

Open tgross35 opened 10 months ago

tgross35 commented 10 months ago

Proposal

Problem statement

Many data forms that can be parsed from a string representation do not need UTF-8. Here, FromStr is unnecessarily restrictive because a byte slice &[u8] cannot be parsed directly. Instead, a UTF-8 &str must be constructed to use .parse().

This is inconvenient when working with any raw buffers where one cannot assume that str::from_utf8 will be successful, nor is there any reason to incur the UTF-8 verification overhead. An example is IP addresses, for which there is an unstable from_bytes function: https://github.com/rust-lang/rust/issues/101035

Motivating examples or use cases

Any input data where UTF-8 cannot be guaranteed: stdin, file paths, data from Read, network packets, no_std without UTF tables, any data read one byte at a time, etc.

Any output data that doesn't require specific knowledge of UTF-8: integers, floating point, IP/socket addresses, MAC addresses, UUIDs, etc.

Solution sketch

Add a trait that mirrors FromStr but works with &[u8] byte slices:

// Likely located in core::slice
pub trait FromByteStr: Sized {
    type Err;

    // Required method
    fn from_byte_str(bytes: &[u8]) -> Result<Self, Self::Err>;
}

This will get a corresponding parse on &[u8]

impl<[u8]> {
    pub fn parse<F>(&self) -> Result<F, <F as FromByteStr>::Err>
        where F: FromByteStr
    { /* ... */ }
}

Since &str is necessarily represented as &[u8], we can provide a blanket impl so no types need to be duplicated:

impl<T> FromStr for T where T: FromByteStr {
    type Err = T::Err;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        s.as_bytes().parse()
    }
}

If this is done, almost all types in std that implement FromStr will be able to switch to a FromByteStr implementation:

Alternatives

Open Questions

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

tgross35 commented 10 months ago

cc original reviewer @joshtriplett

zulip discussion: https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/addr_parse_ascii.20feature

thomcc commented 10 months ago

FromBytes feels like it might be too general of a name, it implies binary deserialization rather than textual parsing to me.

tgross35 commented 10 months ago

Agreed after rereading it. I had FromByteStr in alternatives and just added FromAscii, do either of those sound like a better fit?

BurntSushi commented 10 months ago

The lack of parsing working on &[u8] has annoyed me on several occasions over the years. This design looks plausible to me.

My inclination is to approve this, but since it was just opened, I figure I'll give it a little time to bake to give others a chance to chime in.

shepmaster commented 10 months ago

One complaint I’ve heard about FromStr is that the result cannot reference the input. That is, you can’t do &'a str -> MyType<'a>. Is that worth changing here?

tgross35 commented 10 months ago

That is an interesting thought. Do you mean a signature like this?

trait FromByteStr<'a>: Sized {
    type Err;
    fn from_byte_str(bytes: &'a [u8]) -> Result<Self, Self::Err>;
}

struct Foo<'a>(&'a [u8]);
impl<'a> FromByteStr<'a> for Foo<'a> {
    type Err = ();
    fn from_byte_str(bytes: &'a [u8]) -> Result<Self, Self::Err> {
        Ok(Self(bytes))
    }
}

impl FromByteStr<'_> for i32 {
    type Err = std::num::ParseIntError;
    fn from_byte_str(bytes: &[u8]) -> Result<Self, Self::Err> {
        std::str::from_utf8(bytes).unwrap().parse()
    }
}

// ...I think static is correct here?
impl<T: FromByteStr<'static>> FromStr for T {
    type Err = T::Err;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        todo!()
    }
}

That seems reasonable to me if we are okay with it not being a 1:1 matchup with FromStr. Which I think is probably okay, the biggest thing IMO is we need to be able to provide a FromStr blanket impl (I think we can with the above). Maybe also nice to use Error instead of Err to be consistent with everything else, but that's a bit of a bike shed....

I think that Err could also be made to optionally depend on 'a but I'm not sure what that signature would look like

scottmcm commented 10 months ago

I like the naming direction of "from byte str", since it helps emphasize it's like b"127.0.0.1", rather than like [0x7f_u8, 0, 0, 1]. Anything that says "bytes" makes me think from_ne_bytes and friends instead.

tgross35 commented 10 months ago

I updated the proposal to FromByteStr::from_byte_str, I agree that seems like the most unambiguous option.

It seems like everyone is either neutral or in favor, so I'll make a PR for this in the next week just to see how well everything fits together. @shepmaster's question is probably the biggest thing to answer.

joshtriplett commented 9 months ago

cc @epage ; would this work for clap, as a basis for parsing user types without going through strings?

joshtriplett commented 9 months ago

We discussed this in today's @rust-lang/libs-api meeting. We think it's important to distinguish between two possible interpretations of this:

  1. As purely an optimization, where the string is expected to be "unvalidated but expected to be valid UTF-8", and this trait just avoids doing the work of validating UTF-8 when that isn't needed.
  2. As a trait for parsing things from byte strings that aren't necessarily UTF-8. Case study: impl FromByteStr for std::os::unix::net::SocketAddr, which could accept an arbitrary file path.

We do still need to make it clear that this is parsing text, not binary; IpAddr is a good example of that, as @scottmcm noted. Parsing file paths or similar would be acceptable in case 2 above but not in case 1, but parsing binary would always be a misuse.

Please consider those two cases, document the methods accordingly, and we'll re-evaluate it accordingly. Thank you.

joshtriplett commented 9 months ago

Speaking for myself here (not the team consensus): I think case 2 is something that there's pent-up demand for in the ecosystem, and it seems unlikely that we'd be able to enforce exclusively case 1. I think there's value in providing and embracing case 2.

epage commented 9 months ago

@joshtriplett maybe?

clap would be working with &OsStr. We'd used OsStr::as_encoded_bytes with FromByteStr. If the FromBytesStr impl wants to interpret the non-UTF8 parts, the only tools they have are

If someone made impl FromByteStr for &OsStr (or added &OsStr::from_encoded_bytes), then it would likely be more usable for clap. When we added the encoded_bytes API, that was deferred out.

clarfonthey commented 9 months ago

I'm mostly against this because I think that something better should be designed than FromStr. Basically:

  1. TryFrom covers a lot of the cases where people just want a falliable conversion from a string, and it works with lifetimes in rather nice ways (both the error and the result can depend on the lifetime of the input).
  2. Some things would like to distinguish "conversion from string" and "parsing from string" in a way that would affect things. For example, a JSON value could be converted from a string (think: From or TryFrom) by creating a JSON string, but it could be parsed from a string which would yield a different result.
  3. Single-string parsing is a relatively inflexible case that falls apart even in simple cases. For example, without the internals of the parser for IpAddr, for example, it's difficult to implement a parser for SocketAddr without hacks.

Overall, this motivates me to conclude that a more generic parsing trait would be a fit replacement for FromStr, rather than simply adding in extra versions of TryFrom that perform particular parsing concerns.

At one point I was trying to experiment with what a useful API would look like for this, and this is about as far as I got. It's not complete or battle-tested, which is why I haven't submitted any ACPs or RFCs yet, but it's probably a decent example of why I think this approach as-is isn't very desirable.

BurntSushi commented 9 months ago

@clarfonthey I'm not sure that direction is the right path for std. That's a ton of infrastructure and surface area. (I realize what you have is a draft, so I'm speaking in broad strokes here.) I think there's room for a simpler solution that gets us 80% of the way there in std.

clarfonthey commented 9 months ago

What I have is 100% a draft, and that's the main reason why I haven't shared it until now. ;)

Honestly, the only reason I share it at all is because I want to demonstrate that an alternative is possible, it just needs some extra work.

I mostly share what I have to demonstrate the kind of direction I was going in. You're right that the goal should be simple, and that's why I think the current version I have is insufficient.

However, I think that there are a two simple operations which should be possible under a theoretical final version:

  1. Continuing on insufficient input by providing more input
  2. Stopping instead of erroring on extraneous input (and letting you continue where you left off)

Neither of these is a particularly big ask for any particular implementation, and they don't provide the ability to create complex parser combinators on their own. However, they're both vital to a proper parser implementation, and I don't think that anyone writing their own parsers should have to reimplement libstd parsers (for things like IP addresses) or hack in weird cases just because the API isn't really compatible.

To counter the API surface: I wouldn't consider these much more complex than iterators: the potential API surface is large, but it doesn't have to be. I think that an ideal API would be one that offers a relatively simple trait that's implemented by libstd types, but leaves out all the extra adapters and fancy methods to ecosystem crates.

clarfonthey commented 9 months ago

I decided to actually write up #296 to explain my design process behind the example I shared, particularly what features I felt would be useful to have from a potential design and why it was designed the way it was. If you follow the motivation, you'll get something very similar to what I designed, with a few potential changes.

It mostly runs counter to this ACP since it explicitly explains why I don't think that adding differently-named copies of TryFrom is a good solution, and what the solution should look like. But, it's worth knowing that said solution will inevitably be much more complicated than this solution, and that might be a reason to look in this direction after all.

Ultimately, hopefully it'll help us decide what we finally want to do with FromStr, since I truly believe that it should be deprecated either way. (Unless you can add GATs retroactively, which doesn't feel like a thing.)

epage commented 9 months ago

While I'm unsure if I personally have a use case for FromByteStr, I feel like it has a distinct role from TryFrom, just like FromStr and it offers a small, well confined API that is fairly easy to stabilize.

With #296, you are effectively proposing taking on the core of a full, general purpose parser into the standard library. I feel like a case would need to be made for why this is important vs third-party, why a specific design is picked (especially a new design) compared to the different third-party parser trait designs, etc. As an outsider to the libs team, I assume this would need to be vetted in a third-party package and likely collaborated on with parser maintainers. This feels like a completely different beast than FromByteStr in what it is trying to solve and has a long road to stabilization that I don't see it justifiable to block smaller forms of progress on it.

clarfonthey commented 9 months ago

The thing is, what exactly does FromByteStr gain exactly by being added to the standard library? It's very similar to TryFrom, and it's unclear why this trait particularly helps the standard library over something offered by another crate.

Speaking of which… I believe that most things that implement FromStr don't even have standalone methods that allow parsing from byte strings, which probably should be rectified. For example, from_str_radix exists but from_byte_str_radix doesn't.

Standalone methods for IP addresses and such would also be nice.