
ACP: Designing an alternative `FromStr` #296

Open clarfonthey opened 9 months ago

clarfonthey commented 9 months ago

Proposal

Note: this is a proposed alternative to #287 and probably other proposals before it. The solution is not complete, but it describes the problem in a way that can be agreed upon so that future experiments can be better specified. Basically, we have to agree on the shape of a solution before we can agree on the actual solution.

Problem statement

Right now, the mechanism for "parse this thing from a string" is FromStr, which has the following issues, among others:

  1. Has a nebulous relationship with TryFrom and From (when should TryFrom, From, and FromStr differ?)
  2. Cannot parse non-str content (most common case is byte strings)

Motivating examples or use cases

The principal example of the separation between FromStr and TryFrom is JSON objects, like serde_json::Value. A string can be represented in JSON, where the conversion is usually just the Value::String constructor. However, strings can also be parsed into JSON, representing other values entirely.

From and TryFrom are the former kind of conversion, and while there are some unanswered questions for things like truncating integers, etc., those are orthogonal to this ACP. The main discussion here is how we can offer a mechanism that parses objects from strings.

Since this mechanism is different from TryFrom, it ideally should not take the same form as TryFrom. Ideally, parsing will also be flexible enough that the relatively simple parsers for things like integers, IP addresses, etc., can be reused in other places.

There are two main problems I'd like the solution to address. They are worth discussing explicitly, since they do complicate the shape of the solution.

Extraneous Input

Parsers should be able to handle extraneous input from a string. For example, I should be able to pass in a string and get out a parsed number and the remainder of the string past the parsed number.

However, this also means that we need an additional mode for the usual behaviour, where parsing fails if the remaining string is nonempty.

For example, take the string "1234abcd". In this example, we could parse the number (in decimal) 1234 and then return "abcd", since none of its characters are valid decimal digits. If we wanted the entire string to be a number, we could fail based upon the leftover "abcd", noting that there are extra characters after the number.

Insufficient Input

Parsers should be able to handle insufficient input from a string. Combined with the above, we get the ability to do proper, buffered parsing of values without the need to split on any known terminator like \n. If we pass in a string to the parser and there is no extraneous input, we should be able to continue passing in strings until either input is exhausted or something is invalid, which tells us that we're finished.

For example, if the strings "1234", "5678", and "90ab" get parsed in, we should end up with the number 1234567890 and the extra input "ab". The strings don't have to be the same length, but they are to demonstrate the buffering example.

This requires being able to distinguish, when the extraneous input is an empty string, whether the parser could still accept more input. For example, if trying to parse a u8, the string "55" could accept more input and remain a valid u8, whereas "255" could not, since any further digit would overflow. A simple implementation could ignore all of this "overflow foresight" and instead just keep accepting digits and stop on non-digits, although this is worth keeping in mind.
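
To make both behaviours concrete, here is a minimal, trait-free sketch of the buffering example above: a hand-rolled accumulator that consumes decimal digits chunk by chunk and hands back whatever it did not consume. All names are invented for illustration, and a real parser would report overflow rather than saturate.

struct DigitAccumulator {
    value: u64,
    saw_digit: bool,
}

impl DigitAccumulator {
    fn new() -> Self {
        DigitAccumulator { value: 0, saw_digit: false }
    }

    /// Consume leading decimal digits from `chunk` and return the unconsumed
    /// tail. An empty tail means "this parser could still accept more input".
    fn push<'a>(&mut self, chunk: &'a str) -> &'a str {
        for (i, c) in chunk.char_indices() {
            match c.to_digit(10) {
                Some(d) => {
                    self.saw_digit = true;
                    // Illustration only: saturate instead of reporting overflow.
                    self.value = self.value.saturating_mul(10).saturating_add(d as u64);
                }
                None => return &chunk[i..],
            }
        }
        ""
    }
}

fn main() {
    let mut acc = DigitAccumulator::new();
    let mut extra = "";
    for chunk in ["1234", "5678", "90ab"] {
        extra = acc.push(chunk);
        if !extra.is_empty() {
            break;
        }
    }
    assert!(acc.saw_digit);
    assert_eq!(acc.value, 1234567890);
    assert_eq!(extra, "ab");
}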

Solution sketch

Note: a current (working) solution exists in this repo, but it's explicitly not the solution I'm going to propose here. I'm just sharing it for completeness to indicate what I've tried to accomplish for this in the past.

Parsers

To solve the insufficient input issue, we would have to separate the notion of parsing from a parser, since we need some intermediate value that can accept a series of strings. This is the trade-off we get if we choose to allow insufficient-input parsing, but I personally believe it's worthwhile. In the example shown, I distinguished them using a BuildParser and Parse trait similar to BuildHasher and Hash, but any possible design is valid.

This means that, instead of having a method like from_str_radix on integers, we'd have a with_radix method that accepts a radix and returns a parser. Similar to BuildHasher's hash_one method, we could also have a parse_one method on BuildParser, and from_str_radix could be considered a shorthand for this.
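
As a rough sketch of that split (not the design from the linked example, which uses Poll, and not a concrete proposal), the two traits might look like the following. Every name here (BuildParser, Parse, push, finish, parse_one, ExtraneousInput) is a placeholder, and the push signature anticipates the "Accepting Input" discussion below.

/// The stateful half: accepts successive chunks of input.
pub trait Parse {
    type Output;
    type Error;

    /// Feed one chunk. `Ok(Some((value, rest)))` means a value was produced and
    /// `rest` is the unconsumed tail of this chunk; `Ok(None)` means the parser
    /// wants (or can still accept) more input.
    fn push<'a>(&mut self, chunk: &'a str)
        -> Result<Option<(Self::Output, &'a str)>, Self::Error>;

    /// Signal end of input and extract a value, if one can be produced.
    fn finish(self) -> Result<Self::Output, Self::Error>
    where
        Self: Sized;
}

/// The factory half: builds fresh parsers, so a hypothetical
/// `u32::with_radix(16)` could return an implementor of this trait.
pub trait BuildParser {
    type Parser: Parse;

    fn build_parser(&self) -> Self::Parser;

    /// `FromStr`-style convenience, analogous to `BuildHasher::hash_one`:
    /// parse exactly one value from one complete string, rejecting leftovers.
    fn parse_one(&self, input: &str)
        -> Result<<Self::Parser as Parse>::Output, <Self::Parser as Parse>::Error>
    where
        <Self::Parser as Parse>::Error: ExtraneousInput,
    {
        let mut parser = self.build_parser();
        match parser.push(input)? {
            Some((value, "")) => Ok(value),
            Some((_, rest)) => Err(ExtraneousInput::extraneous(rest)),
            None => parser.finish(),
        }
    }
}

/// Stand-in for the "conjure an error from extraneous input" hook discussed in
/// the Errors section further down; also purely hypothetical.
pub trait ExtraneousInput {
    fn extraneous(rest: &str) -> Self;
}

Under this shape, from_str_radix would be roughly with_radix(radix) followed by parse_one, while buffered input would go through push and finish directly.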

Parser States

Parsers should take in string inputs and return results with… something. Parsed values can have the usual hard errors, but they can also have "soft" errors indicating extraneous input, etc. A good example of a hard error for parsing integers is when no valid digits have been encountered; an empty string is not a valid number, so simply starting with the string "abcd" would result in an error along the lines of "no digits encountered".

While the examples given will simply use Result and indicate hard errors with Err, it's noteworthy that some ways of combining parsers would turn all hard errors into soft errors, and that it's useful for some kinds of parsers to indicate whether an error in the data itself is hard or soft. This is something to consider in the shape of the final API, but not as important for the experimental stage.

Now, besides the hard error state, there are three possible states for the parser:

  1. The value is incomplete, and more input is needed. For example, passing in "127." for an IP would be this state.
  2. The value is complete, but more input could be provided to change it. For example, passing in "::" for an IP would be in this state; you could parse "::1", for example.
  3. The value is complete, and any additional input would be extraneous. For example, "[1]" is valid JSON and any other input would be extraneous.

I'm choosing to treat 3 as the same state regardless of whether any extraneous input is actually provided. For example, there's no reasonable distinction to be had whether you type "[1]" or "[1]a", since the extraneous input in the former could simply be treated as an empty string. We're already dealing with an API that can get complicated, and I don't want to further complicate it.
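
Purely for discussion, one way to give those three states names in code (with hard errors living in the Err of whatever Result wraps this) might be:

/// Illustrative names for the three non-error states listed above.
pub enum ParseState<T> {
    /// State 1: more input is required before any value exists ("127." as an IP).
    Incomplete,
    /// State 2: a value exists, but further input could still change it ("::" as an IP).
    Complete(T),
    /// State 3: a value exists and any further input is extraneous ("[1]" as JSON).
    Finished(T),
}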

Accepting Input

So, while separating out a parser and a parsed value is one half of the API design, the other half is how we cut off the input and hand the remainder back to the caller. There are a few different ways we could handle this. I'll use Option and Result terms for brevity, but we could in theory create other types instead. (The example actually uses Poll.) The options below list the potential return values after passing in input:

  1. State 1 returns Ok(None), state 2 returns Ok(None), state 3 returns Ok(Some((val, extra))). At any point, a user can call a "finish" method to signal EOF; this returns Err(Insufficient) for state 1, Ok(val) for state 2, and the result of parsing an empty string for state 3 (either Err(Insufficient) or Ok(val) depending on whether an empty string is okay).
  2. State 1 returns Ok(None); states 2 and 3 return Ok(Some((handle, extra))). The handle borrows the parser and must be dropped or converted into a value before more input can be added.
  3. State 1 returns Ok(None); states 2 and 3 return Ok(Some((val, extra))), and parsers can only parse exactly one value. If state 2 happens before state 3, multiple values are constructed, and previous intermediate results can be dropped if more input is obtained. There can be methods on parsers that clear them, but those are outside the API here.

I personally prefer 3 because it's the simplest, and it actually allows us to secretly do option 2 as well. If we allow the parser to borrow itself when returning the value, we could avoid throwing away any intermediate results by simply directly borrowing the result.

For example, we could literally make the parsed value be the handle we discussed, and then calling a method to remove the value would reset the parser.
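
A rough sketch of that "value doubles as a handle" idea, with Reset standing in for however a parser clears its state; all of the names are hypothetical:

/// Stand-in for "the parser can be cleared for the next value".
pub trait Reset {
    fn reset(&mut self);
}

/// A parsed value that borrows its parser; extracting the value resets the
/// parser, so intermediate results are only thrown away when the caller asks.
pub struct Parsed<'p, P: Reset, T> {
    parser: &'p mut P,
    value: T,
}

impl<'p, P: Reset, T> Parsed<'p, P, T> {
    /// Take the value out and reset the borrowed parser for reuse.
    pub fn take(self) -> T {
        self.parser.reset();
        self.value
    }
}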

Parameterization

Parameterizing the API based upon what strings are being parsed feels pretty straightforward: the traits can accept a generic type that would default to str, and then you can pass in [u8], [u16], OsStr, CStr, Path, etc. as you please.

I'm going to mostly leave this as an afterthought to the API because I believe it can be easily incorporated later, and we do have GATs to be able to depend on lifetimes for things, but it's still a point that should be addressed.

In the example I gave, I used Default + PartialEq as the bound on the parameter, under the assumption that you could do something like extra == Default::default() to check if there's any extraneous input. We could probably replace it with a custom trait or trait alias. I don't think that we need something particularly complicated, though.
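
As a tiny illustration of that bound in action, a hypothetical helper can check "is there any extraneous input left?" generically over both str and byte slices:

fn is_exhausted<I: Default + PartialEq>(extra: I) -> bool {
    // "No extraneous input" is spelled as "equal to the empty value".
    extra == I::default()
}

fn main() {
    assert!(is_exhausted(""));
    assert!(!is_exhausted("ab"));
    assert!(is_exhausted::<&[u8]>(&[]));
    assert!(!is_exhausted::<&[u8]>(b"ab"));
}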

Errors

In the example I provided, I added two methods, insufficient and extraneous, which conjure errors based upon insufficient and extraneous input when such errors are desired. Ultimately, these options will have to be available either through the parser or the error itself, and it's worth mentioning that they are required.

This could be motivation for a standard ParseError type, like io::Error, where these variants are automatically included. I personally prefer parameterization, since parse errors tend to be sophisticated and being able to recover from them is usually helpful, but it is a valid path as well.
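
For comparison, the "standard ParseError type" path could look something like the sketch below, with the two required cases built in and a slot for a parser-specific error; this illustrates the alternative rather than the preferred parameterized design:

/// Hypothetical shape for a shared parse-error type.
pub enum ParseError<E> {
    /// Input ended before a complete value could be parsed.
    Insufficient,
    /// A complete value was parsed, but unconsumed input remained.
    Extraneous,
    /// A parser-specific hard error.
    Other(E),
}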

Common Parsers

"Overflow foresight" was something I mentioned earlier, and it's something I would propose to avoid in libstd parsers. I think that any numerical parser should continue accepting input as long as only digits have been seen, and if overflow happens, an error is returned.

Ultimately, I don't think that anyone would ever want to parse the string "255255" as two u8s concatenated together instead of one invalid u8, and the standard parsers should probably behave accordingly.

Combinators

It's worth mentioning that, like iterators, parsers of this form can easily be combined together to create more complicated parsers. I would like to explicitly propose here that the standard library should not take on this role, and this is much better suited to the ecosystem. However, I would also like any solution to be able to reuse libstd implementations of parsers without having to roll your own.

For example, parsing a network prefix like 127.0.0.1/8 should not require searching the string for a /, parsing the part before as an IP, and then parsing the 8; it should simply require parsing the IP, then parsing the remaining input as /8 or an empty string.
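
For contrast, this is roughly what the lack of composition forces today: the caller has to locate the / itself before it can reuse the std Ipv4Addr parser, because FromStr rejects trailing input. (The helper below is hypothetical and uses only current std APIs.)

use std::net::Ipv4Addr;

/// Parse "a.b.c.d/len" by manually splitting on '/' first.
fn parse_cidr(s: &str) -> Option<(Ipv4Addr, u8)> {
    let (ip, prefix) = s.split_once('/')?;
    Some((ip.parse().ok()?, prefix.parse().ok()?))
}

fn main() {
    assert_eq!(parse_cidr("127.0.0.1/8"), Some((Ipv4Addr::new(127, 0, 0, 1), 8)));
}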

Alternatives

We could have parsing be another type of conversion, like the truncating conversions proposed in various attempts to convert as into a From-like trait. Or, we could just remove FromStr entirely and say that we don't need a custom parse method.

I personally disagree with this, since obtaining input from I/O is a common operation and the solution should mirror that. However, I don't think that an alternative similar to FromStr is worth it: in that case, we should just rely on TryFrom instead.

One other, interesting alternative is to keep the entire API but not parameterize it by string type, explicitly accepting only byte strings. This would be more in line with allowing buffered input, and means we could strap it into the io module somehow in the future. I would prefer such an implementation to simply be the [u8] instantiation of a parameterized API, but it's worth mentioning regardless.

Links and related work

Alternative ACPs:

Ecosystem parser traits (thanks @epage):

Other discussions:

(I know they exist, but am not going to go looking right now. We'll add them here as they get pointed out.)

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

Second, if there's a concrete solution:

epage commented 9 months ago

Has a nebulous relationship with TryFrom and From (when should TryFrom, From, and FromStr differ?)

This feels contradictory to the opening of "Motivating examples or use cases" which establishes that there are clearly different intended use cases.

epage commented 9 months ago

Prior art for parser traits

epage commented 9 months ago

The value is incomplete, and more input is needed. For example, passing in "127." for an IP would be this state.

If parsers are meant to be composed, it'd be a real pain to deal with incomplete data in the Ok case compared to Err.

epage commented 9 months ago

Parsed values can have the usual errors, but they can also have "soft" errors indicating extraneous input. A good example of an actual error for parsing integers is if there have been no valid digits encountered; an empty string is not a valid number, so, simply starting with the string "abcd" would result in some error along the lines of, no digits encountered.

If these are meant to be composed, then any error can be a "soft" error and cause backtracking within the parser.

clarfonthey commented 9 months ago

Thank you for the list of parser traits! I omitted your notes + ordered them alphabetically (for fairness) in the ACP, but you do provide useful insight into the design of these. I had looked into these as I was designing the example a while ago, although for some reason it completely didn't cross my mind to include them in the ACP.

Has a nebulous relationship with TryFrom and From (when should TryFrom, From, and FromStr differ?)

This feels contradictory to the opening of "Motivating examples or use cases" which establishes that there are clearly different intended use cases.

I should add, this is 100% based upon the naming of FromStr and its current shape; it looks and sounds like a version of TryFrom specific to strings, but it's actually a generic parsing trait. Open to suggestions on how to better word this. You're right that the use cases are distinct, which is why I believe parsing should be done with a dedicated parser; it's up to this ACP to decide whether such a parser trait should exist in libstd though.

If parsers are meant to be composed, it'd be a real pain to deal with incomplete data in the Ok case compared to Err.

That's fair, and it'd be a very good case for having a custom Result instead of what I presented as well. One of the first designs I came up with did this approach, and looked similar to nom::Error, but ultimately, I rejected it on the grounds of it being too complicated for libstd. I will try and better word this in the ACP though.

epage commented 9 months ago

That's fair, and it'd be a very good case for having a custom Result instead of what I presented as well. One of the first designs I came up with did this approach, and looked similar to nom::Error, but ultimately, I rejected it on the grounds of it being too complicated for libstd. I will try and better word this in the ACP though.

side note: imo in designs I'd recommend referring to winnow::ErrMode as it tries to clear up a lot of the naming confusion with nom::Error

It doesn't have to be a choice between Poll and including ErrMode. You can design it so people can opt-in to an ErrMode that fits their needs (backtracking, incomplete data, etc).

scottmcm commented 9 months ago

Big fan of extending it to support at least some basic compositionality. The fact that one can't trivially use u32 as FromStr even to parse, say, a comma-separated list of numbers seems suboptimal. I like the idea of a parse_prefix -> Result<(Self, &str), E> kind of thing that parse could be implemented in terms of by checking that there was no tail. (It makes me think of the kinds of things you can do with https://doc.rust-lang.org/std/str/struct.Utf8Error.html#method.valid_up_to.)
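
A minimal sketch of that parse_prefix shape, with the whole-string parse recovered as "parse a prefix, then require an empty tail"; the names and error plumbing are illustrative only:

/// Hypothetical marker error for "there was input left over".
pub struct TrailingInput;

pub trait ParsePrefix: Sized {
    type Err;

    /// Parse a value from the front of `s`, returning the unparsed tail.
    fn parse_prefix(s: &str) -> Result<(Self, &str), Self::Err>;

    /// The `FromStr`-style behaviour, recovered by rejecting any leftover input.
    fn parse_all(s: &str) -> Result<Self, Self::Err>
    where
        Self::Err: From<TrailingInput>,
    {
        match Self::parse_prefix(s)? {
            (value, "") => Ok(value),
            _ => Err(TrailingInput.into()),
        }
    }
}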

That said, I'm really unsure how far we can reasonably go in core. You mention lots of things that are great features, but I worry that offering them all might make it too complicated to ever find the "obviously right" API for stuff. If we don't even have Display for a Vec, how could we ever have a smart parser for one? And things like String can't be partially parsed from a &str anyway, since there's no way to know when to stop.

Maybe there's a way to offer some shared things for the basics of simple length-limited integers and characters and such that can be reused by other things, but that the bigger infrastructure should be left to crates?

clarfonthey commented 9 months ago

I mean, what you're proposing is a valid other option: design things that fit an API similar to the one proposed, but don't provide any actual traits for it. I think that there should be at least some separate parser object rather than a series of methods like parse_prefixed_etc, but maybe an alternative could be to standardise on a way to indicate the maximum buffer length something could need.

But this could throw a wrench into generalising since, for example, 0255 is a valid u8, but the presumed max length would be 3.

This is mostly why I propose separate parsing objects that can be passed additional strings, since that way we don't have to worry about those cases. But I could see an argument made for what you're proposing if done properly.

programmerjake commented 9 months ago

In the example I gave, I used Default + PartialEq as the bound on the parameter, under the assumption that you could do something like extra == Default::default() to check if there's any extraneous input. We could probably replace it with a custom trait or trait alias. I don't think that we need something particularly complicated, though.

sounds like an excuse to add an IsEmpty trait...

pub trait IsEmpty {
    fn is_empty(&self) -> bool;
}

impl<T> IsEmpty for [T] {
    fn is_empty(&self) -> bool { self.len() == 0 }
}
impl<T> IsEmpty for Vec<T> {
    fn is_empty(&self) -> bool { self.len() == 0 }
}
impl<T: ?Sized + IsEmpty> IsEmpty for &'_ T {
    fn is_empty(&self) -> bool { T::is_empty(self) }
}
impl<T: ?Sized + IsEmpty> IsEmpty for &'_ mut T {
    fn is_empty(&self) -> bool { T::is_empty(self) }
}
impl<T: ?Sized + IsEmpty> IsEmpty for Box<T> {
    fn is_empty(&self) -> bool { T::is_empty(self) }
}
// ... plus str, String, OsStr, CStr, Path, and the other collections

afetisov commented 7 months ago

I feel this is missing the answer and motivation for the core question: why should this be included in the std? The parser ecosystem of Rust is quite healthy, the crates seem to be evolving fine on their own. It's also not like an issue with async runtimes, where a lack of common traits makes different crates non-composable. There is no issue with using different parser crates in different parts of the project, if need be. Also, from a cursory glance at the API of mentioned crates, it's already obvious that there are multiple valid entirely different approaches, which casts doubt on any effort to invent a single best API for std. And it's not even a complete list of parser crates & traits. Some approaches to parsing also include intermediate steps, like lexing, so that a parser operates on tokens rather than raw buffers. Some approaches also merge steps like parsing and lexing into unified traits, e.g. a lexer works on an iterator of bytes, while a parser works on an iterator of tokens.

The proposal also includes way too many features & issues to hope to find a single best solution. Statefulness, returning unparsed trailing bytes or erroring on their existence. Parsers which support parsing from multiple data parts! Many applications don't need some of those features. For example, I have never required multipart parsers in my line of work, and stateful parsers which can be continued from partial parse states are usually more trouble than they're worth. Once you have a fast lexer, unless we're talking about multi-megabyte inputs it's easier to reparse the entire input if need be than to try to keep track of parser state (which could be entirely contained in local variables of the relevant functions).

Cannot parse non-str content (most common case is byte strings)

That's essentially trying to include deserialization as a special case of this API, which means that serde should be included in the prior art. serde offers yet another entirely different API, due to the requirement to support different serialization formats for the same data. Note that it's still not general enough, e.g. it's famously unfit for protobuf parsing (the prost parser uses entirely different traits). It is also a poor fit for ASN.1 (de)serialization, due to the possibility of the same data being serialized differently across slightly different variants of the same format.

The only compelling reason to change FromStr that I see is being unable to return a result depending on the lifetime of the parsed string. That one is a very clear and specific limitation, although it's hard to change backwards-compatibly.

clarfonthey commented 7 months ago

it's easier to reparse the entire input if need be than to try to keep track of parser state (which could be entirely contained in local variables of the relevant functions).

Have you considered that the parser state in question would be of the form of a struct which could be contained in local variables of the relevant functions?

For example, the incredibly complicated parser state for an integer parser might just be… the current value of the integer being parsed, wrapped in some newtype, maybe with a boolean flag or two for ensuring you fail on empty input/multiple signs passed.
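
For instance, the entire state of such an integer parser could plausibly be nothing more than this (a hypothetical sketch, with invented field names):

/// Roughly all the state an incremental integer parser needs between chunks.
pub struct IntParseState {
    /// Magnitude accumulated so far.
    pub value: u32,
    /// Whether a leading '-' was seen.
    pub negative: bool,
    /// Whether at least one digit has been seen (empty input must fail).
    pub saw_digit: bool,
}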

If some value straddles multiple I/O reads, instead of being forced to allocate a buffer two (or more) chunks long, it's much easier to just store the state between reads. That state isn't kept in some complicated global parser state, just a local variable, like you said; the difference is that it's encapsulated in some API struct rather than a handful of hand-written variables.

Also--

Once you have a fast lexer, unless we're talking about multi-megabyte inputs it's easier to reparse the entire input if need be than to try to keep track of parser state (which could be entirely contained in local variables of the relevant functions).

This issue also applies to lexing, since tokens representing numbers, dates, or any other pre-processed value would otherwise require allocating additional space between reads. Much easier for a lexer in a situation like this to hold onto its state between the reads than to allocate a bigger buffer and parse multiple times.