rust-bakery / nom

Rust parser combinator framework
MIT License
9.43k stars 805 forks source link

Size from incomplete streaming crlf parser #1365

Open BoniLindsley opened 3 years ago

BoniLindsley commented 3 years ago

The parser for matching \r\n has a streaming version. I would like to ask about the case where the input is exactly \r. This case is covered in unit test, and checks for an incomplete error containing Needed::new(2). https://github.com/Geal/nom/blob/e2a0dd52e07c6c55ab4ee7fad88e987a814a7a0f/src/character/streaming.rs#L1093

My understanding of Needed is that it counts the number of extra bytes necessary to parse. And in this case it should be one. Is returning two instead of one deliberate? https://github.com/Geal/nom/blob/e2a0dd52e07c6c55ab4ee7fad88e987a814a7a0f/src/character/streaming.rs#L149

I recognise that this one byte does not matter in the grand scheme of things. But for correctness, perhaps Needed::Unknown is more appropriate? It is technically known. Checking the input length for zero or one should given an answer, but the input trait does not allow for this, I think. Also technically possible by comparing \r, but seems over the top for the difference of one byte.

Meaning of "needed"

This is a tangentially related question. Consider an alt of two streaming tag parsers matching "hello" and "help", acting on "hel". We may have a result in one or two bytes. What error should this return?

It currently returns one or two, depending on which tag comes first in the alt list. This feels ambiguous.

fn alt(input: &str) -> nom::IResult<&str, &str> {
    nom::branch::alt((
        nom::bytes::streaming::tag("help"),
        nom::bytes::streaming::tag("hello"),
    ))(input)
}

assert_eq!(
    alt("hel"),
    Err(nom::Err::Incomplete(nom::Needed::new(1))),
//  // If "hello" first.
//  Err(nom::Err::Incomplete(nom::Needed::new(2))),
);

Sorry if this seems irrelevant. I am trying to understand what I can rely on from the needed value, and what to return when writing custom streaming parsers.

Environment

cenodis commented 3 years ago

I am trying to understand what I can rely on from the needed value

If a parser returns some amount of needed bytes x.

There is no hard guarantee that lets you know for sure when you have added enough input (except calling the parser and having it return a result). But Needed lets you know when you can skip calling the parser because you dont have enough input yet.

It currently returns one or two, depending on which tag comes first in the alt list. This feels ambiguous.

Its not really ambiguous if you keep the information above in mind. If alt returned the highest Needed value from all of its subparsers then it might return too much. To pick up your example, imagine the incomplete "hel" actually completes to "help". There is only 1 byte missing until the parser returns a result, but picking the largest Needed value would result in 2. The alt parser can only know for sure that 2 bytes are needed if the first parser indicates that it can definitely not match the input, which is only possible if it has enough input to make sure thats correct. In conclusion, there is only 1 byte Needed before the parser may or may not return a different result.

Also in order to collect the highest Needed value alt would have to continue executing every parser in its chain, collect the number of needed bytes for each and then return the maximum from that list. This would mean a massive hit to performance and is therefore not done.

That being said the crlf parser should probably do an additional check on Incomplete to see if the first \r character is present and then only return Needed::new(1).

BoniLindsley commented 3 years ago

If a parser returns some amount of needed bytes x.

  • If you add less than x bytes to the input the parser will most likely return Needed again.
  • If you add x or more bytes to the input the parser may or may not return Needed again.

Thank you for clarifying. This is part of why I felt confused. If both cases boil down to "may or may not require more data", why not just treat every Needed value the same as Needed::Unknown?

There is no hard guarantee that lets you know for sure when you have added enough input (except calling the parser and having it return a result). But Needed lets you know when you can skip calling the parser because you dont have enough input yet.

I guess the point is, then, that reading the Needed number of bytes does not guarantee completion, and is an advise for meaningful forward progress in parsing, for some definition of "meaningful".

To pick up your example, imagine the incomplete "hel" actually completes to "help". There is only 1 byte missing until the parser returns a result, but picking the largest Needed value would result in 2. The alt parser can only know for sure that 2 bytes are needed if the first parser indicates that it can definitely not match the input, which is only possible if it has enough input to make sure thats correct. In conclusion, there is only 1 byte Needed before the parser may or may not return a different result.

In this case, I suppose the meaningful progress after reading one more byte would be the confirmation or elimination of one tag.

Also in order to collect the highest Needed value alt would have to continue executing every parser in its chain, collect the number of needed bytes for each and then return the maximum from that list. This would mean a massive hit to performance and is therefore not done.

Indeed. Such an approach feels impractical, and the result not worth the effort. You have convinced me that Needed should not be seen as an upper bound for a completion.

That being said the crlf parser should probably do an additional check on Incomplete to see if the first \r character is present and then only return Needed::new(1).

This brings me back to the main question: Is Needed::new(2) considered one of many acceptable return values because it is advisory, with Needed::new(1) being preferred?

As for the implementation detail, would it make more sense to just check for \r and then \n separately?

Stargateur commented 3 years ago

line_ending() could be write:

pub fn line_ending<I, E>(input: I) -> IResult<I, (Option<char>, char), E>
where
    I: InputIter + Slice<RangeFrom<usize>>,
    <I as InputIter>::Item: AsChar,
    E: ParseError<I>,
{
    pair(opt(char('\r')), char('\n'))(input)
}
Lucretiel commented 3 years ago

I agree that this is a bug. Needed should include the minimum number of bytes required to make progress, signaling to a streaming system to buffer at least that many bytes before retrying. It's easy to imagine this resulting in correctness issues: suppose you had a CRLF delimited document, that had been read up to the CR. The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.

Regarding your alt + Needed question: This is something I just recently encountered as well. I'm not really sure I know what the best solution is; I've juggled with things like "if they all return Needed, return Needed with the lowest value". However, I think it's important that alt always returns the first matching subparser, and designers using it come to rely on this behavior. I think it therefore makes sense that Needed is propagated immediately, because the earlier parser could succeed, and therefore we should get more bytes to give it the chance before continuing with other branches.

Stargateur commented 3 years ago

The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.

That not how I see thing, I think one should read it as "at least 1", you try to read as much data as possible and try again, if you read at least one more octet you can retry.

Maybe:

BoniLindsley commented 3 years ago

I agree that this is a bug. Needed should include the minimum number of bytes required to make progress [...]

I had wondered about that. But I could not think of an acceptable definition for "progress". For example, if progress is an opportunity to complete or error out, then for most parsers, the minimum would just be one, by virtue of them doing some sort of filtering. (On the other hand, I had only started on my Rusty and nommy journey a month ago, so may be I do not have enough experience to say that.)

[...] The parser returns Needed(2), but the stream can only provide 1 more byte, which would mean the system would (incorrectly) report an error of an unexpected EOF.

Hmm... I guess that might be a problem because... if remote closes, then the stream might have to wait for a timeout?

I think it therefore makes sense that Needed is propagated immediately, because the earlier parser could succeed, and therefore we should get more bytes to give it the chance before continuing with other branches.

Indeed. It does make sense to keep this behaviour. I was not troubled by this particular example myself. I merely used it as an illustration of the issue -- well-definedness of what Needed means. And we already have different people interpreting them differently! Keeping track of a minimum Needed value amongst alt subparsers sounds like a good compromise. That might not be a "real" minimum though, if the input will fail every subparser in the next byte.

Change name variant of Needed::Size to Hint

I suppose that makes sense from the user's point of view. If a Needed value is returned, what do we expect them to do with it?

  1. If it is a hint, then "try to read that many bytes, and then parse again, as long as you managed to read something." In this case, then, what conditions do we have to not parse again? Specifically, how is the handling different from a Needed::Unknown?
  2. If it is minimal, then may be "don't bother parsing again without reading that many bytes".
Lucretiel commented 3 years ago

That not how I see thing, I think one should read it as "at least 1", you try to read as much data as possible and try again, if you read at least one more octet you can retry.

I think I agree? Definitely a buffered stream reader should be reading in large chunks. I only mean that Needed(2) means that you need at least 2 more bytes, so a buffering system needs to read at least 2 more bytes before it retries the parser.

When I say "can only provide one more byte" I mean that the underlying reader (ie, a TCP stream) provides one more byte and then the connection closes. In that case, given a Needed(2), a parser adapter could correctly conclude that "the underlying parser requested at least 2 more bytes, but I was only able to get one more byte, so I can return an unexpected EoF without retrying the underlying parser"

Lucretiel commented 3 years ago

Specifically, how is the handling different from a Needed::Unknown

I agree that I haven't been able to think of a useful distinction here. I think that semantically there's a difference:

However, in practice I don't know the difference between Unknown and Needed(1); in both cases the behavior should be "load at least one more byte (but probably more) and retry. Potentially it could be good for error reporting.