rust-bakery / nom

Rust parser combinator framework
MIT License
9.33k stars 800 forks source link

nom 6.0 roadmap #1132

Open Geal opened 4 years ago

Geal commented 4 years ago

placeholder for now, there is no plan

some ideas for now:

Geal commented 4 years ago

combinators will now require parser arguments to implement a Parser trait (instead of Fn) and return a FnMut, as of 84d1ac025ba2.

Waelwindows commented 4 years ago

How about making nom errors std compatible ?

Stargateur commented 4 years ago

Error in nom are still the most annoying thing IMO

Patryk27 commented 4 years ago

Working with the Parser's trait is a bit cumbersome, because each invocation of .and() creates an additional tuple:

tag("in ")
    .and(number())
    .and(tag(" days"))
    .map(|((_, days), _)| RelativeDate::Days(days))
       // ^^^^^^^^^^^^^^ nasty nesting

warp does some trait magic that automatically "flattens" enums [1], allowing - in our case - for:

tag("in ")
    .and(number())
    .and(tag(" days"))
    .map(|(_, days, _)| RelativeDate::Days(days))

Maybe something similar could be implemented for nom 6.0? It'd be a breaking change though, so perfect time to decide now :-)

[1] https://docs.rs/warp/0.1.16/warp/trait.Filter.html#extracting-tuples - in their case it's about parameters passed to route handlers

Geal commented 4 years ago

@Waelwindows it's there with bf83d5d9789893, but are you expecting specific behaviours from std::error::Error implementation? Like, implementing properly source() and backtrace() would require a new VerboseError implementation. @Stargateur do you have any specific issues in mind? Would things like the finish() method ( 9a555fe646488 ) help?

Stargateur commented 4 years ago

Well for example, make custom error is very hard, to do and to understand, I'm not sure I can do it even now. Then, if you want to use VerboseError, and want to return the error, often you can't keep the borrow, so I have this:

        let (_, sip_message) = sip_message(sip::Input::new(input))
            .map_err(|e| {
                e.map(|e| nom::error::VerboseError {
                    errors: e
                        .errors
                        .into_iter()
                        .map(|(i, e)| (String::from_utf8_lossy(i.fragment()).into_owned(), e))
                        .collect(),
                })
            })

It's work, but it's far from being user friendly. finish() look great.

The fact that ErrorKind doesn't have a Custom field or something, even, without user data. Make your parser hard to debug cause you use ErrorKind no custom variant but for your parser so it's hard to know if it's your custom error or the nom parser one.

Currently, I don't know if I can easy switch from VerboseError to ErrorKind to avoid unnecessary allocation in production but my parser is currently not the bottleneck of my project so it's ok for me for now.

(The enum Err not in module error make me confuse a lot of time)

Geal commented 4 years ago

@Stargateur in previous versions, ErrorKind had a Custom variant, but it actually made errors much harer to manipulate. Could you tell me more about the difficulties in making a custom error type? Maybe that's something I could solve with better documentation and some examples

ChristopherBiscardi commented 4 years ago

@Geal On the subject of custom errors, I tried to implement a custom error type and while I got it to work using https://github.com/Geal/nom/pull/1151 , I got stuck on using ? for error handling which wants a From impl I can't seem to provide (see match below for verbose example usage). I can continue with the more verbose form for now but would definitely appreciate either documentation on the below or an indication that I did something silly 😅

|     let (input, val) = alpha(input)?;
|                                    ^ the trait `std::convert::From<nom::Err<(&[u8], nom::error::ErrorKind)>>` is not implemented for `nom::Err<MyCustomError<&[u8]>>`
named!(hashtags, is_a!("#"));
named!(alpha, take_while!(is_alphanumeric));

fn atx_heading(input: &[u8]) -> IResult<&[u8], ATXHeading, MyCustomError<&[u8]>> {
    // long-form match is fine
    let result_tags = hashtags(input);
    let (input, hashes) = match result_tags {
        Ok(v) => v,
        Err(Error((i, e))) => return Err(Error(MyCustomError::Nom(i, e))),
    };
    // custom logic returning an error, also fine
    if hashes.len() > 6 {
        return Err(Error(MyCustomError::MyError));
    }
    // ? wants a From impl I can't provide
    let (input, val) = alpha(input)?;
    Ok((
        input,
        ATXHeading {
            level: hashes.len(),
            value: val,
        },
    ))
}

(edit: I couldn't get this to work tonight. Maybe another time)

Geal commented 4 years ago

@ChristopherBiscardi this is linked to #1010: when you use ?, the From implementation is applied on the error type ( https://doc.rust-lang.org/std/ops/trait.Try.html?search=#tymethod.into_result )

Unfortunately, I cannot fix it without trait specialization. in the meantime, if you already implemented From<(&[u8], ErrorKind)> for MyCustomError<&[u8]>, you could write this: alpha(input).map_err(Err::convert)?

I think I could add a helper to IResult, to write alpha(input)?.convert().

Another way to fix your issue would be to make a alpha parser directly with the error type you need:

fn alpha(i:&[u8]) -> IResult<&[u8], &[u8], MyCustomError<&[u8]>> {
  take_while(is_alphanumeric)(i)
}

or even shorter, inside your atx_heading function:

let  alpha = take_while(is_alphanumeric);
ChristopherBiscardi commented 3 years ago

I was able to get my example working with your suggestions, thanks!

My working custom error ```rust use nom::{ bytes::complete, character::*, error::{ErrorKind, ParseError}, Err::Error, IResult, *, }; // The datatypes we're parsing into #[derive(Debug, PartialEq, Eq)] struct ATXHeading<'a> { level: usize, value: &'a [u8], } struct SEHeading {} enum Heading<'a> { ATXHeading(&'a ATXHeading<'a>), SEHeading(SEHeading), } // A couple of named parsers, built with macros named!(hashtags, is_a!("#")); named!(alpha, take_while!(is_alphanumeric)); named!(spaces, take_while!(is_space)); #[derive(Debug, PartialEq)] pub enum MDXError { MyError, Nom(I, ErrorKind), } impl<'a> From<(&'a [u8], ErrorKind)> for MDXError<&'a [u8]> { fn from((i, ek): (&'a [u8], ErrorKind)) -> Self { MDXError::Nom(i, ek) } } impl ParseError for MDXError { fn from_error_kind(input: I, kind: ErrorKind) -> Self { MDXError::Nom(input, kind) } fn append(_: I, _: ErrorKind, other: Self) -> Self { other } } // parsing Markdown ATX headings, for which too many #'s fails fn atx_heading( input: &[u8], ) -> IResult<&[u8], ATXHeading, MDXError<&[u8]>> { let (input, hashes) = hashtags(input).map_err(Err::convert)?; if hashes.len() > 6 { return Err(Error(MDXError::MyError)); } let (input, _) = spaces(input).map_err(Err::convert)?; let (input, val) = nom::bytes::complete::take_while( is_alphanumeric, )(input)?; Ok(( input, ATXHeading { level: hashes.len(), value: val, }, )) } // a test to run to check if it's working #[test] fn parse_mdx() { assert_eq!( atx_heading( b"# boop " ), Ok(( " " .as_bytes(), ATXHeading { level: 1, value: b"boop" } )) ); } ```
That3Percent commented 3 years ago

With nom the tendency is to write recursive parsers. This fits with the mental model, and is encouraged by the examples.

The issue that arises is that when parsing input that contains deeply nested data structures the parser will fail with a stack overflow. This is a problem for security, as an attacker can intentionally craft input which if parsed will crash a process on demand - as noted in this issue.

The request in the issue is to have a recursion limit which causes the parser to fail without aborting the process. This comes with issues of it's own, and it would be better if it were possible to parse the input successfully regardless of the stack size.

I recently took a stab at converting a small part of a parser that I'm writing which was recursive to be stack based instead. It's not super pleasant, requiring more code and thought to make it correct. I managed, but I think to convert something more involved like json which has several different places where recursion can occur would be a mind bending exercise.

For Nom 6 my wish is that the most natural way to use the APIs ends up using some stack based backend instead of a recursion based one without making everything ugly and hard to follow.

jameysharp commented 3 years ago

I have two suggestions for nom 6:

  1. Remove the Needed::Unknown variant and just use Size(1) for that case;
  2. Eliminate the distinction between streaming and complete parsers, and make "completeness" a property of the input instead.

For the first: The only interpretation of an Incomplete result that makes sense to me is if the number of bytes returned means that reading any fewer bytes can't possibly generate an Ok result. (An Error or Failure result might be possible with less input, though.) If the length needed is not known, then the only reasonable behavior is to retry the parser once you have at least one more byte, right? I see that the Unknown variant dates back to 232f6766c6c00ecb8cc1e4425240f6e07b5cbaef in 2015 but it isn't clear to me why it was introduced.

The second proposal has more background.

I think completeness is better treated as a property of the input. Aside from special cases like length_value, the remaining input after a prefix has been parsed is complete if and only if the original input was. In current nom, if you want to switch between streaming and complete, you have to rewrite the entire tree of parsers. I think you should be able to attempt incremental parses and then try one more run of the same parser in "complete" mode if you read EOF.

I was trying to write a parser that would stop as soon as it found the information I need in a TLS ClientHello message. That requires parsing several layers of nested length-prefixed structures, so length_value seemed like the right combinator. But then I discovered that length_value doesn't invoke its second parser until it has at least as much data available as the first parser indicated it needs, so I can't use it for streaming parsing.

It was easy to change the implementation to do what I wanted, but then some of the tests failed because there's no way for length_value to indicate to its nested parser whether it's handing off a complete buffer or not. I considered adding a separate variant that only works with streaming sub-parsers, but that made my head hurt.

I spent a while thinking about how it could work instead, like allowing parsers to return an "incomplete but if there's no more data coming then here's the result I would produce" error, which complete would transform into an Ok. But that seemed awfully complicated too.

The best I can come up with is making input types be something like struct Input<T>(T, bool) or enum Input<T> { Complete(T), Incomplete(T) } or something along those lines. As a nice bonus, I think that means you can completely remove the streaming/complete split from the API, and have just one version of every parser. Those parsers that need to know whether the input is complete could just check, and the rest would blindly forward that flag to the next parser.

Geal commented 3 years ago

@jameysharp

Remove the Needed::Unknown variant and just use Size(1) for that case;

For the first: The only interpretation of an Incomplete result that makes sense to me is if the number of bytes returned means that reading any fewer bytes can't possibly generate an Ok result. (An Error or Failure result might be possible with less input, though.) If the length needed is not known, then the only reasonable behavior is to retry the parser once you have at least one more byte, right? I see that the Unknown variant dates back to 232f676 in 2015 but it isn't clear to me why it was introduced.

It really depends on the format and the IO pattern that is used. Using Size(1) signals that we know exactly how much is needed, but if we actually don't, reading just 1 more byte, parsing, returning Size(1) then reading 1 more byte, reparsing, might be too costly.

I think completeness is better treated as a property of the input. Aside from special cases like length_value, the remaining input after a prefix has been parsed is complete if and only if the original input was. In current nom, if you want to switch between streaming and complete, you have to rewrite the entire tree of parsers. I think you should be able to attempt incremental parses and then try one more run of the same parser in "complete" mode if you read EOF.

streaming VS complete can actually change inside a parser. As an example, a MP4 file can be an arbitrarily long list of atoms of different sizes, but each element in it is length prefixed.

For your use case, a custom input type would be a good way to experiment with it: https://github.com/Geal/nom/blob/master/doc/custom_input_types.md By implementing a serie of traits ( https://github.com/Geal/nom/blob/master/src/traits.rs ) you can have the input type you need, that would be used in a generic way by most parsers, but for which the additional bool could be recognized by parsers who need it

jhpratt commented 3 years ago

With regard to MSRV, what features from 1.44 (and 1.37) are used? Just curious on the feasibility of it being lower.

neithernut commented 3 years ago

@Geal

@Waelwindows it's there with bf83d5d

VerboseError does implement Debug and Display, but Error has also to be implemented (there's no blanket implementation that I'm aware of). Otherwise many error handling utilities will just not work with it. In this case it would be an empty implementation.

Edit: ah, there is an impl for nom::error::Error. Looks like I missed that one: https://github.com/Geal/nom/blob/c5090ef1e2a1704851b1773169653728b0761d9c/src/error.rs#L99

Edit2: still, it doesn't show up in the documentation, and I still fail to use this trait impl for some reason.

Edit3: aaand there is indeed no impl for VerboseError.

neithernut commented 3 years ago

Turns out what I wrote above is for naught, since the actual error type you usually encounter is nom::Err, not nom::error::Error or nom::error::VerboseError. Sorry for the noise.

However, in my defense I'll have to point out that this is somewhat confusing. In most cases, a type called Error will usually be used in a Result directly, rather than wrapped in another error type.

hbina commented 3 years ago

Hi @Geal can you release a new version with this fix in particular? https://github.com/Geal/nom/commit/59fb86a87ddc65d67dc649b8aff2ad5eacdd1f6c

Geal commented 3 years ago

There's 7.0.0-alpha1 available with the fix

On Fri, Jul 9, 2021, 08:01 Hanif Ariffin @.***> wrote:

Hi @Geal https://github.com/Geal can you release a new version with this fix in particular? 59fb86a https://github.com/Geal/nom/commit/59fb86a87ddc65d67dc649b8aff2ad5eacdd1f6c

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Geal/nom/issues/1132#issuecomment-876936696, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAA5EAEY7TLJCYSCHHLFALTTW2GFLANCNFSM4MCHIEDA .