rust-bakery / nom

Rust parser combinator framework
MIT License
9.46k stars 804 forks source link

Challenges with using nom #1506

Open Firstyear opened 2 years ago

Firstyear commented 2 years ago

Hey mate,

Nom is a really powerful library and when it works well, it can do amazing things. But sometimes it can be extremely hard to find which of the many pieces of nom you need to use, and how to combine them. Even I have sat staring at the docs for hours, feeling ultimately very stupid, wondering at what point did my life choices go so wrong.

Anyway, generally when I get to this point, I always ask "what was the interaction failure that led to this feeling". Here in nom, I think there is a combination of things, but the docs are a really large part.

For example, I was wondering "how do I assert there are no bytes left in an input". So, since this is "bytes" and I have all the input, I went to:

https://docs.rs/nom/latest/nom/bytes/complete/index.html

But there is nothing here that does what I needed. Searching for "complete", "end", "size", "limit", did not find any results.

At this point I gave up and did something else, and when working on a different part of the code with a seperate nom parser I found: https://docs.rs/nom/latest/nom/combinator/fn.eof.html

So what was the "failure" here?

The primary issue is that it was hard to find and discover what it is that I needed. Often it's extremely difficult to locate what piece of nom you need, because the library being fragmented over so many modules, means that you effectively have ~9 different modules you need to look through. This makes it very difficult to locate combinators, and then effectively identify which one you may wish to use.

A possible solution is to "flatten" all the combinators to a single module, so that they can all be listed on a single page with the first line of their docs. This also has the benefit that every combinator needs to then be uniquely named, which helps to potentially resolve conflicts and confusion such as complete and streaming bytes. Alternatelly, this "flattened" module, could be made from pub-use of the other modules with aliases. Another benefit to this could be aliases to some combinators to help them be discovered easier.

Another barrier is that the docs of any parser can be extremely terse. Here's flat map for example:

https://docs.rs/nom/latest/nom/combinator/fn.flat_map.html

It took me multiple attempts to read this to understand what this combinator did, and how or why I would use it. The type signature is huge and has 8 generics. Even reading that it's pretty hard to see what was going on. The doc is a single line, that doesn't really explain the use case. And the example has no comments to indicate "what" that combinator is doing.

In otherwords, the current docs has basically a single line of (terse) human text to explain the type signature, and a single example that you are expected to reverse engineer and interpret. Especially for myself who failed mathematics and formal types and all that jazz, this makes it extremely hard for me to understand "how" or "why" I might care about flat_map!

I think a constructive improvement to this is that any combinators docs should have:

An example where this is done pretty well is "alt" https://docs.rs/nom/latest/nom/branch/fn.alt.html which has a better example and does a lot of what I just suggested.

An example of where this is done poorly is https://docs.rs/nom/latest/nom/error/fn.make_error.html where there are no examples of how to make an error, and I tried for about 45 minutes to work out what to do here, before giving up.

So I hope some of these thoughts help improve the docs and nom in general. It's a powerful library but I think that it needs to show that through it's documentation to make it more accessible, with different types of language for people to use :)

Xiretza commented 2 years ago

I agree that there are a lot of opportunities to improve the documentation - I think pull requests in this area are always welcome!

Firstyear commented 2 years ago

And I would certainly be happy to help write them except ... the barrier here is that without the documentation I don't understand all the modules so I ... can't write the documentation. I would be more than happy to be a docs reviewer though!

Firstyear commented 2 years ago

A good example of this is actually the problem I'm hitting right now. I'm trying to parse the following tagged value from a byte array into a u32 b"$5\r\nvalue\r\n". It must be a byte array because of the type of data I'm using (so I can't use the &str helpers).

So I can use tag combined with take_until to get:

let (rem, ln) = take_until("\r\n")(input)?;
// strip the \r\n that's left over at the start of value, makeing rem b"value\r\n".
// leaves ln as b"$5"
let (_, rem) = rem.split_at(2);

And here is where it becomes a mess.

I can use tag to get the "$" as the preceeding char, but trying to wrap that in map res as demonstrated in the recipes: https://docs.rs/nom/latest/nom/recipes/index.html#hexadecimal

Well, u32::from_str_radix only takes a str, not u8, meaning we have to do some unsafe.

let x = map_res(
    tag(b"$"),
    |out: &[u8]| {
        let a = unsafe { std::str::from_utf8_unchecked(&out) };
        u32::from_str_radix(a, 10)?;
    }
)(ln)

Ahh but that won't work because map_res into out only gives what matched on the tag not the remainder of the input, so it's trying to convert the "$" we matched. Nothing in combinator lets you say "if you matched on A, then map the remainder.

Try to break that out and we get:

let (strln, t) = tag(b"$")(ln)?;
let a = unsafe { std::str::from_utf8_unchecked(&strln) };
u32::from_str_radix(a, 10)?;

This also won't work, because whatever magic map_res was doing to convert ParseIntError to a nom error isn't there.

Trying to read https://docs.rs/nom/latest/nom/error/index.html and I can't work out how to map this error into something nom will be happy with.

I could try to use this:

    let a = unsafe { std::str::from_utf8_unchecked(&strln) };

    let strln = map_res(
        nom::character::complete::digit1,
        |out: &str| {
            debug!("{:?}", out);
            u32::from_str_radix(out, 10)
        }
    )(a)?;

But once again, you can not because

    |
71  |         nom::character::complete::digit1,
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ the trait `nom::error::ParseError<&str>` is not implemented for `nom::error::Error<&[u8]>`

Perhaps what I need is https://docs.rs/nom/latest/nom/sequence/fn.preceded.html but then I'm back to being unable to work out the magic to make the ParseIntError into something that nom is happy with.

This is such a perfect example of the kind of challenges that are commonly encountered with nom. Something that looks like it should be a trivial and simple task, ends up turning into hours of frustration as you can't work out which exact set of pieces you need to assemble together to make it work :(

EDIT:

When I did try and make my own error:

    let (strln, t) = tag(b"$")(ln)?;

    let a = unsafe { std::str::from_utf8_unchecked(&strln) };

    debug!("{:?}", a);
    u32::from_str_radix(a, 10)
        .map_err(|_| {
            Err(nom::Err::Error(
                nom::error::Error::new(strln, nom::error::ErrorKind::Digit)
            ))
        });
73 |             Err(nom::Err::Error(
   |             ^^^ cannot infer type for type parameter `T` declared on the enum `Result`
   |
help: consider specifying the type arguments in the method call
   |
72 |         .map_err::<F, O>(|_| {
   |                 ++++++++

And I have zero clue how to resolve this :)

Xiretza commented 2 years ago

I'm having a hard time following your comment, and github issues aren't really the best medium to figure things like this out anyway - but if you want to join the gitter room (also possible through matrix) maybe we can figure this out a bit more interactively.

Firstyear commented 2 years ago

@Xiretza as much as I appreciate the offer to assist with this specific technical problem, this issue report is meant to show a broader and larger problem - that nom, while being a really powerful, well built library, has extreme challenges with adoption and accessibility for new users. If I, a senior software engineer, with more than 10 years of experience, struggle to use this library from it's documentation, then how do many other people feel every day when they try? How many people have given up in silence and moved on to something else?

So again, my point here is "how can nom be improved so that a user can learn how to use it, and how can they find the pieces they need to be effective"?

damien-white commented 2 years ago

I don't see how this ticket could prove to be helpful to anyone. The ticket doesn't present a concrete problem that can be fixed.

In the opening comment, the author complains about not understanding the library due to lack of documentation. They then proceed to give a fairly in-depth walk-through of exactly how they would help improve the documentation.

When advised that they could open a PR, they declined. Even after that (and posting a huge wall of text and code), they were offered direct help. They declined that as well.

Firstyear commented 2 years ago

@dark-fusion I can't write documentation for something when I don't know how it works. I can't "magically" create that knowledge from nothing. It is completely valid for me to provide suggestions like this that can assist the authors who DO have the subject matter knowledge to that they can resolve the problem both for myself and others.

inferiorhumanorgans commented 2 years ago

@Firstyear

use nom::{
    bytes::complete::tag,

    // Use this instead of nom::number::u8 because you're going from ASCII characters
    // to numbers and not bytes to numbers (e.g. \u{0005}).  IMO the documentation doesn't
    // make this distinction readily apparent. Giving functions the same name as integral types
    // is unnecessarily confusing.
    character::complete::u8,

    multi::length_data,
    sequence::{delimited, terminated},
    IResult,
};

fn parze(input: &[u8]) -> IResult<&[u8], &[u8]> {
    // Use terminated to elegantly consume the trailing \r\n
    terminated(
        // length_data will return N bytes where the first argument is N
        length_data(
            // Use delimited for the same reasons as terminated.  Ultimately
            // what's passed into length_data is the result of the u8 combinator.
            delimited(
                // Our prefix.  This sh/could be put into its own function for ease of reuse.
                tag(b"$"),
                // This is the value we want, replace u8 with whatever size
                // e.g. nom::character::complete::u32 if you want to handle large numbers. 
                // for now u8 returns the length that length_data will use.
                u8,
                // This consumes the trailing "\r\n".  If any newline-esque
                // set of characters is appropriate there is a newline combinator.
                tag(b"\r\n"),
            )
        ),
        tag("\r\n")
    )(input)
}

#[cfg(test)]
mod test {
    use super::parze;

    #[test]
    fn test_parse() {
        let input = b"$5\r\nvalue\r\n";

        // Reassign input if you want to continue parsing
        let (_, value) = parze(input).expect("Don't use unwrap");
        assert_eq!(b"value", value);

        // Convert to a &str if you're dealing with UTF-8 and want to use
        // string/str methods
        let value = std::str::from_utf8(value).expect("Handle your UTF-8 errors here");
        assert_eq!("value", value);
    }
}

Alternatively:

fn parze(input: &[u8]) -> IResult<&[u8], &[u8]> {
    // Return the error with the ? operator here because we want to keep processing
    let (input, _) = tag(b"$")(input)?;
    let (input, length) = u8(input)?;
    let (input, _) = tag(b"\r\n")(input)?;

    // Don't use the ? operator because this is our last parser
    // Alternatively reassign input as above for take and then tag
    // and then return the produced value for take
    terminated(
        nom::bytes::complete::take(length),
        tag(b"\r\n"),
    )(input)
}

First of all don't use unsafe when parsing UTF-8 data. Either work with &str or handle the errors that a safe UTF-8 parsing function will return. For the vast majority of use cases if you're reaching for unsafe there's probably a better, safe, solution.

Mostly it seems like what's not being communicated effectively is how to deal with the remaining input. It takes some practice to wrap your head around that for sure, but hopefully the second example demonstrates how to do this. Error handling is tricky with nom, but since most/all of what you're doing should be covered by the stock combinators you shouldn't have to dig into that.

Insofar as map_err is concerned, nom has a map_res combinator for when you need to return a Result from your map closure, but it can be a bear when the compiler can't infer the correct types. For instance if you wanted your parser to return a &str instead of &[u8]. e.g.

// The first type for IResult is the input, the second the output
fn parze(input: &[u8]) -> IResult<&[u8], &str> {
    map_res(
        terminated(
            length_data(
                delimited(
                    tag(b"$"),
                    u8,
                    tag(b"\r\n"),
                )
            ),
            tag("\r\n")
        ),
        // from_utf8 returns a Result. Using nom's map_err will transform the error case into a nom error
        // and leave the resulting &str in the okay case.
        std::str::from_utf8
    )(input)
}

This is fine-ish:

    let strln = map_res(
        nom::character::complete::digit1,
        |out: &str| {
            debug!("{:?}", out);
            u32::from_str_radix(out, 10)
        }
    )(a)?;

The non-obvious thing here is that you've not destructured the tuple so strln will be (&str, u32). The problem is that there's not enough context to make the appropriate type inferences (and this will rear its ugly head with all sorts of rust code). The solution is to package things up a bit more cleanly. For example this compiles (however unless you need only one digit I'd just use the appropriate u8/u16/u32 combinator):

fn parze_u32(input: &str) -> IResult<&str, u32> {
    map_res(
        nom::character::complete::digit1,
        |out: &str| u32::from_str_radix(out, 10)
    )(input)
}
epage commented 2 years ago

I don't see how this ticket could prove to be helpful to anyone. The ticket doesn't present a concrete problem that can be fixed.

imo its a fairly high quality experience report that both gives us a mindset for failure points, reflects on the failure points, and provides suggestions for each, even if they aren't what should necessarily be done.

they were offered direct help. They declined that as well.

That isn't the point, the point is the experience of the user.

epage commented 2 years ago

Some quick thoughts on the original issue

For example, I was wondering "how do I assert there are no bytes left in an input". So, since this is "bytes" and I have all the input, I went to: complete

This shows people assume complete is what to reach for when no more input is expected. In this case, I'm assuming a name change wouldn't be appropriate but we can provide a "If you are looking for X, see Y" in the doc comment (in this case, point to eof for asserting there isn't any more input).

See https://github.com/Geal/nom/issues/1416 for a similar issue helping to make other combinators more discoverable.

Often it's extremely difficult to locate what piece of nom you need, because the library being fragmented over so many modules, means that you effectively have ~9 different modules you need to look through. This makes it very difficult to locate combinators, and then effectively identify which one you may wish to use.

I'm not sold on completely flattening but there are areas to improve here.

inferiorhumanorgans commented 2 years ago
  • I'm wondering what the point of failure was for the user not using the "choose a combinator" document

Agreed. The list of combinators went a long way towards figuring out where I should be looking, however it's incomplete and outdated in some places. More recently I went looking for the regexp combinators and couldn't figure out where they went. It looks like there's a nom-regex crate, but no repo for it?

inferiorhumanorgans commented 2 years ago

Another one that gets me: parsing a number from text. If I want to parse "12.34" to an f64 I would use nom::number::complete::double. If I want to parse "12" into an i64, I would not use nom::number::complete::i64 and instead should usenom::character::complete::i64.

Edit: OH yeah and the list, while great, is incomplete. For e.g. it's missing the value combinator.