rust-bakery / nom

Rust parser combinator framework
MIT License

New Combinator: discard_until / drop_until #1594

Open Trequetrum opened 1 year ago

Trequetrum commented 1 year ago

This is a combinator I use all the time; it might be useful to see something like it in this crate.

It drops a byte at a time until the given parser matches, then returns the result.

I don't do parsing in any really performance-sensitive contexts, so this can probably be implemented better. This impl demonstrates the idea.

use nom::{bytes::complete::take, combinator::map, error::ParseError, multi::many_till, IResult, Parser};

// Drop one byte at a time until `parser` matches, then return its output.
// (Generic over the parser type, since `tag(...)` returns a closure, not a fn pointer.)
fn drop_until<'a, T, E, P>(parser: P) -> impl FnMut(&'a str) -> IResult<&'a str, T, E>
where
    P: Parser<&'a str, T, E>,
    E: ParseError<&'a str>,
{
    map(many_till(take(1u8), parser), |(_, matched)| matched)
}
sshine commented 1 year ago

Isn't this equivalent to discarding the output of take_while?

let (s, _) = take_while(p)(s)?;
Trequetrum commented 1 year ago

Isn't this equivalent to discarding the output of take_while?

I don't fully understand how.


Let's say this is our input:

ahdHEahdkjbHELLOlkasjdLLadO

drop_until(tag("HELLO"))(input) 

returns:

Ok(("lkasjdLLadO", "HELLO"))

I suppose you could use

map(
    pair(
        take_while(not(tag("HELLO"))),
        tag("HELLO")
    ),
    |(_, v)| v
)(input)

but is that better? Maybe... though it seems like this matches HELLO twice because of the not(tag("HELLO")) parser.

sshine commented 1 year ago

I see that drop_until can more easily express some things.

Maybe defining your format in terms of the complement of something is more typical when parsing binary file formats, or when extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

For specifying language grammars, it makes more sense to positively define the thing you're skipping (comments, whitespace, etc.) even if you're just going to discard it. It was with this frame of mind that I assessed the usefulness of drop_until.

epage commented 1 year ago

Note that this somewhat parallels the conversation in #1223 / #1566 regarding how much nom should provide common parsers whose output people drop as needed, versus specialized no-output parsers. One specific case of interest is the multi module, where there are specialized non-allocating variants, since the overhead of capturing the output there is a lot higher. Note that instead of providing O=() parser variants, they are _count variants: they count rather than throw the data away, which is a common pattern in Rust APIs.

Trequetrum commented 1 year ago

Note that this somewhat parallels the conversation in #1223 / #1566

Yeah, I can see that.

I think I'd address this less by the opportunity to tweak performance (it seems like if your parser isn't allocating, but just returning a slice of the input, there's no performance hit) and more by an appeal to providing a gentle introduction to Nom.

Many users come to nom as a means to replace Regex (fully or in part), as regex can quickly become unmaintainable as complexity rises. Generally, regex was never a serious consideration when parsing language grammars, for example. Conceptually, regex is often used to match some embedded pattern of tokens in a larger context: a way to pull desired information from an otherwise noisy document.

Having a few combinators that are a 1-1 match to this domain makes the first tentative steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

If that's true, they're likely going to try...

[...] extracting something that is embedded within what is otherwise considered junk, e.g. codes inside Markdown, CSV, or such, entirely skipping the embedding format.

epage commented 1 year ago

more by an appeal to providing a gentle introduction to Nom.

With clap, a common problem I find is the larger the API is, the more likely people are to not find functionality they need. I feel like nom is on the cusp of that and would hope that nom limits the convenience variants of parsers to help new users with nom.

Many users come to nom as a means to replace Regex (fully or in part) as Regex can quickly become unmaintainable as the complexity rises. ... Having a few combinators that are a 1-1 match to this domain makes the first tepid steps into Nom so much easier for those specific users to take. What I don't know is just how common this case really is. My intuition is that there are a lot of developers who are familiar with regex who may just want to toy with Nom for curiosity's sake.

For some reason I don't see how this helps with aligning with regex. Maybe enumerating regex features and how you feel they line up with existing or potential parsers would help. That could also be a useful piece of documentation for nom.