kevinmehall / rust-peg

Parsing Expression Grammar (PEG) parser generator for Rust
https://crates.io/crates/peg
MIT License
1.44k stars · 105 forks

Add a way to capture the delimiters in a delimited repeat. #259

Open · tomprince opened this issue 3 years ago

tomprince commented 3 years ago

I'm looking at migrating full-moon to use rust-peg for parsing. However, since it captures the entire text (including whitespace and comments), I need to be able to capture the delimiters as well as the main items if I were to use `**` or `++`.

kevinmehall commented 3 years ago

You could do something like:

rule list<I, S>(item: rule<I>, sep: rule<S>) -> (Option<I>, Vec<(S, I)>)
        = first:item() items:(s:sep() i:item() { (s, i) })* { (Some(first), items) }
        / { (None, vec![]) }

rule use_it() = list(<expr()>, <comma()>)

which is kind of like what ** expands to.
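To see why keeping the separators matters for a lossless parser, here is a stdlib-only sketch (the `reassemble` helper is hypothetical, not part of rust-peg) of how the `(Option<I>, Vec<(S, I)>)` shape returned by the `list` rule above can be flattened back into the original source text:

```rust
// Hypothetical helper: flattens the (Option<I>, Vec<(S, I)>) shape produced
// by the `list` rule sketched above back into source text, which is the
// whole point of capturing separators in a lossless parser.
fn reassemble(parsed: (Option<&str>, Vec<(&str, &str)>)) -> String {
    let (first, rest) = parsed;
    let mut out = String::new();
    if let Some(item) = first {
        out.push_str(item);
    }
    for (sep, item) in rest {
        out.push_str(sep);
        out.push_str(item);
    }
    out
}

fn main() {
    // Items "a", "b", "c" with differing whitespace around the commas.
    let parsed = (Some("a"), vec![(", ", "b"), (" , ", "c")]);
    assert_eq!(reassemble(parsed), "a, b , c");
}
```

Because each separator keeps its trivia (whitespace, comments), round-tripping the input is just concatenation.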

I would be interested to hear your experience and pain points in using this library for a lossless parser. Are you producing a typed or untyped syntax tree?

tomprince commented 3 years ago

You could do something like: [...]

It looks like the use of `rule<...>` as the type of a rule argument isn't documented anywhere.

I would be interested to hear your experience and pain points in using this library for a lossless parser.

I've only just started working on converting the existing hand-built parser to peg, so I don't know yet what pain points I'll run into. This is the first major one.

A couple of minor points:

1. I'm adapting an existing split lexer + parser that clusters trivia like whitespace/comments with the adjacent tokens before parsing, so I'm parsing these token clusters (which also include position information), though only the root token matters for driving the parser.
2. I realized as I was writing this that I could also use `[token] {? if token ... }`, but something like

rule number() -> TokenReference<'text>
    = [token] {? if let TokenType::Number { number } = *token {
            Ok(token.with_value(number))
        } else {
            Err("not a number")
        }
     }

still feels a little bit awkward.
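For reference, the `{? ... }` guard above boils down to the following plain-Rust logic; `TokenType` here is a simplified stand-in for full-moon's type, not the real definition:

```rust
// Simplified stand-in for full-moon's TokenType, just to show the shape of
// the `{? ... }` guard outside the grammar macro.
#[derive(Clone, Copy, Debug, PartialEq)]
enum TokenType {
    Number(f64),
    Symbol(char),
}

// Mirrors the body of the `{? ... }` block: succeed with the payload for a
// number token, fail with a static message otherwise.
fn as_number(token: TokenType) -> Result<f64, &'static str> {
    if let TokenType::Number(number) = token {
        Ok(number)
    } else {
        Err("not a number")
    }
}

fn main() {
    assert_eq!(as_number(TokenType::Number(4.0)), Ok(4.0));
    assert_eq!(as_number(TokenType::Symbol('+')), Err("not a number"));
}
```

Depending on the rust-peg version, pattern bindings inside `[...]` (e.g. something like `[TokenType::Number(n)] { n }`) may avoid the `{? ... }` dance entirely, though that doesn't cover the `with_value` wrapping.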

Are you producing a typed or untyped syntax tree?

I'm not sure what you mean by this?

tomprince commented 3 years ago

I would be interested to hear your experience and pain points in using this library for a lossless parser.

I just discovered that I can't implement `ParseLiteral` for my `[T]`. I was going to experiment with using this to allow matching symbols in the parser using string-literal syntax. Though, even if I could, that would let me write a grammar with an invalid symbol that would only be detected at runtime.
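The blocker is Rust's coherence (orphan) rule: both the trait and `[T]` are foreign, so a local newtype is needed. A stdlib-only sketch with `ParseLiteral` and `RuleResult` mocked locally (the real definitions live in `peg-runtime` and also involve a `Parse` supertrait; treat the exact signatures here as assumptions):

```rust
// Local mocks of peg-runtime's RuleResult and ParseLiteral so this compiles
// without the crate; the real trait has a similar shape (assumption).
enum RuleResult<T> {
    Matched(usize, T),
    Failed,
}

trait ParseLiteral {
    fn parse_string_literal(&self, pos: usize, literal: &str) -> RuleResult<()>;
}

// Coherence forbids `impl ParseLiteral for [Token]` when both the trait and
// `[T]` are foreign, so wrap the slice in a local newtype.
struct Tokens<'a>(&'a [&'a str]);

impl<'a> ParseLiteral for Tokens<'a> {
    fn parse_string_literal(&self, pos: usize, literal: &str) -> RuleResult<()> {
        match self.0.get(pos) {
            Some(tok) if *tok == literal => RuleResult::Matched(pos + 1, ()),
            _ => RuleResult::Failed,
        }
    }
}

fn main() {
    let input = Tokens(&["let", "x"]);
    assert!(matches!(input.parse_string_literal(0, "let"), RuleResult::Matched(1, ())));
    assert!(matches!(input.parse_string_literal(1, "let"), RuleResult::Failed));
}
```

The newtype works around coherence, but as noted, it still can't catch an invalid symbol literal at compile time.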

godmar commented 3 years ago

You could do something like:

rule list<I, S>(item: rule<I>, sep: rule<S>) -> (Option<I>, Vec<(S, I)>)
        = first:item() items:(s:sep() i:item() { (s, i) })* { (Some(first), items) }
        / { (None, vec![]) }

rule use_it() = list(<expr()>, <comma()>)

which is kind of like what ** expands to.

I also have a use case where I'd like to collect the delimiters. For instance, in a bash-style shell grammar, pipelines are separated by `&` or `;`, and within a pipeline, commands may be separated by `|` or `|&`. Before stumbling on this issue, my solution required 4 rules instead of 1 in each case; in general, with n choices of delimiters, it would be 2*n rules if I'm seeing this correctly.

So adding syntactic sugar may be useful. Also, it should probably return the separator that follows an item rather than the separator that precedes it (at least for my use case).
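The two associations are mechanically convertible; a hypothetical stdlib-only converter (not part of rust-peg) that regroups the `list` rule's separator-precedes shape into a separator-follows shape:

```rust
// Hypothetical converter between the two associations: the `list` rule above
// yields (Option<I>, Vec<(S, I)>), where each separator *precedes* an item.
// This regroups it as (Vec<(I, S)>, Option<I>), where each separator
// *follows* its item, which suits the shell use case.
fn sep_follows<I, S>(parsed: (Option<I>, Vec<(S, I)>)) -> (Vec<(I, S)>, Option<I>) {
    let (first, rest) = parsed;
    let mut out = Vec::new();
    let mut pending = first;
    for (sep, item) in rest {
        // `pending` is Some whenever `rest` is non-empty in a well-formed parse.
        if let Some(prev) = pending.take() {
            out.push((prev, sep));
        }
        pending = Some(item);
    }
    (out, pending)
}

fn main() {
    let parsed = (Some("a"), vec![(";", "b"), ("&", "c")]);
    assert_eq!(sep_follows(parsed), (vec![("a", ";"), ("b", "&")], Some("c")));
}
```

So a single grammar-level shape plus a small adapter can serve both use cases.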

I'm currently successfully using the `list<>` rule given above. Very elegant. For reference, the resulting code is:

    pub rule cmdline() -> Result<CommandLine, &'input str>
      = delimited_cmdline: list(<pipeline()>, <pipeline_separator()>) {
            let (pipe0, rest) = delimited_cmdline;
            let mut pipelines = vec![pipe0?];

            for (i, (sep, pipe)) in rest.into_iter().enumerate() {
                // The separator at index i follows pipelines[i].
                if sep == "&" {
                    pipelines[i].bg_job = true;
                }
                pipelines.push(pipe?);
            }

            Ok(CommandLine {
                pipelines
            })
        }

    rule pipeline_separator() -> &'input str
        = $(";") / $("&")
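A stdlib-only model of the separator handling above, with a simplified stand-in for the `Pipeline` type, showing why capturing the separator that *follows* an item matters here: `&` marks the pipeline it terminates as a background job.

```rust
// Simplified stand-in for the Pipeline type used in the grammar above.
#[derive(Debug, PartialEq)]
struct Pipeline {
    cmd: &'static str,
    bg_job: bool,
}

// Mirrors the loop in the cmdline rule: with the (first, Vec<(sep, item)>)
// shape, the separator at index i follows pipelines[i], so "&" flags the
// preceding pipeline as a background job.
fn mark_bg(first: &'static str, rest: Vec<(&str, &'static str)>) -> Vec<Pipeline> {
    let mut pipelines = vec![Pipeline { cmd: first, bg_job: false }];
    for (i, (sep, cmd)) in rest.into_iter().enumerate() {
        if sep == "&" {
            pipelines[i].bg_job = true;
        }
        pipelines.push(Pipeline { cmd, bg_job: false });
    }
    pipelines
}

fn main() {
    let out = mark_bg("sleep 5", vec![("&", "echo hi"), (";", "ls")]);
    assert!(out[0].bg_job);   // "sleep 5" is followed by "&"
    assert!(!out[1].bg_job);  // "echo hi" is followed by ";"
    assert!(!out[2].bg_job);  // "ls" has no following separator
}
```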

kevinmehall commented 3 years ago

Also, it should probably return the separator that follows an item rather than the separator that precedes it (at least for my use case).

Yeah, one argument against making this some kind of built-in syntax is the number of different return types you might want, depending on how the separators associate with the items and whether empty lists and leading/trailing separators should be allowed.
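For illustration, a few of the shapes that come up (hypothetical type aliases, not proposed API):

```rust
// Hypothetical aliases for return shapes a built-in "capture the separators"
// operator might need, depending on association and what's allowed at the
// edges. None of these are proposed API.
type SepPrecedes<I, S> = (Option<I>, Vec<(S, I)>); // first item, then (sep, item) pairs
type SepFollows<I, S> = (Vec<(I, S)>, Option<I>); // (item, sep) pairs, then last item
type Interleaved<I, S> = Vec<Result<I, S>>; // flat; permits leading/trailing seps
type Pairs<I, S> = (Vec<I>, Vec<S>); // parallel vectors, items.len() == seps.len() + 1

fn main() {
    // One value of each shape for the input "a ; b", just to show the fit.
    let p: SepPrecedes<&str, char> = (Some("a"), vec![(';', "b")]);
    let f: SepFollows<&str, char> = (vec![("a", ';')], Some("b"));
    let i: Interleaved<&str, char> = vec![Ok("a"), Err(';'), Ok("b")];
    let q: Pairs<&str, char> = (vec!["a", "b"], vec![';']);
    assert_eq!(p.0, Some("a"));
    assert_eq!(f.1, Some("b"));
    assert_eq!(i.len(), 3);
    assert_eq!(q.0.len(), q.1.len() + 1);
}
```

(`Result` stands in for an either-type in the interleaved case; a dedicated enum would be clearer in real code.)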

godmar commented 3 years ago

The better alternative may then in fact be to improve the documentation for the technique that uses `rule<...>` arguments; the user should be able to quickly create whichever variant is best for them from the example if it's included in the README.