kevinmehall / rust-peg

Parsing Expression Grammar (PEG) parser generator for Rust
https://crates.io/crates/peg
MIT License

Case-insensitive literal matches throw off error position #361

Closed: emk closed this issue 11 months ago

emk commented 11 months ago

First, I want to thank you for my absolute favorite Rust parser library. I've written so many little peg parsers for all kinds of things in so many open source tools. peg is an amazing tool.

I'm trying to write a parser that records token locations and whitespace, and I've run into a weird problem that throws off error positions. I'm using the basic trick from #216, and it winds up looking like this:

        /// Keywords. These use case-insensitive matching, and may not be
        /// followed by a valid identifier character.
        rule k(kw: &'static str) -> Token
            = s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            }

However, $([_]*<{kw.len()}>) always consumes kw.len() characters before the {? ... } check runs. In a language with some long keywords, this means that errors may be reported far to the right of where they actually occur.
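
To make the symptom concrete, here is a minimal, self-contained sketch (not from the original report: the grammar name demo, the select entry rule, and the input string are invented for illustration, the rule drops the whitespace capture and the Token type, and the peg crate is assumed as a dependency):

    peg::parser! {
        grammar demo() for str {
            rule k(kw: &'static str)
                = input:$([_]*<{kw.len()}>) {?
                    if input.eq_ignore_ascii_case(kw) { Ok(()) } else { Err(kw) }
                }

            pub rule select() = k("SELECT")
        }
    }

    fn main() {
        // "SELEC7" is not the keyword, but [_]*<{kw.len()}> still consumes all
        // six characters before the {? ... } check runs, so the failure is
        // recorded at the end of that slice rather than at the start of the
        // keyword.
        let err = demo::select("SELEC7").unwrap_err();
        println!("{}", err);
    }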

This version of the k rule is better:

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            / expected!("keyword")

...but it can only ever report expected keyword, not expected SELECT, DROP, or CREATE.

I tried replacing / expected!("keyword") with / {? Err(kw) }. Unfortunately, this trips peg's infinite-loop detection, because peg can't see that the block is guaranteed to fail and treats the alternative as one that can match without consuming input.

Then I started to try weird things, like positive lookahead:

/// A function that compares two strings for equality.
type StrCmp = dyn Fn(&str, &str) -> bool;

        /// Tricky zero-length rule for asserting the next token without
        /// advancing the parse location.
        rule want(s: &'static str, cmp: &StrCmp)
            = &(found:$([_]*<{s.len()}>) {?
                if cmp(found, s) {
                    Ok(())
                } else {
                    Err(s)
                }
            })

        rule k(kw: &'static str) -> Token
            = want(kw, &str::eq_ignore_ascii_case)
              s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!()
            {
                if !KEYWORDS.contains(kw) {
                    panic!("BUG: {:?} is not in KEYWORDS", kw);
                }
                Token {
                    span: s..e,
                    ws_offset: input.len(),
                    text: format!("{}{}", input, ws),
                }
            }

This does other weird things, like failing with expected <unreported>, because apparently positive lookahead is silent on failure?

Anyway, I'll try a few more things and see how it goes. Maybe want(..) should match twice, once as positive lookahead and a second time for real?

emk commented 11 months ago

Hmm, there's no obvious way to fix want without either silently losing errors from k, or matching too far ahead.

Interestingly, I strongly suspect that support for / expected!(kw) would be enough to fix this pretty cleanly. Then I could use quiet! { ... } / expected!(...), which seems to provide the most accurate positions for errors out of all the alternatives.

Thank you for any thoughts or suggestions! And thank you for a great parser.

kevinmehall commented 11 months ago

expected!(non_literal_expr) seems like a good idea and shouldn't be hard to support.

If you want something that works on the current version, you could trick the infinite-loop detection with {? Err(...) } "no_match" or even {? Err(...) } "" (literals are always considered non-nullable, without checking whether they're actually empty). The loop detection is supposed to be conservative and avoid rejecting code that doesn't actually loop, so the handling of {? .. } should probably be relaxed -- right now it isn't treated separately from unconditional sequences.

This would also be a good use case for https://github.com/kevinmehall/rust-peg/issues/284 once added.

emk commented 11 months ago

Thank you for the suggestions!

The no_match trick seems to fail with an error, though:

error: expected one of "#", "/", ";", "crate", "pub", "rule", "use", "}"
    --> src/ast.rs:2510:27
     |
2510 |             / {? Err(s) } "no_match"
     |                           ^^^^^^^^^^

kevinmehall commented 11 months ago

Oops, try (({? Err(s) }) "no_match"). Normally there wouldn't be any reason to put another expression after a block, but what this does is give the loop checker something in the sequence that would consume input. It comes after the block has already failed, so it doesn't affect what the rule matches.

kevinmehall commented 11 months ago

https://github.com/kevinmehall/rust-peg/commit/a85e71b1c10b1dd2928842a647cd7fa7c2e5ad74 allows expected!() to take an expression evaluating to &'static str rather than just a literal. Released in 0.8.2.
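
For anyone landing here later, the quiet!/expected! combination sketched earlier in the thread should then look roughly like this on 0.8.2 (an untested sketch; Token, its fields, and the _ whitespace rule are the ones from the original snippets above):

        rule k(kw: &'static str) -> Token
            = quiet! { s:position!() input:$([_]*<{kw.len()}>) ws:$(_) e:position!() {?
                if input.eq_ignore_ascii_case(kw) {
                    Ok(Token {
                        span: s..e,
                        ws_offset: input.len(),
                        text: format!("{}{}", input, ws),
                    })
                } else {
                    Err(kw)
                }
            } }
            // expected!() now accepts any expression of type &'static str, so on
            // failure the error names the specific keyword rather than a generic
            // "keyword" string.
            / expected!(kw)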

emk commented 11 months ago

Thank you, this is excellent! This will allow me to give much better errors in certain parsers. As always, peg is fantastic.