shnewto / bnf

Parse BNF grammar definitions
MIT License
256 stars 22 forks source link

Empty String Rules (Still) Fail to Match #112

Closed nskobelevs closed 1 year ago

nskobelevs commented 1 year ago

Describe the bug Empty string rules still fail to match in some situations. This is the same as #108 although after it's fix it doesn't happen in all situations but still happens.

To Reproduce

  1. Load grammar from tests/fixtures/bnf.bnf

  2. Try to parse the first line of the text: "<syntax> ::= <rule> | <rule> <syntax>\n" This match will fail.

  3. Try to parse the same line with whitespace before the newline "<syntax> ::= <rule> | <rule> <syntax> \n" This will parse properly

The first example should pass given <line-end> may begin with <opt-whitespace> which has an empty string rule

<line-end>       ::= <opt-whitespace> <EOL>
<opt-whitespace> ::= "" | " " <opt-whitespace>
<EOL>            ::= "
"

Unfortunately I haven't been able to find a smaller example. Weirdly it does parse "<syntax> ::= <rule>" correctly but not the more complicated example above.

If there support for special characters like \n or a built in definition for <EOL> it could be good to have a test that a grammer made from tests/fixtures/bnf.bnf can parse the file itself.

Desktop (please complete the following information):

shnewto commented 1 year ago

ah thanks for raising this issue! I really appreciate the details for reproducing!

I'll take a look 😄

CrockAgile commented 1 year ago

just to share some progress, I started looking into this today, and was able to reduce the bug test case to:

#[test]
fn parse_nested_empty() {
    let grammar: Grammar = "
    <start> ::= <a> <empty>
    <a> ::= 'a' <empty>
    <empty> ::= ''"
        .parse()
        .unwrap();

    let input = "a";

    let parses = parse(&grammar, input);
    assert_eq!(parses.count(), 1);
}

The bug seems to be caused by incorrectly de-duplicating Earley states. Because "empty" rules do not advance the input cursor, different predictions get treated as "duplicates".

I have a fix working locally, but it basically "solves" the problem by duplicating all parse attempts 😅

Next chance I get, I will brainstorm for a more realistic solution

amascolo commented 1 year ago

Any progress on a fix?

I may be hitting the same issue – here's another bare-bones example you can check against:

#[test]
fn test() {
    let grammar: Grammar = "
    <balanced> ::= <left> <balanced> <right>
                 | ''

    <left> ::= <opt-ws> '(' <opt-ws>
    <right> ::= <opt-ws> ')' <opt-ws>

    <opt-ws> ::= '' | <ws>
    <ws> ::= ' ' | ' ' <ws>
    "
    .parse()
    .unwrap();

    let input = "()";

    assert!(
        grammar.parse_input(input).next().is_some(),
        "can't parse: {input}"
    );
}

Greatly appreciate everybody's work so far!

Would love to use bnf as I don't see an established Rust library that supports Earley parsing.

However, this issue is a blocker for my use case 🙁

amascolo commented 1 year ago

@CrockAgile in https://github.com/shnewto/bnf/pull/92 you reference this tutorial, which has a section on empty rules: https://loup-vaillant.fr/tutorials/earley-parsing/empty-rules – maybe there's something helpful there, where it discusses Aycock and Horspool's (2002) paper?

If you get any insights from fixing this issue, it might be worth including the findings in your previous write-up (thanks again for that): https://gist.github.com/CrockAgile/d8cb79607b6c6c49c06a97bd652dc219

Either way, we are in good company – as mentioned by MARPA's author:

Earley's original 1970 algorithm actually had a bug in its handling of nullables. There was an easy fix, but it made the algorithm slower. Since efficiency was already seen as the reason to prefer other parsers, this was a big deal.

CrockAgile commented 1 year ago

Thank you for another example! I do have a working fix, based on the exact solution you linked. So glad we are on the same page.

But while implementing it, a surprising amount of refactoring was needed (allowing for Earley matches on "nullable productions", not just complete "States", if you are curious 😅 )

That refactor introduced a significant performance regression 😱

But I fixed that last night! So

TLDR: I should have it up for review tomorrow 🤞

CrockAgile commented 1 year ago

this fix should now be available under v0.4.3

but this was my first time publishing this crate, so let me know if something is broken 😅

nskobelevs commented 1 year ago

Looks good here - thank you for addressing this!