shnewto / bnf

Parse BNF grammar definitions
MIT License
256 stars 22 forks source link

Detect and auto complete "nullable" productions #113

Closed CrockAgile closed 1 year ago

CrockAgile commented 1 year ago

Closes #112

What

112 noted that empty/"nullable" parsing was encountering some issues. @Loup's Blog explains the root of the problem better than I can here. But the solution can be summarized: detect which non-terminals can be "null", and when "predicting", also automatically complete them.

Detecting null productions is achievable. But BNF provides parse trees. So rather than simply "skipping" nullable nonterminals, BNF should provide how the nonterminal was nullable (possibly by many ways!).

This required a significant refactor because currently matching nonterminals requires a corresponding State. But nullable rules have no directly associated State !

enum TermMatch<'gram> {
    Terminal(&'gram str),
    NonTerminal(StateId), // <--- UH OH
}

So I must apologize because this is much too large of a PR πŸ˜“ I will do better next time!

Changes

Tracing

Tracing helped me a lot when tracking down a much larger performance regression that was introduced during development.

automatica tracing is very helpful, but can be a bit much of noise. And because it uses rust debug symbols, the output can be pretty messy. That said, it does capture A LOT

image

tracing-flame allows for more "manual" flamegraphs, only using spans specifically made by developers allows for more intentional data, but often misses out due to human oversight.

image

Performance

parsing benchmarks from my local laptop:

// before
parse polish calculator time:   [497.64 Β΅s 578.29 Β΅s 662.03 Β΅s]                 

// after
parse polish calculator time:   [564.33 Β΅s 650.73 Β΅s 739.58 Β΅s]                                    
                        change: [-13.197% +14.678% +49.938%] (p = 0.35 > 0.05)
                        No change in performance detected.

So criterion isn't confident, but after many personal trial runs it does seem that +14% performance hit is "real".

That said, I do have ideas of how to improve:

Thanks

I want to thank @nskobelevs and @amascolo for helping by raising the problem and contributing good test cases! I could not have put this PR together without you <3

CrockAgile commented 1 year ago

Oh also some example output to celebrate parsing BNF with optional whitespace

<syntax> ::= <rule>
└── <rule> ::= <opt-whitespace> "<" <rule-name> ">" <opt-whitespace> "::=" <opt-whitespace> <expression> <line-end>
    β”œβ”€β”€ <opt-whitespace> ::= ""
    β”‚   └── ""
    β”œβ”€β”€ "<"
    β”œβ”€β”€ <rule-name> ::= <rule-name> <rule-char>
    β”‚   β”œβ”€β”€ <rule-name> ::= <rule-name> <rule-char>
    β”‚   β”‚   β”œβ”€β”€ <rule-name> ::= <rule-name> <rule-char>
    β”‚   β”‚   β”‚   β”œβ”€β”€ <rule-name> ::= <rule-name> <rule-char>
    β”‚   β”‚   β”‚   β”‚   β”œβ”€β”€ <rule-name> ::= <letter>
    β”‚   β”‚   β”‚   β”‚   β”‚   └── <letter> ::= "s"
    β”‚   β”‚   β”‚   β”‚   β”‚       └── "s"
    β”‚   β”‚   β”‚   β”‚   └── <rule-char> ::= <letter>
    β”‚   β”‚   β”‚   β”‚       └── <letter> ::= "t"
    β”‚   β”‚   β”‚   β”‚           └── "t"
    β”‚   β”‚   β”‚   └── <rule-char> ::= <letter>
    β”‚   β”‚   β”‚       └── <letter> ::= "a"
    β”‚   β”‚   β”‚           └── "a"
    β”‚   β”‚   └── <rule-char> ::= <letter>
    β”‚   β”‚       └── <letter> ::= "r"
    β”‚   β”‚           └── "r"
    β”‚   └── <rule-char> ::= <letter>
    β”‚       └── <letter> ::= "t"
    β”‚           └── "t"
    β”œβ”€β”€ ">"
    β”œβ”€β”€ <opt-whitespace> ::= ""
    β”‚   └── ""
    β”œβ”€β”€ "::="
    β”œβ”€β”€ <opt-whitespace> ::= " " <opt-whitespace>
    β”‚   β”œβ”€β”€ " "
    β”‚   └── <opt-whitespace> ::= ""
    β”‚       └── ""
    β”œβ”€β”€ <expression> ::= <list> <opt-whitespace> "|" <opt-whitespace> <expression>
    β”‚   β”œβ”€β”€ <list> ::= <term> <opt-whitespace> <list>
    β”‚   β”‚   β”œβ”€β”€ <term> ::= "<" <rule-name> ">"
    β”‚   β”‚   β”‚   β”œβ”€β”€ "<"
    β”‚   β”‚   β”‚   β”œβ”€β”€ <rule-name> ::= <letter>
    β”‚   β”‚   β”‚   β”‚   └── <letter> ::= "a"
    β”‚   β”‚   β”‚   β”‚       └── "a"
    β”‚   β”‚   β”‚   └── ">"
    β”‚   β”‚   β”œβ”€β”€ <opt-whitespace> ::= ""
    β”‚   β”‚   β”‚   └── ""
    β”‚   β”‚   └── <list> ::= <term> <opt-whitespace> <list>
    β”‚   β”‚       β”œβ”€β”€ <term> ::= "<" <rule-name> ">"
    β”‚   β”‚       β”‚   β”œβ”€β”€ "<"
    β”‚   β”‚       β”‚   β”œβ”€β”€ <rule-name> ::= <letter>
    β”‚   β”‚       β”‚   β”‚   └── <letter> ::= "b"
    β”‚   β”‚       β”‚   β”‚       └── "b"
    β”‚   β”‚       β”‚   └── ">"
    β”‚   β”‚       β”œβ”€β”€ <opt-whitespace> ::= " " <opt-whitespace>
    β”‚   β”‚       β”‚   β”œβ”€β”€ " "
    β”‚   β”‚       β”‚   └── <opt-whitespace> ::= ""
    β”‚   β”‚       β”‚       └── ""
    β”‚   β”‚       └── <list> ::= <term>
    β”‚   β”‚           └── <term> ::= "<" <rule-name> ">"
    β”‚   β”‚               β”œβ”€β”€ "<"
    β”‚   β”‚               β”œβ”€β”€ <rule-name> ::= <letter>
    β”‚   β”‚               β”‚   └── <letter> ::= "c"
    β”‚   β”‚               β”‚       └── "c"
    β”‚   β”‚               └── ">"
    β”‚   β”œβ”€β”€ <opt-whitespace> ::= ""
    β”‚   β”‚   └── ""
    β”‚   β”œβ”€β”€ "|"
    β”‚   β”œβ”€β”€ <opt-whitespace> ::= " " <opt-whitespace>
    β”‚   β”‚   β”œβ”€β”€ " "
    β”‚   β”‚   └── <opt-whitespace> ::= ""
    β”‚   β”‚       └── ""
    β”‚   └── <expression> ::= <list>
    β”‚       └── <list> ::= <term>
    β”‚           └── <term> ::= "<" <rule-name> ">"
    β”‚               β”œβ”€β”€ "<"
    β”‚               β”œβ”€β”€ <rule-name> ::= <letter>
    β”‚               β”‚   └── <letter> ::= "d"
    β”‚               β”‚       └── "d"
    β”‚               └── ">"
    └── <line-end> ::= <opt-whitespace> <EOL>
        β”œβ”€β”€ <opt-whitespace> ::= ""
        β”‚   └── ""
        └── <EOL> ::= "
"
            └── "
"
flowchart TD
0["syntax"] --> 1["rule"]
1["rule"] --> 2["opt-whitespace"]
2["opt-whitespace"] --> 3[" "]
1["rule"] --> 4["<"]
1["rule"] --> 5["rule-name"]
5["rule-name"] --> 6["rule-name"]
6["rule-name"] --> 7["rule-name"]
7["rule-name"] --> 8["rule-name"]
8["rule-name"] --> 9["rule-name"]
9["rule-name"] --> 10["letter"]
10["letter"] --> 11["s"]
8["rule-name"] --> 12["rule-char"]
12["rule-char"] --> 13["letter"]
13["letter"] --> 14["t"]
7["rule-name"] --> 15["rule-char"]
15["rule-char"] --> 16["letter"]
16["letter"] --> 17["a"]
6["rule-name"] --> 18["rule-char"]
18["rule-char"] --> 19["letter"]
19["letter"] --> 20["r"]
5["rule-name"] --> 21["rule-char"]
21["rule-char"] --> 22["letter"]
22["letter"] --> 23["t"]
1["rule"] --> 24[">"]
1["rule"] --> 25["opt-whitespace"]
25["opt-whitespace"] --> 26[" "]
1["rule"] --> 27["::="]
1["rule"] --> 28["opt-whitespace"]
28["opt-whitespace"] --> 29[" "]
28["opt-whitespace"] --> 30["opt-whitespace"]
30["opt-whitespace"] --> 31[" "]
1["rule"] --> 32["expression"]
32["expression"] --> 33["list"]
33["list"] --> 34["term"]
34["term"] --> 35["<"]
34["term"] --> 36["rule-name"]
36["rule-name"] --> 37["letter"]
37["letter"] --> 38["a"]
34["term"] --> 39[">"]
33["list"] --> 40["opt-whitespace"]
40["opt-whitespace"] --> 41[" "]
33["list"] --> 42["list"]
42["list"] --> 43["term"]
43["term"] --> 44["<"]
43["term"] --> 45["rule-name"]
45["rule-name"] --> 46["letter"]
46["letter"] --> 47["b"]
43["term"] --> 48[">"]
42["list"] --> 49["opt-whitespace"]
49["opt-whitespace"] --> 50[" "]
49["opt-whitespace"] --> 51["opt-whitespace"]
51["opt-whitespace"] --> 52[" "]
42["list"] --> 53["list"]
53["list"] --> 54["term"]
54["term"] --> 55["<"]
54["term"] --> 56["rule-name"]
56["rule-name"] --> 57["letter"]
57["letter"] --> 58["c"]
54["term"] --> 59[">"]
32["expression"] --> 60["opt-whitespace"]
60["opt-whitespace"] --> 61[" "]
32["expression"] --> 62["|"]
32["expression"] --> 63["opt-whitespace"]
63["opt-whitespace"] --> 64[" "]
63["opt-whitespace"] --> 65["opt-whitespace"]
65["opt-whitespace"] --> 66[" "]
32["expression"] --> 67["expression"]
67["expression"] --> 68["list"]
68["list"] --> 69["term"]
69["term"] --> 70["<"]
69["term"] --> 71["rule-name"]
71["rule-name"] --> 72["letter"]
72["letter"] --> 73["d"]
69["term"] --> 74[">"]
1["rule"] --> 75["line-end"]
75["line-end"] --> 76["opt-whitespace"]
76["opt-whitespace"] --> 77[" "]
75["line-end"] --> 78["EOL"]
78["EOL"] --> 79["
"]
coveralls commented 1 year ago

Coverage Status

Coverage decreased (-0.8%) to 92.114% when pulling 1288a4b015ae5e67e8e62bad6b12df04009ee615 on empty-fix into b0a53dbde5db4013c4f1db71e0b4b793ad363c25 on main.

amascolo commented 1 year ago

This PR breaks a few things, here's an example that passes on main but fails on empty-fix:

#[test]
fn test() {
    let grammar: Grammar = "
    <terms> ::= <terms> <ws> <term>
              | <term>

    <term> ::= <qualified>
             | 'unqualified'

    <qualified> ::= 'QUALIFIER:' <qual-term>

    <qual-term> ::= <qual-term> <ws>
                  | 'qualified'

    <ws> ::= ' ' | ' ' <ws>
    "
    .parse()
    .unwrap();

    let input = "QUALIFIER:qualified unqualified";

    assert!(grammar.parse_input(input).next().is_some(), "can't parse: {input}");
}

(Suggestion: may be worth adding some of these cases as unit tests, to help catch future regressions πŸ™‚ )

CrockAgile commented 1 year ago

Thanks! I will start looking at this example now. And thanks for the suggestion. Every test case that was shared I've tried to add to the tests, but might have missed a few

CrockAgile commented 1 year ago

thanks for the help @amascolo ! the issue was unrelated to "nullable" productions work. the source was incorrect order of operations for Earley parsing.

traversals must be added to the "incomplete map" immediately when created, not when later predicting

because identical predictions may have already been made, those predictions would "complete" before other "incomplete" traversals were available

TLDR: fixed the issue. Earley is order dependent (when de-duplicating work)

amascolo commented 1 year ago

@CrockAgile thanks for the quick turnaround, it gives me confidence that bnf is a solid library to build my projects on!

Can confirm this PR fixes nullable productions in the grammar I'm writing, without breaking it anywhere else 😁

CrockAgile commented 1 year ago

this fix should now be available under v0.4.3

but this was my first time publishing this crate, so let me know if something is broken πŸ˜