shnewto / bnf

Parse BNF grammar definitions
MIT License
256 stars 22 forks source link

Traversal Trees #120

Closed CrockAgile closed 1 year ago

CrockAgile commented 1 year ago

Closes #117 Closes #118 (hopefully! 🤞) Closes #119 Closes #115

Try Again!

This PR is the result of iterating on #119. #119 attempted to resolve the same issue, but @amascolo generously raised examples that still failed.

This PR (maybe!) resolves these additional cases, which are now also included as tests.

I will include the identical "root cause" section here as #119 so that this PR may be readable on its own.

Root Cause

Consider the grammar:

<start> ::= <shortfail> | <longsuccess>
<shortfail> ::= <char> 'never'
<char> ::= 'a'
<longsuccess> ::= <long2>
<long2> ::= <long3>
<long3> ::= <long4>
<long4> ::= <char>

When parsing input "a", there are two routes: "shortfail" and "longsuccess". As the names suggest, "shortfail" requires fewer traversals, but always fails. "longsuccess" requires more traversal steps, and should eventually succeed.

The issue is caused because both paths predict the non-terminal "char". Practical Earley parsing requires de-duplicating predictions or else recursive grammars fall into infinite loops. The existing Earley implementation in BNF does roughly the following:

(work roughly alternates between the short and long routes because new traversals are appended to the end of the work queue)

* <start> ::= • <shortfail>
* <start> ::= • <longsuccess>
* <shortfail> ::= • <char> 'never'
* <longsuccess> ::= • <long2>
* <char> ::= • 'a'
* <long2> ::= • <long3>
* <char> ::= 'a' •
* <long3> ::= • <long4>
* <shortfail> ::= <char> • 'never' // <--- notice this does NOT succeed
* <long4> ::= • <char> // !!! this <char> prediction is IGNORED because an identical prediction was already made

All the <longN> productions are necessary because otherwise the <longsuccess> route is able to predict before its completion.

I am sorry if I have not explained this super clearly. It is a tricky problem! Funny aside, I actually thought of this bug while developing v0.4.0 . But I (wrongly) assumed there was something about the Earley algorithm which resolved it. I attempted to write a test, but I did not realize it would require so many intermediate non-terminals to expose. Woops!

Existing Issues

Where this PR differs from #119 is its approach to "duplicate detection". Previously, duplicate Earley states/traversals were identified by which Production and how many Terms had been matched. This turns out to be insufficient, because partially "matched" Productions (i.e. <shortfail> ::= <char> • 'never') could have matched non-terminals via different paths.

New Duplicate Detection

Traversals now match/complete terms by building trees 🌳

A new prediction is the root trunk of a tree, and each matched/completed term adds a new branch. Assuming there are two different traversals which can complete base, a traversal tree segment may look like:

flowchart TD
0["dna ::= • base dna"] --> 1["dna ::= base=1 • dna"]
0["dna ::= • base dna"] --> 2["dna ::= base=2 • dna"]

Performance

On my machine, there was seemingly no performance cost to these changes. I believe there was a cost to adding the new logic for "prior completed" traversals. But that cost was offset by the improvement of traversal trees, instead of reference counted term matching vectors.

Extra

Basically every time I have worked on an Earley bug, I have ended up adding the same manual logging to help with debugging. I decided to commit that logging this time!

There is a new tracing::event! which by default is a noop, but with the "tracing" feature enabled adds logging events.

For Earley, traversal state events are logged when created and during predict/scan/complete. It helps quite a bit with debugging!

CrockAgile commented 1 year ago

I again plan to leave this PR open to give some time for review. Also I still have to read it all over myself with fresh eyes, and add a lot of comments 😅

coveralls commented 1 year ago

Coverage Status

Coverage: 91.063% (-1.2%) from 92.231% when pulling baf5a0907897af4f76fd381a0bee9aa1e268d58f on right-recursive-failure-new into 5a515853b9110acb7d6c7ff5b3fa18ba0ef5aa15 on main.

CrockAgile commented 1 year ago

snapshot testing has exposed some nondeterminism in the Earley parsing. the parse trees are valid, but ambiguous grammars may parse inputs in inconsistent orders between test executions.

I will investigate this (likely a Hashmap or something 🙏). Otherwise, this PR is holding up to my local scrutiny. Once I resolve the nondeterminism issue, I plan to merge 🎊

amascolo commented 1 year ago

@CrockAgile thanks again, amazing to see these fixes merged – any chance of releasing them as 0.4.4?