kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.59k stars 232 forks source link

Valid string doesn't parse #159

Closed Seeker14491 closed 7 years ago

Seeker14491 commented 7 years ago

I compiled the following grammar with nearleyc:

S -> "a" "a" C "b" C
   | "a" C "b" C "a" C
   | "b" C "a" C "a" C

C -> C "a" C "a" C
   | "b" C
   | "b"
   | null

Nearley fails to parse the string "baab" when using this grammar. It's simple to show "baab" is a part of this language:

S
"b" C "a" C "a" C
"b" (null) "a" (null) "a" "b"
"b" "a" "a" "b"

I know this grammar might be over-complicated; I didn't write it. I'm simply using nearley to test it.

kach commented 7 years ago

Fixed. It was stupidity on my part regarding the Leo optimization. Thanks for reporting, and do let me know if anything else broke.

kach commented 7 years ago

(Update to v2.7.12 for the fix.)

baudehlo commented 7 years ago

Note this fix makes the grammar for haraka/node-address-rfc2822 hang forever running the test suite. I don't know if I've coded the grammar badly or not (it's mostly a copy of the EBNF for email addresses), but it didn't hang before.

kach commented 7 years ago

Oh no! Can you point to exactly which grammar fails, and with which input? (Needless to say, you should revert to v2.7.11 for the moment — I'll try to push a fix ASAP.)

baudehlo commented 7 years ago

It fails parsing: "spickett@tiac.net" Sean.Pickett@zork.tiac.net

I can start parsing at name_addr (the tightest definition of what should match) and it hangs. The default is to parse it as "from".

Interestingly parsing: "Joe & J. Harvey" <ddd @Org> succeeds (the previous entry in the test suite). But it produces over 2000 results, rather than the expected 16 which 2.7.11 produces.

On Tue, Feb 14, 2017 at 6:12 PM, Hardmath123 notifications@github.com wrote:

Reopened #159 https://github.com/Hardmath123/nearley/issues/159.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/Hardmath123/nearley/issues/159#event-962262006, or mute the thread https://github.com/notifications/unsubscribe-auth/AAobYyd2w1rkiGr1W8-Zo0pLVx0RQFx9ks5rcjTLgaJpZM4L-Uc0 .

kach commented 7 years ago

I have good news and bad news!

The good news: I don't think there's a bug in nearley. I ran it again with Leo optimizations disabled completely, and got the same behavior.

The bad news (which might be good news in disguise): Your grammar is very ambiguous. That basically means that the same string can be parsed many different ways. The fact that Joe's email address used to produce 16 results is a hint in this direction, but as further proof I compiled it with --nojs to eliminate postprocessors. Then, parsing the address a@b gives the following results:

Parse results: 
[ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ],
          '@',
          [ [ [], [ [ 'b' ], [] ], [] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ],
          '@',
          [ [ [], [ [ 'b' ], [] ], [] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ],
          '@',
          [ [ [ [], [ 'b' ], [] ], [] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ],
          '@',
          [ [ [ [], [ 'b' ], [] ], [] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ],
            '@',
            [ [ [], [ [ 'b' ], [] ], [] ] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ],
            '@',
            [ [ [], [ [ 'b' ], [] ], [] ] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ],
            '@',
            [ [ [ [], [ 'b' ], [] ], [] ] ] ] ] ],
      [] ] ],
  [ [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ],
            '@',
            [ [ [ [], [ 'b' ], [] ], [] ] ] ] ] ],
      [] ] ] ]

You can clearly tell that each item is different. So there is ambiguity in your grammar. :-(

Let's narrow down the issue. Consider, for example, the local_part rule. It says that local_part -> dot_atom {% id %} | quoted_string {% id %} | obs_local_part {% id %}. But, notice that the string "abc" matches both dot_atom and obs_local_part! Since nearley doesn't know which one you "intend" to match, it gives you two copies of a local_part match, one for each match:

$ nearley-test -q -s local_part -i "abc" grm.js 
[ [ [ [], [ [ 'a', [ 'b', [ 'c' ] ] ], [] ], [] ] ],
  [ [ [ [ [], [ 'a', [ 'b', [ 'c' ] ] ], [] ] ], [] ] ] ]

Ambiguity leads to very non-performant grammars, because nearley needs to manage all possible parsings, and that set grows exponentially.

Good luck fixing these things! Let me know if you get stuck.

(P.S. The email parser is a really cool project! :D)

baudehlo commented 7 years ago

Yes the grammar is ambiguous by design sadly. It would probably be better to use a recursive descent parser for it.

On Feb 14, 2017, at 7:36 PM, Hardmath123 notifications@github.com wrote:

I have good news and bad news!

The good news: I don't think there's a bug in nearley. I ran it again with Leo optimizations disabled completely, and got the same behavior.

The bad news (which might be good news in disguise): Your grammar is very ambiguous. That basically means that the same string can be parsed many different ways. The fact that Joe's email address used to produce 16 results is a hint in this direction, but as further proof I compiled it with --nojs to eliminate postprocessors. Then, parsing the address a@b gives the following results:

Parse results: [ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ], '@', [ [ [], [ [ 'b' ], [] ], [] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ], '@', [ [ [], [ [ 'b' ], [] ], [] ] ] ] ], [] ] ], [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ], '@', [ [ [ [], [ 'b' ], [] ], [] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ], '@', [ [ [ [], [ 'b' ], [] ], [] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ], '@', [ [ [], [ [ 'b' ], [] ], [] ] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ], '@', [ [ [], [ [ 'b' ], [] ], [] ] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [], [ [ 'a' ], [] ], [] ] ], '@', [ [ [ [], [ 'b' ], [] ], [] ] ] ] ] ], [] ] ], [ [ [ [ [ [ [ [ [ [], [ 'a' ], [] ] ], [] ] ], '@', [ [ [ [], [ 'b' ], [] ], [] ] ] ] ] ], [] ] ] ] You can clearly tell that each item is different. So there is ambiguity in your grammar. :-(

Let's narrow down the issue. Consider, for example, the local_part rule. It says that local_part -> dot_atom {% id %} | quoted_string {% id %} | obs_local_part {% id %}. But, notice that the string "abc" matches both dot_atom and obs_local_part! Since nearley doesn't know which one you "intend" to match, it gives you two copies of a local_part match, one for each match:

$ nearley-test -q -s local_part -i "abc" grm.js [ [ [ [], [ [ 'a', [ 'b', [ 'c' ] ] ], [] ], [] ] ], [ [ [ [ [], [ 'a', [ 'b', [ 'c' ] ] ], [] ] ], [] ] ] ] Ambiguity leads to very non-performant grammars, because nearley needs to manage all possible parsings, and that set grows exponentially.

Good luck fixing these things! Let me know if you get stuck.

(P.S. The email parser is a really cool project! :D)

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

kach commented 7 years ago

Ah. Why would it be ambiguous by design?

(I'm closing this issue because we agreed it wasn't a bug. But I am curious to continue the conversation!)

baudehlo commented 7 years ago

Because it's an ebnf and so it's meant to be parsed first match to the left style.

A simple example is right at the top level. From is a mailbox list or address list. But a mailbox is an address or a group, so it's ambiguous right away and the ambiguity is there to support the design. It could probably be disambiguated but it would make the grammar much more complex.

On Feb 14, 2017, at 8:38 PM, Hardmath123 notifications@github.com wrote:

Closed #159.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

kach commented 7 years ago

Mmm. I am not too familiar with the grammar you're working with, but in my experience, disambiguating grammars makes them a lot simpler. In your example, couldn't you simply omit the redundant "address list" nonterminal to make that bit of the grammar unambiguous?

baudehlo commented 7 years ago

The aim was to mirror the grammar in the rfcs so that any future changes would be easy to make, and anyone else coming into it would understand the grammar. There's a few LL parsers on npm that I'll look into as replacements.

On Feb 15, 2017, at 12:55 AM, Hardmath123 notifications@github.com wrote:

Mmm. I am not too familiar with the grammar you're working with, but in my experience, disambiguating grammars makes them a lot simpler. In your example, couldn't you simply omit the redundant "address list" nonterminal to make that bit of the grammar unambiguous?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.