kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

Large valid grammar causes nearleyc chunking/streaming to fail #575

Open · AnyhowStep opened this issue 3 years ago

AnyhowStep commented 3 years ago

Repro: https://github.com/AnyhowStep/nearley-2.20.1-large-grammar-file-bug

Original comment: https://github.com/kach/nearley/issues/358#issuecomment-714082295

Workarounds:


Given enough repetitions of the following (or a large enough grammar file),


Identifier ->
    %Identifier {% function (_a) {
    var identifier = _a[0];
    return {
        start: identifier.start,
        end: identifier.end,
        syntaxKind: parser_node_1.SyntaxKind.Identifier,
        identifier: identifier.value,
        quoted: false,
    };
} %}

nearleyc will fail.


Error: invalid syntax at line 2562 col 17:

      %Identifier {% function (_a) {
                  ^
Unexpected input (lexer error). Instead, I was expecting to see one of the following:

A comment token based on:
    ws → ws$ebnf$1 ● %comment _
    expr → expr ● ws expr_member
    completeexpression →  ● expr
    expression+ →  ● completeexpression
    prod → word _ %arrow _ ● expression+
    prog →  ● prod
    prog → prod ws ● prog
    ^ 200 more lines identical to this
    final → _ ● prog _ final$ebnf$1
A "$" based on:
    expr_member →  ● "$" word
    expr → expr ws ● expr_member
    completeexpression →  ● expr
    expression+ →  ● completeexpression
    prod → word _ %arrow _ ● expression+
    prog →  ● prod
    prog → prod ws ● prog
    ^ 200 more lines identical to this
    final → _ ● prog _ final$ebnf$1
//snip
    at Parser.feed
    at StreamWrapper.write [as _write]

Of course, for a real grammar, it would be silly to copy-paste the same rule over and over again. However, this demonstrates that the problem seems to be the size of the grammar, not its complexity.

The grammar file large-grammar.ne in the repro is only 63,024 characters and 2,571 lines.

AnyhowStep commented 3 years ago

I just made a better repro and added it to the repo above.

The following should work, but fails:

Identifier ->
    %identifier

# this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.
# this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.
//snip repeated until 66k characters reached


AnyhowStep commented 3 years ago

I noticed the error always seemed to reference:

at Parser.feed
at StreamWrapper.write [as _write]

So, I changed stream.js to:

StreamWrapper.prototype._write = function write(chunk, encoding, callback) {
    // Log each chunk exactly as the stream hands it to the parser.
    console.log("== chunk ==");
    console.log(chunk.toString());
    this._parser.feed(chunk.toString());
    callback();
};

And I only got two chunks.

The first chunk:

== chunk ==
Identifier ->
    %identifier

# this is a comment with 1,055 characters. //snip
//snip
# this is a comment with 

And the second chunk was:

== chunk ==
1,055 characters. //snip

I decided to modify the large-grammar-3.ne file such that the second chunk would become "#1,055 characters." instead, i.e. so that the chunk boundary falls right before a "#" and the second chunk starts on a valid comment token.

To my surprise, it compiled successfully.

In conclusion, I think the problem is with the chunking. If all my text fits in one "chunk", I do not get the error. If it splits into multiple chunks, and splits them "incorrectly", I get the error. If it splits into multiple chunks, and accidentally splits them "correctly", it works fine.
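
For context, the two-chunk split lines up with Node's defaults: fs.createReadStream delivers data in 64 KiB (65,536-byte) chunks unless highWaterMark is changed, so any grammar file larger than that is guaranteed to be split somewhere, usually in the middle of a token. A quick sketch to see the chunk sizes (the file name is the repro's; nothing here is nearley-specific):

// Sketch: observe how a default read stream chops up a ~66k-character file.
const fs = require("fs");

const stream = fs.createReadStream("large-grammar-3.ne"); // default highWaterMark is 64 KiB
let count = 0;
stream.on("data", function (chunk) {
    count += 1;
    console.log("chunk " + count + ": " + chunk.length + " bytes");
});
stream.on("end", function () {
    console.log("total chunks: " + count); // 2 for a file just over 64 KiB
});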

AnyhowStep commented 3 years ago

I think this might be what's happening (a code sketch follows the list):

  1. Parser.feed("# this is a comment with ")
  2. lexer.reset("# this is a comment with ")
  3. lexer.next() returns token type comment
  4. Parser.feed("1,")
  5. lexer.reset("1,")
  6. lexer.next() freaks out because it has no idea what token that is
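
A minimal sketch of that failure mode, using a toy moo lexer rather than nearley's real bootstrapped one (it assumes moo is installed). As far as I can tell, moo's save()/reset() carry line and column information across chunks but not a partially matched token, so the tail of a split comment gets lexed from scratch and matches nothing:

const moo = require("moo");

// Toy lexer: "#" comments that run to end of line, words, and whitespace.
const lexer = moo.compile({
    ws:      { match: /[ \t\r\n]+/, lineBreaks: true },
    comment: /#[^\n]*/,
    word:    /[a-zA-Z_]+/,
});

// Whole comment in one chunk: one comment token, no problem.
lexer.reset("# this is a comment with 1,055 characters\n");
console.log(lexer.next().type); // "comment"

// Same text split mid-comment, the way the stream splits it:
lexer.reset("# this is a comment with ");
while (lexer.next()) {}     // consumes a (truncated) comment token
const state = lexer.save(); // keeps line/col, not the partial token

lexer.reset("1,055 characters. ", state);
lexer.next();               // throws: "invalid syntax at line ... col ..."
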
AnyhowStep commented 3 years ago

I decided to look at the diff of 2.20.1 and 2.11.2.

I noticed this in the nearley-language-bootstrapped.ne file for 2.11.2: https://github.com/kach/nearley/blob/e476c6dceff5b7f72fe5fe57282e7e72bbbc8423/lib/nearley-language-bootstrapped.ne#L104-L106

Whereas 2.20.1 uses moo's lexer: https://github.com/kach/nearley/blob/6983001d85c3530f08407f955357ff6085806a66/lib/nearley-language-bootstrapped.ne#L18

So, nearley is probably able to handle streaming properly. But moo is not. Adding the moo lexer is probably what introduced this bug for large grammars.

https://github.com/kach/nearley/commit/421e5add9530cf7bf8a46d6feb90b5debbae3d2f#diff-7408903b4278e8ff809145a835b8d3f7a6cbae16011294d37c07d8ba12097c88

https://github.com/kach/nearley/pull/355/files
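
For reference, "handle streaming" here means the usage nearleyc relies on: every stream chunk becomes its own parser.feed() call, so whatever lexer the grammar uses has to resume exactly where the previous chunk stopped. A sketch of that driving loop using nearley's public API (this is not nearleyc's actual wiring, just the same pattern; the bootstrapped grammar module is the one discussed above):

const fs = require("fs");
const nearley = require("nearley");

// The compiled grammar-for-grammars that nearleyc itself uses.
const grammar = nearley.Grammar.fromCompiled(require("nearley/lib/nearley-language-bootstrapped.js"));
const parser = new nearley.Parser(grammar);

const stream = fs.createReadStream("large-grammar.ne");
stream.on("data", function (chunk) {
    // If a token straddles two chunks, a moo-based lexer cannot resume it here.
    parser.feed(chunk.toString());
});
stream.on("end", function () {
    console.log(parser.results[0]);
});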

AnyhowStep commented 3 years ago

I've given it some thought and I can't think of a way to fix this easily. The options are:

As of this writing, my grammar file is about 300k characters in length.


For anyone who happens to come across this in the future:

I replaced 2.20.1's nearley-language-bootstrapped.js with 2.11.2's nearley-language-bootstrapped.js and it worked on my end.

You can use patch-package to create an npm postinstall script that does the replacement.
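
If you'd rather not use patch-package, a plain postinstall script that copies a vendored copy of the 2.11.2 file into place works too. A sketch with hypothetical paths (vendor/nearley-language-bootstrapped-2.11.2.js, and "postinstall": "node scripts/patch-nearley.js" in package.json):

// scripts/patch-nearley.js (hypothetical): after every install, overwrite the
// installed bootstrapped grammar with a vendored copy of the 2.11.2 version.
const fs = require("fs");
const path = require("path");

const source = path.join(__dirname, "..", "vendor", "nearley-language-bootstrapped-2.11.2.js");
const target = path.join(__dirname, "..", "node_modules", "nearley", "lib", "nearley-language-bootstrapped.js");

fs.copyFileSync(source, target);
console.log("nearley: replaced nearley-language-bootstrapped.js with the 2.11.2 version");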

Torcsi commented 3 years ago

A simplistic solution is to increase the highWaterMark. In nearleyc, line 20:

var input = opts.args[0] ? fs.createReadStream(opts.args[0], {highWaterMark: 1024000}) : process.stdin;

With a buffer that large, any grammar under about 1 MB arrives in a single chunk, so no token is ever split across a chunk boundary.