Open AnyhowStep opened 3 years ago
I just made a better repro and added it to the repo above.
The following should work, but fails,
Identifier ->
%identifier
# this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.
# this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.this is a comment with 1,055 characters. this is a comment with 1,055 characters.
//snip repeated until 66k characters reached
I noticed the error seemed to always reference,
at Parser.feed
at StreamWrapper.write [as _write]
So, I changed stream.js to,
StreamWrapper.prototype._write = function write(chunk, encoding, callback) {
    // Log every chunk before it is handed to the parser
    console.log("== chunk ==");
    console.log(chunk.toString());
    this._parser.feed(chunk.toString());
    callback();
};
And I only got two chunks.
The first chunk,
== chunk ==
Identifier ->
%identifier
# this is a comment with 1,055 characters. //snip
//snip
# this is a comment with
And the second chunk was,
== chunk ==
1,055 characters. //snip
I decided to modify the large-grammar-3.ne file such that the second chunk would become #1,055 characters.
To my surprise, it compiled successfully.
In conclusion, I think the problem is with the chunking. If all my text fits in one "chunk", I do not get the error. If it splits into multiple chunks, and splits them "incorrectly", I get the error. If it splits into multiple chunks, and accidentally splits them "correctly", it works fine.
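The chunking behaviour described above can be imitated with a toy lexer that starts from scratch on every chunk. This is only a sketch of the suspected failure mode, not moo's or nearley's actual code; `RULES`, `lexChunk`, and `tryFeed` are made-up names, and the token rules are a rough guess at a tiny subset of the grammar syntax.

```javascript
// Toy stand-in for a lexer that keeps no state between chunks.
// NOT moo's API: every name and rule here is an assumption for illustration.
const RULES = [
  ["comment", /#[^\n]*/y],
  ["ws", /\s+/y],
  ["arrow", /->/y],
  ["token", /%[A-Za-z_]\w*/y],
  ["word", /[A-Za-z_]\w*/y],
];

// Tokenize one chunk on its own, with no memory of earlier chunks.
function lexChunk(chunk) {
  const tokens = [];
  let pos = 0;
  while (pos < chunk.length) {
    let matched = false;
    for (const [type, re] of RULES) {
      re.lastIndex = pos; // sticky regex: only matches exactly at pos
      const m = re.exec(chunk);
      if (m) {
        tokens.push({ type, text: m[0] });
        pos = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new Error("invalid syntax at " + JSON.stringify(chunk.slice(pos, pos + 10)));
    }
  }
  return tokens;
}

// Feed the chunks one by one and report whether lexing succeeded.
function tryFeed(chunks) {
  try {
    chunks.forEach(lexChunk);
    return "ok";
  } catch (e) {
    return "error: " + e.message;
  }
}

const grammar = "Identifier ->\n    %identifier\n# this is a comment with 1,055 characters.\n";
const atHash = grammar.indexOf("#");      // a "correct" split point
const midComment = grammar.indexOf("1,"); // an "incorrect" split point

console.log(tryFeed([grammar]));                                                   // one chunk
console.log(tryFeed([grammar.slice(0, atHash), grammar.slice(atHash)]));           // split before "#"
console.log(tryFeed([grammar.slice(0, midComment), grammar.slice(midComment)]));   // split inside the comment
```

Only the last call reports an error: the second chunk starts with `1,`, which no rule matches once the leading `#` has been left behind in the previous chunk.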
I think this might be what's happening,
Parser.feed("# this is a comment with ")
lexer.reset("# this is a comment with ")
lexer.next()
returns token type comment
Parser.feed("1,")
lexer.reset("1,")
lexer.next()
freaks out because it has no idea what token that is
I decided to look at the diff of 2.20.1 and 2.11.2.
I noticed this in the nearley-language-bootstrapped.ne file for 2.11.2,
https://github.com/kach/nearley/blob/e476c6dceff5b7f72fe5fe57282e7e72bbbc8423/lib/nearley-language-bootstrapped.ne#L104-L106
Whereas 2.20.1 uses moo's lexer, https://github.com/kach/nearley/blob/6983001d85c3530f08407f955357ff6085806a66/lib/nearley-language-bootstrapped.ne#L18
So, nearley is probably able to handle streaming properly. But moo is not.
Adding the moo lexer is probably what introduced this bug for large grammars.
I've given it some thought and I can't think of a way to fix this easily. The options are,
Parser.feed()
)moo
and revert back to a pure nearley solution
As of this writing, my grammar file is about 300k characters in length.
For anyone who happens to come across this in future,
I replaced 2.20.1's nearley-language-bootstrapped.js with 2.11.2's nearley-language-bootstrapped.js and it worked on my end.
You can use patch-package to create an npm postinstall script that does the replacement.
A simplistic solution is to increase highWaterMark. In nearleyc, line 20,
var input = opts.args[0] ? fs.createReadStream(opts.args[0], {highWaterMark: 1024000}) : process.stdin;
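To see why raising highWaterMark helps, here is a small sketch that counts how many chunks fs.createReadStream emits for the same file at the default file-stream buffer size (64 KiB) and at the raised value. The `countChunks` helper and the temporary demo file are assumptions made up for this illustration; they are not part of nearley.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Count how many "data" events (chunks) a read stream emits for a given highWaterMark.
function countChunks(filePath, highWaterMark) {
  return new Promise((resolve, reject) => {
    let chunks = 0;
    fs.createReadStream(filePath, { highWaterMark })
      .on("data", () => { chunks += 1; })
      .on("end", () => resolve(chunks))
      .on("error", reject);
  });
}

// Hypothetical demo file: 300,000 characters of comment lines.
const demoFile = path.join(os.tmpdir(), "chunk-demo.ne");
fs.writeFileSync(demoFile, "# this is a comment\n".repeat(15000));

async function main() {
  const withDefault = await countChunks(demoFile, 64 * 1024); // default size for fs read streams
  const withRaised = await countChunks(demoFile, 1024000);    // value from the workaround above
  console.log({ withDefault, withRaised });
  return { withDefault, withRaised };
}

const result = main();
```

With the larger buffer the file should normally arrive in far fewer chunks, so there are fewer boundaries that can land inside a token. This only hides the problem, though: a grammar file larger than the chosen highWaterMark would still be split.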
Repro, https://github.com/AnyhowStep/nearley-2.20.1-large-grammar-file-bug
Original comment, https://github.com/kach/nearley/issues/358#issuecomment-714082295
Workarounds,
Use nearley@2.11.2
Copy 2.11.2's nearley-language-bootstrapped.js and replace 2.20.1's nearley-language-bootstrapped.js with it
Given enough repetitions of the following (or a large enough grammar file), nearleyc will fail.
Of course, for a real grammar, it would be silly to copy-paste the same rule over and over again. However, this is a demonstration that the problem seems to be the size of the grammar, and not the complexity of the grammar.
The grammar file large-grammar.ne in the repro has only 63,024 characters and 2,571 lines.