agladysh opened this issue 1 year ago
One way to fix ... The parser already expects production rules to be delimited by newlines. `Parser.feed()`
could stash the characters in the chunk that are not terminated by a newline and prepend them to the next chunk for processing.
Making the change below at https://github.com/kach/nearley/blob/master/lib/nearley.js#L279 works for your grammar.
This should work without additional ripple effects, since nearleyc
appends a newline at the end of the feed for good measure - https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L31.
```js
var tailPrevChunk = '';

Parser.prototype.feed = function(chunk) {
    var idxLastNewline = chunk.lastIndexOf('\n');
    if (idxLastNewline === -1) {
        // No newline in this chunk yet; stash it for the next feed.
        tailPrevChunk += chunk;
        return;
    }
    // Feed everything up to and including the last newline; keep the rest.
    // (Splitting after the newline, rather than before it, ensures the final
    // newline that nearleyc appends is actually fed and not left stranded.)
    var alteredChunk = tailPrevChunk + chunk.substring(0, idxLastNewline + 1);
    tailPrevChunk = chunk.substring(idxLastNewline + 1);
    var lexer = this.lexer;
    lexer.reset(alteredChunk, this.lexerState);
    // ...
```
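The stash-and-prepend idea can also be illustrated independently of nearley as a small buffering helper. This is only a sketch; `makeLineBuffer`, `feed`, and `flush` are hypothetical names, not part of nearley's API:

```javascript
// Sketch of the stash-and-prepend technique: only newline-terminated text is
// delivered downstream; a trailing partial line waits for the next chunk.
function makeLineBuffer(onComplete) {
  let tail = ''; // characters not yet terminated by a newline

  return {
    feed(chunk) {
      const idx = chunk.lastIndexOf('\n');
      if (idx === -1) {
        // No newline at all: stash the whole chunk.
        tail += chunk;
        return;
      }
      // Emit everything up to and including the last newline.
      onComplete(tail + chunk.substring(0, idx + 1));
      tail = chunk.substring(idx + 1);
    },
    flush() {
      // Emit any trailing partial line at end of input.
      if (tail) onComplete(tail);
      tail = '';
    },
  };
}
```

Feeding `'a ->\n b'` and then `'c\n'` delivers `'a ->\n'` first and `' bc\n'` second, so a rule split across chunks reaches the consumer whole.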
Having shared that, it is worth debating whether supporting arbitrarily long production rules is a good idea. It seems to me that having a defined maximum length would be sensible.
@naraen Thank you, it is a good idea. This will likely fix parsing longer .ne files, but will it fix arbitrary user data with a Parser that nearleyc compiled?
One can imagine a valid data format without any newlines (or even set delimiters). For such corner cases, though, we may let the user configure when to stop working with the current chunk.
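That configurable variant could look something like the following. This is a sketch only, assuming a string delimiter chosen by the user; `makeDelimitedBuffer` is a hypothetical helper, not an existing nearley API:

```javascript
// Sketch: let the caller choose the chunk boundary, for data formats that
// are not newline-delimited. `delimiter` is a user-supplied string.
function makeDelimitedBuffer(delimiter, onComplete) {
  let tail = ''; // buffered text with no safe boundary yet

  return {
    feed(chunk) {
      tail += chunk;
      const idx = tail.lastIndexOf(delimiter);
      if (idx === -1) return; // no complete unit yet; keep buffering
      // Emit everything up to and including the last delimiter.
      onComplete(tail.substring(0, idx + delimiter.length));
      tail = tail.substring(idx + delimiter.length);
    },
    flush() {
      // Hand any remainder to the consumer at end of input.
      if (tail) onComplete(tail);
      tail = '';
    },
  };
}
```

For semicolon-delimited data, `makeDelimitedBuffer(';', parse)` would deliver `'a;'` and then `'bc;'` for the chunks `'a;b'` and `'c;'`. A format with no delimiters at all still cannot be chunked safely this way, which is the corner case noted above.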
Agreed, this isn't the right location in the code flow for the fix; I wanted to do a quick proof of concept. Wondering whether https://github.com/kach/nearley/blob/master/lib/stream.js#L14-L15 might be a better place?
Opened a PR with the change in `StreamWrapper`. As far as I can tell, only nearleyc uses the `StreamWrapper`, so this should fix the issue identified without changing the behavior of the generated parser.
Consider this code in the `nearleyc` implementation: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L29

It feeds the fs stream via a `StreamWrapper` instance to the parser's `feed()` function as is. The `Parser.feed()` implementation calls `lexer.reset()` with the new chunk. If the fs stream splits a token between two chunks, and we're lucky, we will get a lexer error either at the end of the previous chunk or at the beginning of the next chunk. (Otherwise the result is valid, but may not be what the user expects, as two tokens instead of one appear silently in the result.)

Ideally, fs streams should be handled at the lexer level, not the parser level. Barring that, the easiest solution would probably be to drop streaming support, at least in `nearleyc`, and deprecate the use of streaming in the API.

In practice, this issue affects users with longer grammars with longer symbol names.
Here is the code to reproduce (probably can be whittled down to something simpler):
gen-long-grammar.ts
Running this results in an error similar to this: