kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License

The streaming implementation does not take into account that a token may be split across two chunks #632

Open · agladysh opened 1 year ago

agladysh commented 1 year ago

Consider this code in the nearleyc implementation: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L29

It feeds the fs stream, via a StreamWrapper instance, into the parser's feed() function as-is.

The Parser.feed() implementation calls lexer.reset() with the new chunk.
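
For context, bin/nearleyc.js does roughly the following (a paraphrased sketch, not a verbatim copy; the module paths and Parser construction are assumptions based on the repo layout on master):

    const fs = require('fs');
    const nearley = require('nearley');
    const StreamWrapper = require('nearley/lib/stream');
    // The grammar-for-grammars that nearleyc parses .ne files with:
    const compiled = require('nearley/lib/nearley-language-bootstrapped');

    const parser = new nearley.Parser(nearley.Grammar.fromCompiled(compiled));
    // Chunk boundaries below are whatever fs picks (typically 64 KiB),
    // and each chunk goes into Parser.feed() as a unit.
    fs.createReadStream('grammar.ne').pipe(new StreamWrapper(parser));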

If the fs stream splits a token between two chunks and we're lucky, we will get a lexer error either at the end of the previous chunk or at the beginning of the next one. (Otherwise the result is valid but may not be what the user expects, as two tokens silently appear in the result instead of one.)
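
The failure mode is easy to see in isolation. Below is a minimal sketch (using a standalone moo lexer, which nearley grammars commonly use; this is illustrative, not nearley's code) of how resetting the lexer per chunk turns one split token into two:

    const moo = require('moo');

    const lexer = moo.compile({
        word: /[A-Z_]+/,
        ws: { match: /\s+/, lineBreaks: true },
    });

    // One logical token, "FOOBAR", arrives split across two stream chunks.
    // Parser.feed() effectively does lexer.reset(chunk, state) per chunk,
    // and moo's saved state carries line/column info, not a partial token.
    for (const chunk of ['FOO', 'BAR']) {
        lexer.reset(chunk);
        for (const tok of lexer) console.log(tok.type, tok.value);
    }
    // Prints "word FOO" then "word BAR": two tokens where the input had one.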

Ideally, fs streams should be handled at the lexer level, not at the parser level. Barring that, the easiest solution would probably be to drop streaming support, at least in nearleyc, and deprecate streaming usage in the API.

In practice, this issue affects users with long grammars and long symbol names.

Here is code to reproduce the issue (it can probably be whittled down to something simpler):

gen-long-grammar.ts

// Emit a `grammar` rule with 400 long alternatives, so the generated .ne
// file is large enough for the fs stream to split it into several chunks.
process.stdout.write(`
grammar ->
`);

// Four rule names, each padded with underscores and X's to ~104 characters.
const rules = [ 'FOO', 'FOOBAR', 'FOOBARBAZ', 'FOOBARBAZQUO' ].map(rule => rule + '_' + 'X'.repeat(100));

for (let i = 0; i < 400; ++i) {
  process.stdout.write(i === 0 ? ' ' : '|');
  // Each alternative is 1 to 50 randomly chosen rule names,
  // separated by a " " literal.
  for (let j = 0; j < (Math.random() * 50 | 0) + 1; ++j) {
    process.stdout.write(' ');
    if (j > 0) {
      process.stdout.write('" " ');
    }
    process.stdout.write(rules[rules.length * Math.random() | 0]);
  }
  process.stdout.write('\n');
}

// Define each rule as a literal of its own name.
for (const rule of rules) {
  process.stdout.write(`\n${rule} -> "${rule}"`);
}

process.stdout.write('\n');

Running this:

npx tsx gen-long-grammar.ts | npx nearleyc

results in an error similar to this (a chunk boundary happens to fall in the middle of one of the long runs of X characters):

Unexpected word token: "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX". 
Instead, I was expecting to see one of the following:
<...>
naraen commented 1 year ago

One way to fix it: the parser already expects production rules to be delimited by newlines. Parser.feed() could stash the characters in the chunk that follow the last newline and prepend them to the next chunk for processing.

Making the change below at https://github.com/kach/nearley/blob/master/lib/nearley.js#L279 works for your grammar.

This should work without additional ripple effects, since nearleyc appends a newline at the end of the feed for good measure: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L31.

    var tailPrevChunk = '';
    Parser.prototype.feed = function(chunk) {
        var idxLastNewline = chunk.lastIndexOf('\n');
        if (idxLastNewline === -1) {
            // No newline in this chunk: stash it all and wait for more input.
            tailPrevChunk += chunk;
            return;
        }

        // Feed everything up to the last newline, prepending the stashed tail;
        // stash the remainder (a possibly incomplete line) for the next chunk.
        var alteredChunk = tailPrevChunk + chunk.substring(0, idxLastNewline);
        tailPrevChunk = chunk.substring(idxLastNewline);

        var lexer = this.lexer;
        lexer.reset(alteredChunk, this.lexerState);

        // . . . (rest of the original feed() body unchanged)

Having shared that, it is worth debating whether supporting arbitrarily long production rules is a good idea. It seems to me that a defined maximum length would be better.

agladysh commented 1 year ago

@naraen Thank you, that is a good idea. It will likely fix parsing of longer .ne files, but will it fix arbitrary user data fed to a parser that nearleyc compiled?

One can imagine a valid data format without any newlines (or even without fixed delimiters). For such corner cases, though, we could let the user configure where to cut off the current chunk.
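
Something along these lines could express that configuration (purely a hypothetical sketch; makeChunkedFeeder and findSafeSplit are invented names, not nearley API):

    // `findSafeSplit` returns the index up to which the buffered input may
    // safely be cut, or 0/-1 to keep buffering until more data arrives.
    function makeChunkedFeeder(parser, findSafeSplit) {
        let tail = '';
        return {
            feed(chunk) {
                const data = tail + chunk;
                const cut = findSafeSplit(data);
                if (cut <= 0) { tail = data; return; } // no safe split point yet
                parser.feed(data.slice(0, cut));
                tail = data.slice(cut);
            },
            end() { // flush whatever is still buffered
                if (tail) parser.feed(tail);
                tail = '';
            },
        };
    }

    // For newline-delimited input such as .ne files, the policy could be
    // "cut after the last newline":
    // const feeder = makeChunkedFeeder(parser, s => s.lastIndexOf('\n') + 1);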

naraen commented 1 year ago

Agreed. This isn't the right location in the code flow for the fix; I wanted a quick proof of concept. Wondering whether https://github.com/kach/nearley/blob/master/lib/stream.js#L14-L15 might be a better place?
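
For instance, something along these lines (a sketch only; it assumes stream.js keeps the parser as this._parser, as it does on master, and the _lastChunk field is invented here):

    StreamWrapper.prototype._write = function write(chunk, encoding, callback) {
        const data = (this._lastChunk || '') + chunk.toString();
        const cut = data.lastIndexOf('\n') + 1; // feed only up to the last newline
        this._lastChunk = data.slice(cut);      // stash the partial trailing line
        this._parser.feed(data.slice(0, cut));
        callback();
    };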

naraen commented 1 year ago

Opened a PR with the change in StreamWrapper. As far as I can tell, only nearleyc uses the StreamWrapper, so this should fix the identified issue without changing the behavior of generated parsers.