kach / nearley

📜🔜🌲 Simple, fast, powerful parser toolkit for JavaScript.
https://nearley.js.org
MIT License
3.59k stars 232 forks

Implementing protocol parsers with Nearley? #586

Open joepie91 opened 3 years ago

joepie91 commented 3 years ago

Hi,

I've been looking into Nearley as an option for implementing parsers for various network protocols - but one thing I can't seem to figure out is how to do an actual streaming parse, not just in the sense of providing inputs piecemeal, but also getting outputs piecemeal.

Typically a network protocol will be interactive, and so it will require the protocol parser to fully parse individual messages (for whatever definition of "messages" the network protocol in question uses) and provide those to the calling code, without "ending" the parser, so to speak.

(An extreme example of a streaming parser, though somewhat ironically not a network protocol, would be sax-style XML parsers, which emit a parse item for every single entered and exited XML tag.)

But as far as I can tell, Nearley is not able to do this with the current parsing API - it seems that it will only provide a final result (or potentially multiple ambiguous results) once, after the entire parse has completed. This would not work with a network protocol in practice, as I would essentially have to write an entirely separate parser (which may need to share logic with the Nearley parser depending on protocol complexity) just to split the incoming byte stream into separate 'messages' to parse.

What would be the right way to approach this with Nearley? Or does it just not support actual streaming parsers yet?
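For readers landing here: the "entirely separate parser" mentioned above is exactly the workaround most people end up with. A small hand-written framer splits the byte stream into complete messages, and each complete message is handed to a parser on its own. A minimal sketch of the framing half, assuming newline-delimited messages (the function names are illustrative, not part of Nearley's API):

```javascript
// Split a growing buffer into complete, delimiter-terminated frames.
// Returns the complete frames plus the unconsumed remainder.
function extractFrames(buffer, delimiter) {
  const parts = buffer.split(delimiter);
  const rest = parts.pop(); // last piece has no trailing delimiter yet
  return { frames: parts, rest };
}

// Each network chunk is appended to the leftover from the previous chunk;
// only complete frames are forwarded to the per-message parser.
let pending = "";
function onChunk(chunk, parseMessage) {
  const { frames, rest } = extractFrames(pending + chunk, "\n");
  pending = rest;
  for (const frame of frames) parseMessage(frame);
}
```

Each frame can then be fed to a fresh Nearley parser whose grammar describes a single message, sidestepping the accumulating-state problem entirely, at the cost of duplicating the framing logic outside the grammar.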

cliffspradlin commented 2 years ago

I had the same issue. I was a bit tricked, because I also assumed "streaming" meant what you describe; by the time I realized nearley.js doesn't work that way, I had a fully working parser, except for this problem.

Actually, you're wrong about one part: nearley.js will give you a partial result after each call to feed(), via the parser's results property. And if you have an ambiguous grammar, you need to fix that, because it will cause all sorts of problems here.

What you want is some way to clear earlier state after you've successfully processed a 'message' or a 'line'. Simply put, nearley.js doesn't support this. I don't know much about parsing theory, but conceptually you would need a way to flag some level of the grammar as continuously growing, so that tokens already consumed there are not preserved the next time feed() is called.

I do have a workaround that works quite well for me in practice. Here's the relevant top-level "ever-growing" part of my grammar:

stream -> line:+ {% id %}
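For context, a complete grammar built around that rule might look like the sketch below; the `line` rule here is an assumption for illustration, not the commenter's actual grammar:

```nearley
# Top-level rule that grows with every line fed in.
stream -> line:+ {% id %}

# Hypothetical line rule: any run of non-newline characters plus the newline.
line -> [^\n]:* "\n" {% ([chars]) => chars.join("") %}
```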

And the parsing code:

  try {
      this.nearley.feed(chunk);
  } catch (err) {
      console.log(`parse error in chunk "${chunk}":`, err);
      return;
  }
  // After feed(), results holds the parse of everything fed since the
  // accumulated state was last cleared (see the hack below).
  const results = this.nearley.results;
  if (results.length === 1) {
      // Unambiguous parse: hand each parsed line to the application.
      for (const tokenLine of results[0]) {
          this._parseLine(tokenLine);
      }
  } else if (results.length > 1) {
      // Ambiguous parse; this should be fixed in the grammar itself.
      console.log("multiple results:", results);
  }

  // Weird hack: clear already-consumed lines from the parser's internal
  // Earley table so that they don't accumulate forever. Note that
  // table, current, and states are undocumented internals.
  const states = this.nearley.table[this.nearley.current].states;
  for (const state of states) {
      if (state.rule.name.startsWith("stream")) {
          state.data = [];
      }
  }

The 'weird hack' at the end prevents lines from accumulating in the top-level rule, and seems to keep memory and performance under control. Hope this helps.
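The hack can be pulled out into a small helper. To be clear, this is my extraction, not part of Nearley's API: it pokes at the undocumented internal Earley table (`parser.table[parser.current].states`), so it may break across Nearley versions.

```javascript
// Drop accumulated child results from any in-progress state belonging to
// the ever-growing top-level nonterminal (e.g. "stream" and the rules
// Nearley generates from it, such as "stream$ebnf$1"). Relies on Nearley's
// undocumented internal table layout and may break between versions.
function clearTopLevelState(parser, topRuleName) {
  const column = parser.table[parser.current];
  for (const state of column.states) {
    if (state.rule.name.startsWith(topRuleName)) {
      state.data = [];
    }
  }
}
```

Called as `clearTopLevelState(this.nearley, "stream")` after each batch of lines has been handed off. Since it depends on internals, it is worth pinning the nearley version and re-checking the behaviour on every upgrade.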