no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Add Lexer#stream() #36

Open nathan opened 7 years ago

nathan commented 7 years ago

The reason we shouldn't just make Lexer a transform stream is so you can reuse a single lexer on multiple input streams.

However: with this implementation, horrible things will happen if you call stream() again before the first one is closed; specifically, the two streams will be interleaved based on which input stream is readable first. If you don't use states, you might not notice this unless you look at the offset/line/column information.

Should stream() automatically clone the lexer to avoid this behavior? Should we make Lexer a transform stream so you can just call clone() to get a new, independent stream?

stream() automatically calls reset() because switching to streaming halfway through your input doesn't sound like a good idea.
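
For reference, a minimal sketch of what calling this could look like (assuming stream() returns an object-mode stream that takes string chunks in and emits token objects, as in this PR; the rules and input are just illustrative):

const moo = require('moo')

const lexer = moo.compile({
  word: /[a-z]+/,
  ws:   / +/,
  nl:   { match: '\n', lineBreaks: true },
})

// stream() resets the lexer and returns a stream: write strings in, read tokens out.
const tokens = lexer.stream()
tokens.on('data', tok => console.log(tok.type, tok.value))
tokens.write('foo bar\n')   // chunks should end at token boundaries
tokens.end('baz\n')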

Closes #20.

tjvr commented 7 years ago

switching to streaming halfway through your input doesn't sound like a good idea.

Agreed!


This doesn't do anything about token boundaries, does it? I recall you raised some concerns about a stream API from #18:

the API is too generic to allow a useful stream wrapper.

it's incompatible with a stream API that doesn't re-lex the stream when it gets new data.

If you leave the task of pushing data only at token boundaries up to the user, then it's easy to add a stream interface (that's what feed() is, essentially). Packages like stream-split are good for, e.g., pushing the data line-by-line.
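
As a sketch of that usage (assuming a feed() method that appends a chunk, which is how this thread refers to it; the rules are illustrative), feeding whole lines keeps every chunk on a token boundary:

const moo = require('moo')

const lexer = moo.compile({
  word: /[a-z]+/,
  ws:   / +/,
  nl:   { match: '\n', lineBreaks: true },
})

// Each chunk ends at a token boundary (the newline), so lexing chunk-by-chunk is safe.
for (const line of ['foo bar\n', 'baz qux\n']) {
  lexer.feed(line)                      // assumed: appends input to the lexer
  let tok
  while ((tok = lexer.next()) !== undefined) {
    console.log(tok.type, tok.value)
  }
}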

It's also possible, I guess, to re-lex the last token when we get new data, but you can still break this if you have two tokens which, when concatenated, form a different token; and this makes the API weird because you need to sometimes return a token twice.

nathan commented 7 years ago

This doesn't do anything about token boundaries, does it?

Correct. It is up to the user to push data at token boundaries.

tjvr commented 7 years ago

As for

Should stream() automatically clone the lexer to avoid this behavior?

Probably, yes. We haven't needed to yet, but it would probably make sense for Lexers to keep track of lastIndex themselves; then clone() is cheap, since we don't have to re-compile the RegExp.
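
For illustration, a minimal sketch of what that could look like; every name here (groups, re, buffer, lastIndex) is hypothetical and only shows the idea of sharing the compiled RegExp while keeping position state on the lexer:

// Hypothetical sketch, not moo's actual internals.
class Lexer {
  constructor(groups, re) {
    this.groups = groups   // shared token definitions
    this.re = re           // shared compiled sticky (/y) RegExp
    this.buffer = ''
    this.lastIndex = 0     // tracked on the lexer, not on the RegExp
    this.line = 1
    this.col = 1
  }
  clone() {
    // Cheap: no re-compilation, just a fresh lexer over the same RegExp.
    return new Lexer(this.groups, this.re)
  }
}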

nathan commented 7 years ago

Probably, yes.

Done.

tjvr commented 7 years ago

Are streams compelling? Would you use them? :-)

nathan commented 7 years ago

It is quite useful to be able to do this:

fs.createReadStream(wherever)
  .pipe(split(/(\n)/))
  .pipe(lexer.clone())
  .pipe(new Parser())
  .on('data', console.log)

(modulo disgusting .on('error', …)s everywhere because the stream API is broken)

But it's somewhat more difficult—or perhaps just not conventional—to write Parser like this than it is to write a recursive descent parser.

EDIT: nearley might make this easier since it retains parse state between feed()s.

Here's an example that parses a sequence of s-expression-like things such as (foo (bar baz (qux))), and emits each one as it's parsed.

const {Transform} = require('stream')

class Parser extends Transform {
  constructor() {
    super({objectMode: true})
    this.stack = []    // enclosing, still-open lists
    this.result = null // the list currently being built
  }
  process(tok) {
    switch (tok.type) {
      case 'lpar': {
        // Start a new list; nest it inside the current one, if any.
        const inner = []
        if (this.result) {
          this.result.push(inner)
          this.stack.push(this.result)
        }
        this.result = inner
        break
      }
      case 'rpar':
        if (!this.stack.length) {
          if (!this.result) {
            throw new Error("I'm not gonna match your parentheses for you")
          }
          // A top-level expression is complete: emit it downstream.
          this.push(this.result)
        }
        this.result = this.stack.pop()
        break
      case 'word':
        if (!this.result) {
          throw new Error("They go inside the parentheses")
        }
        this.result.push(tok.value)
        break
      case 'space': break
    }
  }
  flush() {
    if (this.result) {
      throw new Error("Aren't you forgetting something?")
    }
  }
  _transform(chunk, _, cb) {
    try {
      this.process(chunk)
      cb()
    } catch(e) {cb(e)}
  }
  _flush(cb) {
    try {
      this.flush()
      cb()
    } catch(e) {cb(e)}
  }
}
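
For context, this is roughly how it would be wired up end-to-end (a sketch: the lexer rules, file name, and the require for the line splitter are illustrative, and piping lexer.clone() relies on the stream support proposed in this PR):

const fs = require('fs')
const moo = require('moo')
const split = require('split') // or whichever line-splitting package you prefer

const lexer = moo.compile({
  lpar:  '(',
  rpar:  ')',
  word:  /[a-z]+/,
  space: { match: /\s+/, lineBreaks: true },
})

fs.createReadStream('sexprs.txt', 'utf8')
  .pipe(split(/(\n)/))
  .pipe(lexer.clone())     // independent token stream, per this PR
  .pipe(new Parser())
  .on('data', expr => console.log(expr)) // e.g. [ 'foo', [ 'bar', 'baz', [ 'qux' ] ] ]
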
nathan commented 7 years ago

@tjvr Another option for the stream API is to buffer input until we get a regex match that doesn't extend to the end of the buffer—with an optional maximum buffer size like 1MB—and to have a method that signals no more input, or perhaps better a flag to feed() like TextDecoder#decode:

lexer.feed('some inp', {stream: true})
lexer.next() // --> word some
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed('ut some more input', {stream: true})
lexer.next() // --> word input
// ...
lexer.next() // --> word more
lexer.next() // --> ws
lexer.next() // --> undefined
lexer.feed(' the last of the input')
// ...
lexer.next() // --> word input
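
The heart of that behaviour could be sketched like this (illustrative only, not the PR's code; tryMatch and its parameters are made up for the example):

// `re` must be a sticky (/y) RegExp. Returns the match, or undefined if the
// token should be withheld because it reaches the end of the buffer.
function tryMatch(re, buffer, index, streaming) {
  re.lastIndex = index
  const m = re.exec(buffer)
  if (m === null) return undefined
  // While streaming, a match that runs to the end of the buffer might be
  // extended by the next chunk (e.g. 'inp' + 'ut' -> 'input'), so hold it back.
  if (streaming && re.lastIndex === buffer.length) return undefined
  return m
}

// tryMatch(/[a-z]+|\d+| +/y, 'some inp', 5, true) // undefined (holds 'inp')
// tryMatch(/[a-z]+|\d+| +/y, 'some inp', 0, true) // match for 'some'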

With streams, that would look like this:

const {Readable} = require('stream')

const rs = new Readable({read() {}})
rs.push('some inp')
rs.push('ut more input')
rs.push(' the last of the input')
rs.push(null) // signal end of input

rs.pipe(lexer.clone())

tjvr commented 7 years ago

buffer input until we get a regex match that doesn't extend to the end of the buffer

I thought you said that still isn't correct? https://github.com/tjvr/moo/issues/18#issuecomment-286163965

nathan commented 7 years ago

I thought you said that still isn't correct?

It's not correct, though it would give the correct result in the example I gave. Where it wouldn't give the correct result is if you had regexes like /[a-z]+/ and /\d+(?:[Ee]\d+)?/ and fed the input as 103e followed by 23, because the latter regex would match 103, which stops before the end of the buffer.

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine. You can always let the number regex match 103e and do the validation at parse time.
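
To make that concrete, here is the counterexample spelled out with plain sticky regexes (an illustration, not moo code; the relaxed regex is just one possible way to make every prefix a valid token):

// The problematic pair of rules:
const word   = /[a-z]+/y
const number = /\d+(?:[Ee]\d+)?/y

// Intended input '103e23' (one number token), arriving as '103e' then '23'.
number.lastIndex = 0
console.log(number.exec('103e')[0])    // '103' -- stops before the buffer end,
                                       // so the heuristic emits it: wrong split.

// A prefix-closed alternative: let the number rule accept '103e' too and
// validate the exponent at parse time.
const relaxed = /\d+[Ee]?\d*/y
relaxed.lastIndex = 0
console.log(relaxed.exec('103e')[0])   // '103e' -- reaches the buffer end, so withheld
relaxed.lastIndex = 0
console.log(relaxed.exec('103e23')[0]) // '103e23' -- correct single token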

tjvr commented 7 years ago

for every valid token, all non-empty prefixes of that token must also parse as a single token

Okay. That seems like a good way to do this. :-)

tjvr commented 7 years ago

The rule here is that for every valid token, all non-empty prefixes of that token must also parse as a single token. As long as that's true, the buffering method works fine.

Actually, this is less reasonable than I first thought :(

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

nathan commented 7 years ago

Doesn't that mean that given the language foo | foobar, we wouldn't correctly lex foob, ar?

Correct.
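
To spell that out (an illustration of the buffering heuristic's behaviour, not moo code):

// Language: foo | foobar. Input 'foobar' arrives as 'foob' then 'ar'.
const rule = /foobar|foo/y

rule.lastIndex = 0
console.log(rule.exec('foob')[0])  // 'foo' -- stops before the buffer end, so the
                                   // heuristic emits it, leaving 'b' in the buffer
rule.lastIndex = 0
console.log(rule.exec('bar'))      // null -- 'b' + 'ar' can never recover; the
                                   // correct lexing was the single token 'foobar'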