no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Feature request: Emitting EOF #64

Closed: kestereverts closed this issue 7 years ago

kestereverts commented 7 years ago

I'd like to propose emitting an EOF token when the end of the input is reached. This is useful when using moo with nearley (or any other parser generator that does not natively support EOF tokens). This way, the parser can generate an error when the EOF token is unexpected. This will also add information about where the EOF happened, providing a more useful error message.

It could be implemented with the following API:

moo.compile({
  number:  /(0|[1-9][0-9]*)/,
  end_of_input: moo.eof
});

// when using states, each state needs a separate eof
moo.states({
  main: {
    string:  /"((?:\\["\\]|[^\n"\\])*)"/,
    lparen:  {match: '(', push: 'parens'},
    eof: moo.eof,
  },
  parens: {
    number:  /(0|[1-9][0-9]*)/,
    rparen: {match: ')', pop: 1},
    eof_in_parens: {match: moo.eof, value: "a string"} // default value is "<eof>" but it can be changed
  }
});

EOF would be emitted once and only once when the end of the input is reached. Thereafter, calling next() will return undefined as before. EOF will not be emitted when there is no moo.eof present, and it is invalid to have multiple moo.eof entries per group.
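
For illustration, here is how the proposed behaviour might look in use (a hypothetical sketch only: moo.eof does not exist in moo's current API, and the token shape just mirrors moo's existing tokens):

const lexer = moo.compile({
  number: /(0|[1-9][0-9]*)/,
  eof: moo.eof,
})

lexer.reset('42')
lexer.next() // { type: 'number', value: '42', … }
lexer.next() // { type: 'eof', value: '<eof>', … }  emitted exactly once
lexer.next() // undefined, as before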

nathan commented 7 years ago
const itt = require('itt')
const moo = require('moo')

const lexer = moo.compile({
  id: /\w+/,
  sp: {match: /\s+/, lineBreaks: true},
})

lexer.reset('foo bar')
// append a synthetic eof token after moo's own tokens
const tokens = itt.push({ type: 'eof', value: '<eof>' }, lexer)
for (const tok of tokens) {
  console.log(tok)
}

// { type: 'id', … }
// { type: 'sp', … }
// { type: 'id', … }
// { type: 'eof', … }

If you need line/col information it's still pretty easy to roll your own:

lexer.reset('foo bar')
const tokens = withEof(lexer, { type: 'eof', value: '<eof>' })
for (const tok of tokens) {
  console.log(tok)
}

function* withEof(lexer, eof) {
  // yield all of moo's tokens, then one synthetic eof token with position info
  yield* lexer
  yield Object.assign(eof, {
    toString() { return this.value },
    offset: lexer.index,
    size: 0,
    lineBreaks: 0,
    line: lexer.line,
    col: lexer.col,
  })
}
tjvr commented 7 years ago

Thanks for the suggestion! I think, as @nathan points out, this is pretty easy to add yourself in a stage on top of Moo, so we don't want to include it in Moo core. Sorry! :-)

kestereverts commented 7 years ago

Thanks, @tjvr and @nathan. Appending EOF to the token stream is a possible solution, but then you no longer have access to moo's API, which is an important aspect of this proposal.

nearley can use moo as a lexer with just one statement in its grammar:

@lexer your_moo_instance

This instance has to comply with nearley's custom lexer interface, which moo does. With the suggested solution above you would have to replicate moo's API yourself, roughly as sketched below, which is why I thought emitting EOF tokens would be an elegant solution. Thank you for considering it, though!
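
For illustration, a rough sketch of the kind of wrapper that would be needed (not an actual implementation: the class name and the shape of the synthetic token are made up; the forwarded methods are the ones nearley's custom lexer interface documents, i.e. reset, next, save, formatError, has):

const moo = require('moo')

class EofLexer {
  constructor(lexer) {
    this.lexer = lexer
    this.emittedEof = false
  }
  reset(chunk, info) {
    // forward to moo, and allow a fresh eof to be emitted for the new chunk
    this.emittedEof = false
    this.lexer.reset(chunk, info)
    return this
  }
  next() {
    const tok = this.lexer.next()
    if (tok !== undefined || this.emittedEof) return tok
    // moo is exhausted: emit one synthetic eof token carrying position info
    this.emittedEof = true
    return {
      type: 'eof',
      value: '<eof>',
      text: '',
      offset: this.lexer.index,
      lineBreaks: 0,
      line: this.lexer.line,
      col: this.lexer.col,
    }
  }
  save() { return this.lexer.save() }
  formatError(token, message) { return this.lexer.formatError(token, message) }
  has(name) { return name === 'eof' || this.lexer.has(name) }
}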

nathan commented 7 years ago

@tjvr this one might be useful/trivial enough to include for nearley, especially since there's no runtime cost when tokenizing (just a check inside the EOF if).

tjvr commented 7 years ago

just a check inside the EOF if

You're right, this would be cheap (although you'd also need to keep track of whether you'd already emitted the EOF token).

But I don't see how this benefits Nearley specifically. Using EOF tokens inside a CFG is fairly unusual IMHO; it's usually not what you want (unlike in a PEG where it might make more sense).

nathan commented 7 years ago

Using EOF tokens inside a CFG is fairly unusual IMHO

I've never used nearley and assumed it was reasonable, but it might not be. What does nearley do when next() returns undefined, or with a parse that doesn't consume all of the input?

tjvr commented 7 years ago

In nearley, next() returning undefined indicates EOF, which is really just the end of the chunk passed to feed(). If the parse isn't complete at that point, calling finish() will give you zero results.
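
Concretely, something like this (a minimal sketch; grammar.js stands in for a compiled nearley grammar, and the input is just an example):

const nearley = require('nearley')
const grammar = require('./grammar.js') // hypothetical compiled grammar

const parser = new nearley.Parser(nearley.Grammar.fromCompiled(grammar))
parser.feed('foo bar')        // the lexer runs until next() returns undefined
console.log(parser.finish())  // [] if no complete parse covers the fed input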

nathan commented 7 years ago

Ah, then this would probably be superfluous.