no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Proposal: default token types #98

Closed nathan closed 5 years ago

nathan commented 5 years ago

Upon further consideration, #88 seems like a good idea. Markdown(ish) syntaxes are the obvious motivating example: arbitrary text in which certain embedded sequences have special meanings, and in which incomplete special sequences should be passed through verbatim.

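const moo = require('moo')

const lexer = moo.compile({
  // paragraph break: two or more consecutive line endings
  para: {lineBreaks: true, match: /(?:\r?\n|\r){2,}/},
  // issue reference such as #88; the value drops the leading '#'
  issu: {match: /#\d+/, value: s => s.slice(1)},
  // opening (lstr) and closing (rstr) strong-emphasis delimiters
  lstr: /\*\*(?=\S)|__(?=\S)/,
  rstr: /\*\*(?=\s|$)|__(?=\s|$)/,
  // backslash escape; the value is the escaped character
  escp: {match: /\\./, value: s => s.slice(1)},
  // proposed in this issue: everything not matched by another rule
  text: moo.default,
})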

lexer.reset(`
Upon **further consideration,** #88 seems like a good idea.

Markdown(ish) syntaxes are the obvious motivating example…
`.trim())

console.log([...lexer]) /*
[ { type: 'text', value: 'Upon ' },
  { type: 'lstr', value: '**' },
  { type: 'text', value: 'further consideration,' },
  { type: 'rstr', value: '**' },
  { type: 'text', value: ' ' },
  { type: 'issu', value: '88' },
  { type: 'text', value: ' seems like a good idea.' },
  { type: 'para', value: '\n\n' },
  { type: 'text', value: 'Markdown(ish) syntaxes are the obvious motivating example…' } ]
*/

@tjvr Feel free to bikeshed the name. (fill might be better?)

moranje commented 5 years ago

Sounds great to me, this will make the parsing of my text based language a lot easier. Two things:

  1. The name moo.default might clash with how the Node module is read as an ES6 module, e.g. const moo = require('moo').default. I would go with defaultToken, but anything is okay really.
  2. Is the order important when specifying a default token?

nathan commented 5 years ago

> The name moo.default might clash with how the Node module is read as an ES6 module

AFAIK the de facto rule is to do that conservatively or not at all when the exports don't contain __esModule: true. But again, there's probably a better name than default regardless.
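A rough sketch of the interop rule referred to above, assuming a Babel-style helper. The helper below is a simplified copy of what Babel emits for a default import compiled to CommonJS; fakeMoo and isDefaultToken are hypothetical stand-ins for illustration, not moo's real exports.

```javascript
// Simplified version of the helper Babel emits for `import moo from 'moo'`
// when compiling to CommonJS.
function interopRequireDefault(obj) {
  return obj && obj.__esModule ? obj : { default: obj };
}

// Hypothetical stand-in for moo's CommonJS exports (no __esModule flag),
// with an export that happens to be named `default`.
const fakeMoo = { compile() {}, error: {}, default: { isDefaultToken: true } };

// Without __esModule the helper wraps the whole exports object, so the
// default import is the module itself, and its `default` member stays
// reachable as an ordinary property.
const imported = interopRequireDefault(fakeMoo).default;
console.log(imported === fakeMoo);            // true
console.log(imported.default.isDefaultToken); // true
```

Other bundlers and loaders may apply different heuristics, which is one reason a less ambiguous name is still attractive.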

> Is the order important when specifying a default token?

No; this matches the behavior of moo.error.
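One way to see why the position would not matter is a hand-rolled sketch in which the default type is never part of the combined pattern: it is simply whatever text lies between matches of the other rules. This is not moo's implementation; lexWithFallback and its parameters are hypothetical, and it assumes the rules contain no capturing groups.

```javascript
// Hypothetical helper, not moo's actual implementation.
// Assumes every rule is a RegExp without capturing groups.
function lexWithFallback(rules, fallbackType, input) {
  const names = Object.keys(rules);
  // One capture group per rule; the fallback is not in the alternation.
  const re = new RegExp(names.map(n => `(${rules[n].source})`).join('|'), 'g');
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    re.lastIndex = pos;
    const m = re.exec(input); // earliest match at or after pos, or null
    const end = m ? m.index : input.length;
    // Any skipped text becomes a single fallback token.
    if (end > pos) tokens.push({ type: fallbackType, value: input.slice(pos, end) });
    if (!m) break;
    if (m[0].length === 0) throw new Error('rule matched the empty string');
    // The first defined capture group identifies which rule matched.
    const i = m.findIndex((g, k) => k > 0 && g !== undefined);
    tokens.push({ type: names[i - 1], value: m[0] });
    pos = re.lastIndex;
  }
  return tokens;
}

const rules = { issu: /#\d+/, escp: /\\./ };
const tokens = lexWithFallback(rules, 'text', 'see #88 and \\# ok');
console.log(tokens.map(t => t.type).join(' ')); // text issu text escp text
```

Because the fallback type is passed separately from the rules, there is no position for it to occupy among them.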

tjvr commented 5 years ago

This is a great idea! I'll need to think about the name. 😊

tjvr commented 5 years ago

Just to confirm: will /foo|bar/g try to match foo at each index in the buffer, and only once that fails, attempt to match bar? (Which would be bad.)


nathan commented 5 years ago

@tjvr No. The exec algorithm explicitly works by attempting to match the whole RegExp at each string index in turn (AdvanceStringIndex is just +1 for non-unicode RegExps), so it finds the earliest match of any complete path through the RegExp, including one that starts at the current lastIndex if there is a match there.
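A quick check of that behaviour in plain JavaScript; the input string here is made up for illustration.

```javascript
const re = /foo|bar/g;
const input = 'xx bar foo';

// Both alternatives are tried at each index before moving on, so the
// earlier 'bar' wins even though 'foo' is listed first in the pattern.
re.lastIndex = 0;
const first = re.exec(input);
console.log(first[0], first.index); // bar 3

// The next call continues from the updated lastIndex.
const second = re.exec(input);
console.log(second[0], second.index); // foo 7

// A match that starts exactly at lastIndex is found as well.
re.lastIndex = 3;
const atIndex = re.exec(input);
console.log(atIndex[0], atIndex.index); // bar 3
```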

tjvr commented 5 years ago

I love how simple this is. ❤️

moranje commented 5 years ago

Just tested this on my codebase, it's working as intended.