Proposal: inheritance for stateful lexers

no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

BSD 3-Clause "New" or "Revised" License

817 stars 65 forks source link

Proposal: inheritance for stateful lexers #93

Closed nathan closed 5 years ago

nathan commented 5 years ago

PR so you can play around with it. I'll add tests if this seems like the right idea.

This adds a new include rule, which allows you to include all the rules from another state. include behaves as follows:

Each rule is included at most once.
Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).
Including a state in itself does nothing.
Direct and indirect cyclical inclusions are allowed.

Here's an example of parsing JS-like templates:

const lexer = moo.states({
  $all: { err: moo.error },
  main: {
    include: 'std',
  },
  brace: {
    include: 'std',
    rbrace: { match: '}', pop: 1 },
  },
  template: {
    include: 'std',
    tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
    tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
  },
  std: {
    include: ['comment', 'ws'],
    id: /[A-Za-z]\w*/,
    op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
    tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
    tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
    str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
    lbrace: { match: '{', push: 'brace'},
  },
  ws: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  comment: {
    lc: /\/\/.+/,
    bc: /\/\*[^]*?\*\//,
  },
})

lexer.reset('`just ` + /* comment */ // line\n`take ${one} and ${a}${two} and a` + {three: `${{four: five}}}`} / "six"')
for (const tok of lexer) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

Here's how it behaves with cycles:

const lex2 = moo.states({
  $all: {
    ws: { match: /\s+/, lineBreaks: true },
  },
  a: {
    a: /a\w/,
    switch: { match: '|', next: 'b' },
    include: 'b',
  },
  b: {
    b: /\wb/,
    switch: { match: '|', next: 'a' },
    include: 'a',
  },
})
lex2.reset('ab ac bb ac cb | ab ac bb ac cb')
for (const tok of lexer) {
  if (tok.type !== 'ws') console.log(tok.type, JSON.stringify(tok.value))
}

tjvr commented 5 years ago

This is cool, thanks for investigating this!

My first thought is: wow, that's a fairly sophisticated example tokenizer you have there (the JS-template-like one). I'm not sure I could have written it myself; I wonder if we've made a tool that only you can use correctly ;-)

(As a minor point of style, I'd have kept the rbrace rule in std, and have my parser generate "unexpected }" errors.)

Each rule is included at the earliest possible position in the rule list (i.e., the position where it is most able to match).

Can you explain what you mean by this? Order is reasonably important in Moo lexers, so it would be good to understand this (and make sure we get the semantics right). Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

I'm tempted to recommend writing this as a separate transform -- e.g. moo.withStates() -- which takes in a set of states with includes, and emits a set of states with all the includes expanded. But mostly I'd just find this easier to comprehend and test; there's no reason the public API has to be broken up this way.

nathan commented 5 years ago

Is it that in normal behaviour, the rules are inserted in place of the include; and the "earliest possible position" constraint only applies if a rule would be included more than once?

Correct. So if std includes comment, and I include comment before I include std, the rules from comment match at the level of the comment inclusion.

I suppose using a dictionary for this means we can't allow multiple include directives in different places in the rule list of a state. I'm not sure whether this would be common in practice.

That's a good point. I can think of two solutions:

If that case seems uncommon enough, just use the array rule notation we already support:
```
state: [
{include: 'comment'},
{name: 'id', match: /\w+/},
{include: 'std'},
// ...
],
```

Otherwise, switch to or allow a notation like this:

state: {
include_comment: 1,
id: /\w+/,
include_std: 1,
// ...
},

(FWIW, include rules already optionally accept an array of states to include, e.g., {include: ['ws', 'comment']}, so this is only necessary if you want to include in two different places in the rule ordering.)

The include resolver is pretty separable; it's just this chunk of compileStates. Feel free to relegate it to its own function for testing.

tjvr commented 5 years ago

I'd missed your suggestion of $all there.

I think this is cool, and includes are important! However, is there a good reason not to use JavaScript's object spread syntax?

const ws = {
  ws: { match: /\s+/, lineBreaks: true },
}

const comment = {
  lc: /\/\/.+/,
  bc: /\/\*[^]*?\*\//,
}

const std = {
    ...comment,
    ...ws,
    id: /[A-Za-z]\w*/,
    op: /[!=]==|\+[+=]?|-[-=]|<<=?|>>>?=?|&&?|\|\|?|[<>!=/*&|^%]=|[~!,/*^?:%]/,
    tbeg: { match: /`(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2), push: 'template' },
    tsim: { match: /`(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1) },
    str: { match: /'(?:\\[^]|[^\\'])*?'|"(?:\\[^]|[^\\"])*?"/, value: s => s.slice(1, -1) },
    lbrace: { match: '{', push: 'brace'},
}

const main = {
  ...std,
}

const brace = {
  ...std,
  rbrace: { match: '}', pop: 1 },
}

const template = {
  ...std,
  tmid: { match: /}(?:\\[^]|[^\\`])*?\${/, value: s => s.slice(1, -2) },
  tend: { match: /}(?:\\[^]|[^\\`])*?`/, value: s => s.slice(1, -1), pop: 1 },
}

const lexer = moo.states({
  $all: { err: moo.error },
  main, brace, template,
})

nathan commented 5 years ago

is there a good reason not to use JavaScript's object spread syntax?

I'm not sure it's a good reason, but a reason is that this implements adding rather than replacing rules. Compare:

moo.states({
  base: {op: ['*', '/', '+', '-']},
  mod: {include: 'base', op: ['%']},
})

where mod matches *, /, +, -, and % as op, with:

const base = {op: ['*', '/', '+', '-']}
const mod = {...base, op: ['%']}
moo.states({base, mod})

where mod only matches % as op. I was originally doing this in my JS-ish lexer above to distinguish /-op from /-regex, but it became too complicated for an example, so I took it out.

tjvr commented 5 years ago

this implements adding rather than replacing rules

I'm sold.

I'm still not sure if this is the right interface, as discussed previously. But I'd like to merge this and play around with it for a bit.