no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Special-case keywords #5

Closed tjvr closed 7 years ago

tjvr commented 7 years ago

Special-case keywords so that the longest-match principle applies, even in edge cases. This helps solve the class vs. className problem. See the updated README for more explanation.

This has a barely-measurable (<5%) impact on the benchmarks. :-)
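To make the class vs. className problem concrete: if keywords are separate lexer rules, a naive lexer can match the keyword `class` at the start of the identifier `className`. The fix this PR describes is to lex the whole identifier first, then reclassify it. A minimal sketch of that idea (not moo's actual internals; names are illustrative):

```javascript
// Lex an identifier first, then reclassify it if the full match is a
// keyword. This preserves the longest-match principle: "className"
// stays an identifier, because reclassification only happens after
// the entire identifier has been consumed.
const IDENT = /[A-Za-z_][A-Za-z0-9_]*/y; // sticky flag, as moo uses
const KEYWORDS = { class: 'KEYWORD', if: 'KEYWORD' };

function lexIdent(input, pos) {
  IDENT.lastIndex = pos;
  const m = IDENT.exec(input);
  if (!m) return null;
  const value = m[0];
  // Reclassify only after matching the full identifier.
  const type = Object.prototype.hasOwnProperty.call(KEYWORDS, value)
    ? KEYWORDS[value]
    : 'IDENTIFIER';
  return { type, value };
}

console.log(lexIdent('className', 0)); // { type: 'IDENTIFIER', value: 'className' }
console.log(lexIdent('class', 0));     // { type: 'KEYWORD', value: 'class' }
```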

tjvr commented 7 years ago

Changed my mind: explicit is better than implicit.

tjvr commented 7 years ago

@nathan I was going to change the API to look nicer (and make this stuff implicit), but now you've suggested using objects I'm no longer sure.

What do you think about this?

nathan commented 7 years ago

> now you've suggested using objects I'm no longer sure.

I'm not sure what you mean by this, since I don't see how { rule: /re/ } vs [ ['rule', /re/] ] makes a difference here.

If I were implementing this from scratch, I'd store the keywords in a Map from token value => token name, because then you're just doing one map lookup per token (which should be faster than an in test).
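That suggestion, sketched (hypothetical code, not what the PR currently does): build the Map once at compile time, then do a single `.get` per token:

```javascript
// One Map lookup per token: token value -> keyword token name.
// Built once when the lexer is compiled, not per token.
const keywords = new Map([
  ['if', 'IF'],
  ['while', 'WHILE'],
]);

function classify(value) {
  // A single get; a miss returns undefined and falls through
  // to the generic identifier name.
  return keywords.get(value) || 'NAME';
}

console.log(classify('if'));  // 'IF'
console.log(classify('foo')); // 'NAME'
```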

~It does seem like a nice feature to have: even though it's fairly trivial for users to switch on the value of an identifier token looking for keywords, it's nice to be able to switch only on name.~

EDIT: you'd probably be switching on the token value anyway, because { keyword: ['if', 'while'] } doesn't turn into two different token types and { if: 'if', while: 'while' } is annoyingly verbose. I'm not sure it's a useful feature.

You could do something slightly weird like accepting an array of tokens which match their names literally, e.g. moo.compile({ ident: /\w+/ }, { keywords: ['if', 'while'] }), which would give you either ident, if, or while tokens.

tjvr commented 7 years ago

@nathan Sorry, I'm being unclear. It seems there are lots of things going on here that should probably be split out...

The motivation for this PR is to lex keywords/identifiers correctly in edge cases (specifically, when some identifier starts with a keyword). It was triggered by #3, which for some reason I failed to reference here.

It works by applying a map of token value => token name after lexing an identifier, so users get KEYWORD instead of IDENTIFIER. It could of course use a Map instead of an object, but that seems like an implementation detail. I didn't think Maps were supported in ES5.


Before #7, I was thinking about changing the "keyword" feature in this PR to be explicit instead of implicit, and changing the API more generally to include many such helpers. Perhaps it might look something like this:

var lexer = moo.compile([
  moo.indentation(null, /^([ \t]+)/, {
    indent: 'INDENT',
    dedent: 'DEDENT',
    tabstop: 8,
  }),
  moo.ignore(/^[ \t]+/),
  moo.identifier('NAME', /([A-Za-z_][A-Za-z0-9_]*)/, {
    keywords: ['if', 'else', 'while'],
  }),
  moo.literal('OP', [
    '(', ')', '[', ']', '{', '}',
    ',', ':', '.', ';', '@', '->',
    '+=', '-=', '*=', '/=', '//=', '%=', '@=',
    '&=', '|=', '^=', '>>=', '<<=', '**=',
  ]),
  moo.ignore('COMMENT', /#.*$/),
  moo.newline('NEWLINE', /\r|\r\n|\n/),
  moo('ERRORTOKEN', /[\$?`]/),
  moo('NUMBER', /(?:0|[1-9][0-9]*)/),                // 123
  moo.string('STRING', { start: '"', end: '"', skip: /\\["\\rn]/, }),
]);

Whether your parser should switch on names or their values sounds like a separate issue that's not up to moo at all. Oh, but I suppose you're trying to tell me I shouldn't implement this at all; let the users concern themselves with making sure they avoid keyword/identifier edge cases. I like that idea :)

Sorry for rambling. I blame timezones ;-)

nathan commented 7 years ago

> It works by applying a map of token value => token name after lexing an identifier, so users get KEYWORD instead of IDENTIFIER. It could of course use a Map instead of an object, but that seems like an implementation detail. I didn't think Maps were supported in ES5.

I mentioned that implementation detail because you're currently using multiple maps: keywords is a map from original token names to (maps of token values to keyword token names); then you're doing a has check for every token, followed by two gets if the has returned true. I say: just do a get (whether .get(…) or […]) every token, because it's not any slower than a has (and probably faster than in, which checks prototypes). Essentially:

if (group in keywords) {
  group = keywords[group][value] || group
}
// becomes
group = keywords[value] || group

Using objects as string maps is a bit sketchy, but okay if you can trust users of the API not to provide things like __proto__ as keys. Maps are not quite as compatible, but less sketchy and often perform better.
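The `__proto__` hazard with plain objects, made concrete (a sketch, not moo's code):

```javascript
// Plain object as a string map: inherited keys leak through `in`,
// and the special key __proto__ cannot be stored at all.
const objMap = { if: 'KEYWORD' };
console.log('toString' in objMap);   // true -- inherited from Object.prototype
objMap['__proto__'] = 'KEYWORD';     // does NOT add a key; attempts to set the prototype
console.log(objMap['__proto__'] === 'KEYWORD'); // false

// Map: no prototype chain, so arbitrary string keys are safe.
const safeMap = new Map([['if', 'KEYWORD']]);
safeMap.set('__proto__', 'KEYWORD');
console.log(safeMap.get('__proto__')); // 'KEYWORD'
console.log(safeMap.get('toString'));  // undefined
```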

> Oh, but I suppose you're trying to tell me I shouldn't implement this at all; let the users concern themselves with making sure they avoid keyword/identifier edge cases. I like that idea :)

That seemed like a good idea to me, but I realize now that many languages wish to forbid the use of keywords as identifiers. This doesn't happen naturally if you just switch on the identifier token value, because you have to explicitly disallow keywords everywhere you want generic identifiers; e.g., your production for var name = … would have to check that name was not var.


In terms of adding extra functionality to specific rules in an object, I'd suggest what I used for the states example; i.e., accept either a RegExp-like or an object with { match: RegExp-like, ignore: true, push: 'otherState', … }
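The rule-object shape described above, sketched as a hypothetical config (the `ignore`/`push` semantics here are assumptions for illustration, not moo's settled API):

```javascript
// Each rule is either a bare RegExp, or an object carrying extra flags
// alongside the pattern.
const rules = {
  ws:     { match: /[ \t]+/, ignore: true }, // matched but not emitted
  lbrace: { match: /\{/, push: 'inBlock' },  // switch lexer state on match
  ident:  /[A-Za-z_]\w*/,                    // plain RegExp shorthand
};

// Normalising both forms to one shape keeps the lexer loop simple:
// it only ever sees { match, ...flags } objects.
function normalise(rule) {
  return rule instanceof RegExp ? { match: rule } : rule;
}

console.log(normalise(rules.ident));     // { match: /[A-Za-z_]\w*/ }
console.log(normalise(rules.ws).ignore); // true
```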

tjvr commented 7 years ago

> I say: just do a get (whether .get(…) or […]) every token, because it's not any slower than a has

Right; I tried that initially (see 0d27f8d, fb348c9), and it performed terribly (~30% slower).

I hadn't realised Maps can be faster; thanks for that. I'll try again, this time avoiding in. :)