no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Unicode support for keywords #133

Open stuchl4n3k opened 5 years ago

stuchl4n3k commented 5 years ago

Since /u is supported now, is there a convenient way to define a rule from an array of keywords with Unicode enabled? Something like:

const keywords = ['foo', 'bar'];
moo.compile({
   KEY: {
      match: keywords, 
      type: moo.keywords({KEY: keywords}), 
      unicode: true,
   },
});

As I understand it, moo.keywords in the Unicode scenario only works if the "match" is a pattern with a /u flag.

nathan commented 5 years ago

moo.keywords only works properly when you use it on a matcher that matches anything that could be a word—not just keywords. For example, this lexer doesn't work the way you seem to expect it to:

const moo = require('moo')

const KW = ['ban', 'this']
const lexer = moo.compile({
  kw: {match: KW, type: moo.keywords({kw: KW})},
  w: /[A-Za-z_][\w]*/,
  ws: / +/,
})
lexer.reset('banana ban')
lexer.next() // {type: 'kw', value: 'ban'}
lexer.next() // {type: 'w', value: 'ana'}

The normal use case for moo.keywords looks like this:

const moo = require('moo')

const KW = ['ban', 'this']
const lexer = moo.compile({
  w: {match: /[A-Za-z_][\w]*/, type: moo.keywords({kw: KW})},
  ws: / +/,
})
lexer.reset('banana ban')
lexer.next() // {type: 'w', value: 'banana'}
lexer.next() // {type: 'ws', value: ' '}
lexer.next() // {type: 'kw', value: 'ban'}
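
To make the difference between the two lexers concrete, here is a minimal sketch in plain JavaScript of the idea behind the second one: match the whole word first, then look it up in a keyword table. This is not moo's implementation; makeKeywordClassifier and tokenize are hypothetical names used only for illustration.

```javascript
// Hypothetical helper (not moo's API): build a lookup from word to token type.
function makeKeywordClassifier(keywordsByType) {
  const lookup = new Map()
  for (const [type, words] of Object.entries(keywordsByType)) {
    for (const word of words) lookup.set(word, type)
  }
  // Return the keyword type if the whole word is a keyword, else the fallback.
  return (value, fallbackType) => lookup.get(value) || fallbackType
}

function tokenize(input) {
  const classify = makeKeywordClassifier({kw: ['ban', 'this']})
  // Sticky (/y) regexes only match at lastIndex, like a lexer cursor.
  const rules = [
    {type: 'w', re: /[A-Za-z_]\w*/y, classify},
    {type: 'ws', re: / +/y},
  ]
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    let matched = false
    for (const rule of rules) {
      rule.re.lastIndex = pos
      const m = rule.re.exec(input)
      if (m === null) continue
      const value = m[0]
      // The keyword check sees the complete word, so 'banana' stays a 'w'.
      const type = rule.classify ? rule.classify(value, rule.type) : rule.type
      tokens.push({type, value})
      pos += value.length
      matched = true
      break
    }
    if (!matched) throw new Error('no rule matches at offset ' + pos)
  }
  return tokens
}

const tokens = tokenize('banana ban')
console.log(tokens)
// [ { type: 'w', value: 'banana' },
//   { type: 'ws', value: ' ' },
//   { type: 'kw', value: 'ban' } ]
```

Because the lookup runs on the full match, a keyword that is merely a prefix of a longer word never splits that word, which is what goes wrong when the keywords themselves are used as the match.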

It actually works fine with Unicode as-is:

const moo = require('moo')

const KW = ['η', 'ο', 'το', 'οι', 'τα']
const lexer = moo.compile({
  w: {match: /\p{XIDS}\p{XIDC}*/u, type: moo.keywords({kw: KW})},
  ws: {match: /\p{White_Space}+/u, lineBreaks: true},
})
lexer.reset('η ηθική')
lexer.next() // {type: 'kw', value: 'η'}
lexer.next() // {type: 'ws', value: ' '}
lexer.next() // {type: 'w', value: 'ηθική'}

We also already allow string literal and array matches to be combined with /u regular expressions, so I'm not sure what you're asking for here.
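
As a rough illustration of what combining literals with /u patterns can look like, the sketch below escapes string and array matches and joins them with /u regex sources into a single sticky Unicode regex. It is written in plain JavaScript under stated assumptions and is not moo's actual code: escapeLiteral, combine and tokenize are hypothetical names, and the rule patterns are assumed to contain no capture groups of their own.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal.
// Only syntax characters and '/' are escaped, since /u rejects escapes like '\-'.
function escapeLiteral(s) {
  return s.replace(/[\\^$.*+?()[\]{}|\/]/g, '\\$&')
}

// Build one combined regex with one capture group per rule.
// Assumption: rule regexes contain no capture groups themselves.
function combine(rules) {
  const groups = rules.map(({match}) => {
    if (match instanceof RegExp) return '(' + match.source + ')'
    const literals = [].concat(match)
      .sort((a, b) => b.length - a.length) // longest literal first
      .map(escapeLiteral)
    return '(' + literals.join('|') + ')'
  })
  return new RegExp(groups.join('|'), 'uy')
}

function tokenize(rules, input) {
  const re = combine(rules)
  const tokens = []
  let pos = 0
  while (pos < input.length) {
    re.lastIndex = pos
    const m = re.exec(input)
    if (m === null) throw new Error('invalid syntax at offset ' + pos)
    // The first defined group tells us which rule matched.
    const index = m.findIndex((group, i) => i > 0 && group !== undefined)
    tokens.push({type: rules[index - 1].type, value: m[0]})
    pos = re.lastIndex
  }
  return tokens
}

const rules = [
  {type: 'op', match: ['-', '->', '≤']},   // array of literals
  {type: 'lparen', match: '('},            // single literal
  {type: 'rparen', match: ')'},
  {type: 'word', match: /\p{XID_Start}\p{XID_Continue}*/u},
  {type: 'ws', match: /\p{White_Space}+/u},
]

const result = tokenize(rules, 'α ≤ β -> (γ)')
console.log(result.filter(t => t.type !== 'ws').map(t => t.type + ':' + t.value))
// [ 'word:α', 'op:≤', 'word:β', 'op:->', 'lparen:(', 'word:γ', 'rparen:)' ]
```

Sorting the literals longest-first keeps '->' from being split into '-' followed by an unmatched '>'.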

(Some of these changes haven't been published to npm yet [@tjvr]; maybe that's where the confusion is coming from?)

stuchl4n3k commented 5 years ago

Thanks, nathan. After seeing the first two examples, it became much clearer.

Regarding the array match combined with /u - I haven't found that in the doc nor in the tests.

nathan commented 5 years ago

> I haven't found that in the doc nor in the tests.

We should probably have a test for that. The /u tests are a bit sparse at the moment.

agorischek commented 5 years ago

When’s the next npm publish planned?

tjvr commented 5 years ago

I've published 0.5.1. 👍