no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Add type transform #85

Closed tjvr closed 5 years ago

tjvr commented 6 years ago

We recently added a value transform.

This PR adds a new `type` transform:

The existing value transform takes the text and returns the value. By default, the text is used unchanged.

The new type transform takes the text and returns the type. By default, the type of the rule is used (e.g. identifier).

Example: case-insensitive keywords

This is my preferred solution for #67 / #78.

For example, you can create a customised version of moo.keywords which matches case-insensitively:

const caseInsensitiveKeywords = map => {
  const transform = moo.keywords(map)
  return text => transform(text.toLowerCase())
}

let lexer = moo.compile({
  identifier: {
    match: /[a-zA-Z]+/,
    type: caseInsensitiveKeywords({
      keyword: ['class', 'def'],
    }),
  },
})
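To see how the pieces compose, here is a minimal self-contained sketch. Note that `keywords` below is a stand-in for the built-in `moo.keywords` (which returns a transform mapping exact keyword text to its type, or `undefined` so the rule's own type is used), not moo's actual implementation:

```javascript
// Stand-in for moo.keywords: builds a transform that maps exact
// keyword text to its type, or returns undefined otherwise.
const keywords = map => {
  const byText = new Map()
  for (const [type, words] of Object.entries(map)) {
    for (const word of words) byText.set(word, type)
  }
  return text => byText.get(text)
}

// The wrapper from the example: lowercase the lexed text before lookup.
const caseInsensitiveKeywords = map => {
  const transform = keywords(map)
  return text => transform(text.toLowerCase())
}

const toType = caseInsensitiveKeywords({ keyword: ['class', 'def'] })

console.log(toType('CLASS'))  // 'keyword'
console.log(toType('ClAsS'))  // 'keyword'
console.log(toType('foo'))    // undefined -> rule type 'identifier' is used
```

Because every variant is lowercased before the lookup, the keyword list only needs the lowercase spellings.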

Lexer#has()

This unfortunately makes it impossible to write a Lexer#has function, since we can't infer what token names might be returned by this custom function.

This will make Moo incompatible with the current version of Nearley: we introduced has() so that we could tell whether %foo refers to a custom token matcher such as foo = { test: x => Number.isInteger(x) }, or a lexer token. But custom token matchers will likely be removed [from Nearley] going forward, so has() will have no use.

tjvr commented 5 years ago

I've rewritten this on top of the latest master.

Lexer#has() will now always return true, so most Nearley grammars should continue to work.

houghtonap commented 3 years ago

A few notes about the example and case insensitivity.

  1. I believe the example's `keyword: ['class', 'def']` should be `keyword: ['CLASS', 'DEF']`, since the purpose of `caseInsensitiveKeywords` is to lower-case the keywords given.
  2. The example does not demonstrate case insensitivity. As far as I can determine, the example demonstrates that the keyword could be either Upper case or Lower case which is a subset of case insensitivity. For example, a case insensitive match would match CLASS, class, ClAsS, cLaSs, etc.
  3. The above example and moo.keywords seem a roundabout way of achieving case insensitivity or other possibilities. Perusing moo.js, a simpler solution would be to allow an Array that mixes strings and regular expressions in keywordTransform. Currently, only strings are allowed in the keyword array; otherwise an error is thrown. However, if regular expressions were allowed in addition to strings, you could do:
    let lexer = compile({
      identifier: {
        match: [ /[Cc][Ll][Aa][Ss][Ss]/, /[Dd][Ee][Ff]/, 'lambda' /* I really only want this one as lower case */ ],
        type: v => v.toLocaleUpperCase(),
      },
    })

    When the array contains only strings, proceed with the existing transform code that builds a switch statement; otherwise, convert the strings in the array to regular expressions (quoting meta characters) and build a single matchable regular expression in place of the switch statement. The function returned from keywordTransform would then match the found token against the built regular expression, e.g. token.match(rePossibilities). I suspect there will be a performance threshold between executing the switch statement and executing the regular expression match, which may be something else to consider in keywordTransform.
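A hedged sketch of that proposal (hypothetical helper names; this is not moo's API): string entries are escaped, regex entries are inlined, and everything is combined into one anchored alternation that the returned function tests the token text against:

```javascript
// Escape regex metacharacters in plain-string keywords.
const escapeRe = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Hypothetical keywordTransform variant accepting strings or RegExps.
const mixedKeywordTransform = map => {
  const entries = Object.entries(map).map(([type, items]) => {
    const parts = items.map(item =>
      item instanceof RegExp ? item.source : escapeRe(item))
    // Anchor the alternation so the whole token must match.
    return { type, re: new RegExp('^(?:' + parts.join('|') + ')$') }
  })
  return text => {
    for (const { type, re } of entries) {
      if (re.test(text)) return type
    }
    // undefined: fall back to the rule's own type
  }
}

const keywordType = mixedKeywordTransform({
  keyword: [/[Cc][Ll][Aa][Ss][Ss]/, /[Dd][Ee][Ff]/, 'lambda'],
})

console.log(keywordType('ClAsS'))   // 'keyword'
console.log(keywordType('lambda'))  // 'keyword'
console.log(keywordType('Lambda'))  // undefined (only lowercase wanted)
```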

For (1,2) above, perhaps I misunderstood the example in this issue, feel free to enlighten me.

tjvr commented 3 years ago

I think you misunderstood the example; (1) and (2) don't sound right to me.

caseInsensitiveKeywords uses the keywords ['class', 'def'] passed in to build a regular (lowercase-keyed) keyword map using the built-in moo.keywords() function.

It then returns a closure which calls toLowerCase() on the value -- the token text that was lexed -- before passing it to the transform returned by moo.keywords().
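That is why point (2) above doesn't hold: lowercasing normalises every casing variant to the same key before the lookup, so a lowercase keyword list is sufficient. A quick illustration:

```javascript
// Every casing variant normalises to the same lowercase key,
// so a map keyed on 'class' matches all of them.
const variants = ['CLASS', 'class', 'ClAsS', 'cLaSs']
const normalised = variants.map(v => v.toLowerCase())
console.log(normalised)  // ['class', 'class', 'class', 'class']
```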