no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.
BSD 3-Clause "New" or "Revised" License

Multiword keywords do not work #80

Closed: dselman closed this issue 6 years ago

dselman commented 6 years ago

From some preliminary testing, it looks like keywords must be simple string literals. Is this by design? Is there another (better) way to achieve this?

Sample Nearley grammar using a Moo lexer:

@{%
const moo = require("moo");

const lexer = moo.compile({
  ws:     /[ \t]+/,
  number: /[0-9]+/,
  word: /[a-z]+/,
  times:  /\*|x/,
  SPACE: {match: /\s+/, lineBreaks: true},
  IDEN: {
    match: /[a-zA-Z]+/,
    keywords: {
      notice: ['NOTICE TO']
    }
  },
});
%}

# Pass your lexer object using the @lexer option:
@lexer lexer

# Use %token to match any token of that type instead of "token":
root -> %notice %ws %IDEN
{% (data) => {console.log(data);return data;} %}

Sample input:

NOTICE TO Foo

Output:

invalid syntax at line 1 col 1:

  NOTICE
  ^
Unexpected IDEN token: "NOTICE"
 {"offset":0,"token":{"type":"IDEN","value":"NOTICE","text":"NOTICE","offset":0,"lineBreaks":0,"line":1,"col":1}}

nathan commented 6 years ago

The token regular expression has to actually match its keywords. /[a-zA-Z]+/ doesn't match NOTICE TO (since it doesn't allow whitespace).
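To make that concrete, here is a minimal sketch in plain JavaScript (no Moo involved). `patternCoversKeyword` is a hypothetical helper written for this illustration; it only checks whether a token pattern matches a keyword in full, which is the condition described above.

```javascript
// Hypothetical helper (not part of Moo): a keyword can only be recognised
// if the token's pattern matches the keyword from start to end.
function patternCoversKeyword(pattern, keyword) {
  const anchored = new RegExp("^(?:" + pattern.source + ")$", pattern.flags);
  return anchored.test(keyword);
}

const iden = /[a-zA-Z]+/;

console.log(patternCoversKeyword(iden, "NOTICE"));    // true
console.log(patternCoversKeyword(iden, "NOTICE TO")); // false: the space is not in [a-zA-Z]
```

With the IDEN pattern from the issue, the lexer stops at the space and emits NOTICE on its own, so the keyword 'NOTICE TO' is never compared against a complete token.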

Why not simply treat all words as IDEN and recognize the sequence NOTICE TO of two IDENs in your parser?
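A rough sketch of that idea in plain JavaScript, assuming tokens shaped like the `{type, value}` objects in the error output above. `isNoticeTo` and `sampleTokens` are hypothetical names for this illustration; in a real Nearley grammar the same check would be expressed as a rule over two consecutive tokens.

```javascript
// Assumed token shape: {type, value}, as in the error output quoted in this thread.
// isNoticeTo is a hypothetical helper, not part of Moo or Nearley.
function isNoticeTo(tokens) {
  // Ignore whitespace tokens, then look at the first two remaining tokens.
  const [first, second] = tokens.filter((t) => t.type !== "ws");
  return Boolean(
    first && second &&
    first.type === "IDEN" && first.value === "NOTICE" &&
    second.type === "IDEN" && second.value === "TO"
  );
}

// Hand-written tokens for the sample input "NOTICE TO Foo".
const sampleTokens = [
  {type: "IDEN", value: "NOTICE"},
  {type: "ws", value: " "},
  {type: "IDEN", value: "TO"},
  {type: "ws", value: " "},
  {type: "IDEN", value: "Foo"},
];

console.log(isNoticeTo(sampleTokens)); // true
```

The lexer stays simple this way: every word is an IDEN, and the multiword phrase is a parsing concern.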

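If you really want to do it in your lexer, you can add notice as a separate token type with {match: /NOTICE\s+TO(?![a-zA-Z])/, lineBreaks: true}. The negative lookahead (?![a-zA-Z]) keeps it from matching when TO is only the start of a longer word, and lineBreaks: true is needed because \s can match a newline.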
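The behaviour of that pattern can be tried without Moo. The sketch below is a hypothetical stand-in for a lexer, not Moo's implementation: it tries a list of sticky (/y) regexes in order at each offset. `rules` and `tokenize` are names made up for this example, and placing `notice` before `IDEN` is an assumption of the sketch.

```javascript
// Hypothetical mini-lexer: rules are tried in order at the current offset.
// `notice` comes first so it gets a chance before the generic IDEN rule.
const rules = [
  ["notice", /NOTICE\s+TO(?![a-zA-Z])/y],
  ["ws", /[ \t]+/y],
  ["IDEN", /[a-zA-Z]+/y],
];

function tokenize(input) {
  const tokens = [];
  let offset = 0;
  while (offset < input.length) {
    let matched = false;
    for (const [type, re] of rules) {
      re.lastIndex = offset; // a sticky regex matches only at lastIndex
      const m = re.exec(input);
      if (m) {
        tokens.push({type, value: m[0]});
        offset = re.lastIndex;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error("invalid syntax at offset " + offset);
  }
  return tokens;
}

console.log(tokenize("NOTICE TO Foo").map((t) => t.type).join(" ")); // notice ws IDEN
console.log(tokenize("NOTICE TOFoo").map((t) => t.type).join(" "));  // IDEN ws IDEN
```

The second input shows the lookahead at work: TOFoo is lexed as an ordinary identifier instead of being split into a keyword and a leftover Foo.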

tjvr commented 6 years ago

The token regular expression has to actually match its keywords

Ooh, we should add a warning for this :-)