How to use negative lookahead to give precedence to closing brackets

no-context / moo

Optimised tokenizer/lexer generator! 🐄 Uses /y for performance. Moo.

BSD 3-Clause "New" or "Revised" License


Issue #52

Closed techpines closed 7 years ago

techpines commented 7 years ago

Order of rules seems broken. I'm using node@6.10.2 and moo@0.3.2.

Specifically, the example in the documentation fails for me:

moo.compile({
    word:  /[a-z]+/,
    foo:   'foo',
}).reset('foo').next() // -> { type: 'word', value: 'foo' }

moo.compile({
    foo:   'foo',
    word:  /[a-z]+/,
}).reset('foo').next() // -> { type: 'foo', value: 'foo' }

I am getting type: foo for both of these.

deltaidea commented 7 years ago

The docs are wrong; the actual result is correct. See the Keywords section of the readme. When a regexp covers a keyword, it's treated as a special case where the keyword always wins. See the link for reasoning.

The ordering does work for other combinations though. Try this:

moo.compile({
    identifier:  /[a-z0-9]+/,
    number:  /[0-9]+/,
}).reset('42').next() // -> { type: 'identifier', value: '42' }

moo.compile({
    number:  /[0-9]+/,
    identifier:  /[a-z0-9]+/,
}).reset('42').next() // -> { type: 'number', value: '42' }

tjvr commented 7 years ago

@deltaidea is right (as usual!). The readme needs updating; it was true when written :-)

techpines commented 7 years ago

Is there a way to give a literal precedence?

moo.compile({
    fortytwo:  /42/,
    identifier:  /[a-z0-9]+/,
}).reset('42abc42')
// -> would like to get ['42', 'abc', '42'] 
// -> instead of ['42', 'abc42']

Or some other technique to solve problems like this, thanks.

nathan commented 7 years ago

> Is there a way to give a literal precedence?

There's always /(?:(?!42)[a-z0-9])+/, but that's probably not the best way. It sort of depends on your use case.
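
To see what the lookahead changes, here is a minimal sketch in plain JavaScript. It does not use moo: `tokenize` is a hypothetical helper that tries each rule in order with the sticky flag, which only approximates how a real lexer picks a rule.

```javascript
// Hypothetical helper (not moo's API): at each offset, try the rules in
// order and take the first one that matches there.
function tokenize(rules, input) {
  const tokens = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = false;
    for (const [type, pattern] of rules) {
      const re = new RegExp(pattern.source, 'y'); // sticky: match only at pos
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m && m[0].length > 0) {
        tokens.push({ type, value: m[0] });
        pos += m[0].length;
        matched = true;
        break;
      }
    }
    if (!matched) throw new Error('no rule matches at offset ' + pos);
  }
  return tokens;
}

// Greedy identifier: swallows the trailing 42.
const greedy = tokenize([
  ['fortytwo', /42/],
  ['identifier', /[a-z0-9]+/],
], '42abc42').map(t => t.value);

// Identifier that refuses to step onto a position where "42" starts.
const guarded = tokenize([
  ['fortytwo', /42/],
  ['identifier', /(?:(?!42)[a-z0-9])+/],
], '42abc42').map(t => t.value);

console.log(greedy);  // [ '42', 'abc42' ]
console.log(guarded); // [ '42', 'abc', '42' ]
```

The lookahead is checked before every character, so the identifier stops as soon as the literal could begin.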

techpines commented 7 years ago

I am using nearley to parse a markup language, and I am trying to use moo to tokenize, both for speed and to help avoid potential grammar ambiguities. The main feature of the markup I'm trying to parse is nested commands delimited by {{ and }}. So based on your example, I'm doing this and it works!

/(?:(?!\{\{|\}\})[^])+/

But is there a reason why negative lookahead regex like this would be bad?
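
For illustration, a small sketch of how that text pattern behaves next to the two delimiters. It uses a plain global RegExp rather than moo, and `splitCommands` is a hypothetical name; the alternation order (delimiters first, text last) is an assumption about how the rules would be arranged.

```javascript
// Sketch only: plain RegExp, not moo. Delimiters are tried first, then the
// text pattern from the comment above.
const delimiterOrText = /\{\{|\}\}|(?:(?!\{\{|\}\})[^])+/g;

// Hypothetical helper: split the input into delimiter and text chunks.
function splitCommands(input) {
  return input.match(delimiterOrText) || [];
}

const nested = splitCommands('a {{b {{c}} }} d');
const single = splitCommands('x{y}z');

console.log(nested); // [ 'a ', '{{', 'b ', '{{', 'c', '}}', ' ', '}}', ' d' ]
console.log(single); // [ 'x{y}z' ]
```

Single braces stay inside the text chunk; only a doubled brace ends it.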

deltaidea commented 7 years ago

Another solution is to lex single curly brackets as separate tokens and then deal with combinations in the grammar:

command -> %lcurly %lcurly (expression | %rcurly expression):* %rcurly %rcurly

This rule allows {{ ... }} to contain } if it's followed by something else (not a second }).

Check out this example in the playground:

command -> "{" "{" (expression | "}" expression):* "}" "}" {% ([, , expressions]) => [].concat(...expressions) %}
expression -> "a" {% id %}

You can also do the same with RegExp if you'd like to keep the whole thing a single token:

{{          // open
(           // as many as you wish of...
    [^}]    // either not "}"
    |       // or
    } [^}]  // "}" followed by not "}"
)*
}}          // close

Both of these solutions solve the ambiguity of {{}}}}}} by saying that the first }} closes the command.
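
As a rough check, the annotated pattern above can be written as a JavaScript regex literal. This is a sketch: `command` and `firstCommand` are hypothetical names, and the braces are escaped for safety.

```javascript
// "{{", then any run of: a non-"}" character, or "}" followed by a
// non-"}" character; then the closing "}}".
const command = /\{\{(?:[^}]|\}[^}])*\}\}/;

// Hypothetical helper: return the first command found, or null.
function firstCommand(input) {
  const m = command.exec(input);
  return m ? m[0] : null;
}

const ambiguous = firstCommand('{{}}}}}}');
const loneBrace = firstCommand('{{a}b}} tail');
const nestedCmd = firstCommand('{{a {{b}} c}}');

console.log(ambiguous); // '{{}}'
console.log(loneBrace); // '{{a}b}}'
console.log(nestedCmd); // '{{a {{b}}'
```

The last call shows the limit of the single-token approach: the pattern stops at the first }} and does not track nesting, so nested commands would still need the grammar.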

tjvr commented 7 years ago

What is your use case? :-)

Lexers usually try to adhere to the principle of longest match (return the longest token possible), since that's usually what you want.

techpines commented 7 years ago

I'm trying to parse a couple million snippets of a markup language that has nested commands marked by {{ and }}. Honestly, negative lookahead regexes seem to be working really well for my use case, and the speed seems good. So I really appreciate everyone's help!

@tjvr I would say, that you've built a really good lexer for Node, and it's unfortunate that users might be turned off because some doc examples are inaccurate. I would be more than happy to submit a PR for some of the doc examples if you need ;)

The other main doc thing that was bugging me is in the keywords section, where you seem to have some syntax errors (unless there are new ES6/7 features I don't know about):

moo.compile({
    ['lparen',  '('],
    ['rparen',  ')'],
    ['keyword', ['while', 'if', 'else', 'moo', 'cows']],
})

tjvr commented 7 years ago

> I would be more than happy to submit a PR for some of the doc examples if you need ;)

That would be lovely, yes please!

> you seem to have some syntax errors

I'm not sure I see what the problem is? :-)

nathan commented 7 years ago

> I'm not sure I see what the problem is? :-)

moo.compile({
            ^
  ['lparen',  '('],
  ['rparen',  ')'],
  ['keyword', ['while', 'if', 'else', 'moo', 'cows']],
})
^

tjvr commented 7 years ago

...that took me a while. Thanks, @nathan :-)

tjvr commented 7 years ago

Readme updated. Thanks everyone :-)