lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

lark does not tokenize properly #171

Closed: supriyo-biswas closed this issue 6 years ago

supriyo-biswas commented 6 years ago

I have the following grammar:

parser = Lark(r'''
input                : word+
word                 : unquoted | lquoted | squoted | dquoted

lquoted               : /\$'[^']*'/
squoted               : /'[^']*'/
dquoted               : /"(?:[^"]+|\\.)*"/
unquoted              : /(?:[^\s()&;|<>]+|\\.)+/

%import common.WS
%ignore WS
''', start='input', parser='lalr')

When I try to parse a string like $'abcd', this is what I get:

> $'abcd'
input
  word
    unquoted    $'abcd'

However, based on these rules, shouldn't the result be an lquoted instead of an unquoted?

erezsh commented 6 years ago

What makes you think this is a bug?

>>> import re
>>> re.match('(?:[^\s()&;|<>]+|\\.)+', 'abcd')
<_sre.SRE_Match object at 0x7f52fb548850>

Both can match abcd, so Lark chooses one by length. If you prefer one over the other, use prioritized terminals.
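
For reference, terminal priorities in Lark attach to uppercase terminal definitions as a dot and a number; a minimal sketch (terminal names assumed, adapted from the grammar above):

LQUOTED.2: /\$'[^']*'/
UNQUOTED:  /(?:[^\s()&;|<>]+|\\.)+/

The higher the number, the more that terminal is preferred when several patterns match.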

supriyo-biswas commented 6 years ago

Coming from tools like lex/yacc, I wasn't expecting that behaviour.

Anyway, I tried using the following prioritized terminals:

lquoted.4             : /\$'[^']*'/
squoted.3             : /'[^']*'/
dquoted.2             : /"(?:[^"]+|\\.)*"/
unquoted.1            : /(?:[^\s()&;|<>]+|\\.)+/

Now, $'abcd' is recognized correctly, but unquoted tokens are not:

> $'abcd'
input
  word
    lquoted $'abcd'
> abcd
input
  word
    unquoted    abc
  word
    unquoted    d

erezsh commented 6 years ago

How would lex/yacc behave in this situation?

Yes, this is a bug with the Earley parser that we're currently trying to fix. I suggest using Earley with lexer=standard, or trying the LALR(1) parser (with contextual lexer), if that's possible.
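
In code, those suggestions correspond to the constructor options; a minimal sketch, with grammar standing in for the grammar text above and the option names spelled as in this thread:

parser = Lark(grammar, parser='earley', lexer='standard')
parser = Lark(grammar, parser='lalr', lexer='contextual')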

If you need full Earley support, for example for an ambiguous grammar, let me know and I'll suggest other solutions.

supriyo-biswas commented 6 years ago

How would lex/yacc behave in this situation?

It tries the rules in the order they are listed, which means $'abcd' would correctly be detected as lquoted rather than unquoted.
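
To illustrate the first-rule-wins idea in plain Python (a sketch of the behaviour described above, not how lex is actually implemented):

import re

# Ordered patterns from the grammar above; the first one that matches wins.
patterns = [
    ('LQUOTED', r"\$'[^']*'"),
    ('SQUOTED', r"'[^']*'"),
    ('DQUOTED', r'"(?:[^"]+|\\.)*"'),
    ('UNQUOTED', r'(?:[^\s()&;|<>]+|\\.)+'),
]

def first_match(text):
    for name, pattern in patterns:
        m = re.match(pattern, text)
        if m:
            return name, m.group(0)

print(first_match("$'abcd'"))  # -> ('LQUOTED', "$'abcd'")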

I suggest using Earley with lexer=standard, or trying the LALR(1) parser (with contextual lexer)

Neither parser=earley,lexer=standard nor parser=lalr,lexer=contextual works for my case; both detect lquoted tokens as unquoted. parser=earley,lexer=standard also suffers from the bug you're trying to fix.

erezsh commented 6 years ago

I tried adapting your grammar to use prioritized terminals (which are different from prioritized rules). It seems to work, so let me know if I misunderstood you.

>>> parser = Lark(r'''
... input                : word+
... word                 : unquoted | lquoted | squoted | dquoted
...
... lquoted               : LQUOTED
... squoted               : /'[^']*'/
... dquoted               : /"(?:[^"]+|\\.)*"/
... unquoted              : UNQUOTED
...
... LQUOTED.2: /\$'[^']*'/
... UNQUOTED: /(?:[^\s()&;|<>]+|\\.)+/
...
... %import common.WS
... %ignore WS
... ''', start='input', parser='lalr')

>>> print(parser.parse("$'abcd'").pretty())
input
  word
    lquoted $'abcd'

>>> print(parser.parse("abcd").pretty())
input
  word
    unquoted    abcd

erezsh commented 6 years ago

Btw, it slipped my mind to suggest it, but you can use \b to force word boundaries on Earley, to prevent words from being cut off in the middle.
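
For example, a sketch of what that might look like for the UNQUOTED terminal (the trailing \b is my addition, and it assumes the token ends in a word character):

UNQUOTED: /(?:[^\s()&;|<>]+|\\.)+\b/

The word boundary makes any match that stops in the middle of a word fail, so the lexer is forced to consume the whole word.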

EivindEklundGoogle commented 6 years ago

For context: I tried using $ on // comments (pattern /\/\/.*$/) with Earley and that did not work (the last word was still grabbed and considered a token by itself). I'd expect the same problem with \b.

I had to use /\/\/.*\n/ to actually make sure that the comments grabbed the rest of the line.

erezsh commented 6 years ago

Well, that's not surprising. $ only matches the end of file, unless you use the multiline flag.
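
Concretely, the two options would look roughly like this (a sketch; I'm assuming Lark accepts a Python regexp flag after the closing slash, and the \n form is the one used above):

COMMENT_NL: /\/\/.*\n/
COMMENT_M: /\/\/.*$/m

The first consumes the newline itself; the second lets $ match at the end of each line.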

supriyo-biswas commented 6 years ago

Prioritized terminals do work in my case, but having to separate terminals in this way is a little annoying :(

Is there a way to use Lark's parsing component only, and handle the tokenization using some custom code?

erezsh commented 6 years ago

@supriyo-biswas

I just pushed a new commit to master that makes it possible to use a custom lexer.

See this example: https://github.com/lark-parser/lark/blob/master/examples/custom_lexer.py
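
Roughly, adapting that example to the grammar in this thread might look like the sketch below (my adaptation, assuming the Lexer/Token interface from lark.lexer and %declare for terminals produced outside the grammar):

import re
from lark import Lark
from lark.lexer import Lexer, Token

# Ordered patterns: the first one that matches wins, as in lex.
PATTERNS = [
    ('LQUOTED', re.compile(r"\$'[^']*'")),
    ('SQUOTED', re.compile(r"'[^']*'")),
    ('DQUOTED', re.compile(r'"(?:[^"]+|\\.)*"')),
    ('UNQUOTED', re.compile(r'(?:[^\s()&;|<>]+|\\.)+')),
]

class ShellLexer(Lexer):
    def __init__(self, lexer_conf):
        pass  # no configuration needed for this sketch

    def lex(self, text):
        pos = 0
        while pos < len(text):
            if text[pos].isspace():
                pos += 1
                continue
            for name, pattern in PATTERNS:
                m = pattern.match(text, pos)
                if m:
                    yield Token(name, m.group(0))
                    pos = m.end()
                    break
            else:
                raise ValueError('no token matches at position %d' % pos)

parser = Lark(r'''
    input    : word+
    word     : unquoted | lquoted | squoted | dquoted

    lquoted  : LQUOTED
    squoted  : SQUOTED
    dquoted  : DQUOTED
    unquoted : UNQUOTED

    %declare LQUOTED SQUOTED DQUOTED UNQUOTED
''', start='input', parser='lalr', lexer=ShellLexer)

print(parser.parse("$'abcd' abcd").pretty())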

supriyo-biswas commented 6 years ago

Thanks!