Closed supriyo-biswas closed 6 years ago
What makes you think this is a bug?
>>> import re
>>> re.match('(?:[^\s()&;|<>]+|\\.)+', 'abcd')
<_sre.SRE_Match object at 0x7f52fb548850>
Both can match abcd, so Lark chooses one by length. If you prefer one over the other, use prioritized terminals.
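The length-based choice can be sketched with plain re. This is an illustrative sketch, not Lark's internals; UNQUOTED is the pattern from the thread, while DOLLAR is a made-up terminal added only to show the length comparison.

```python
import re

# Sketch of length-based terminal choice (illustrative, not Lark's
# internals). UNQUOTED comes from the thread; DOLLAR is hypothetical.
terminals = {
    "DOLLAR": re.compile(r"\$"),
    "UNQUOTED": re.compile(r"(?:[^\s()&;|<>]+|\\.)+"),
}

def choose_by_length(text):
    matches = {n: p.match(text) for n, p in terminals.items()}
    matches = {n: m for n, m in matches.items() if m is not None}
    # Keep the terminal whose match consumes the most characters.
    return max(matches, key=lambda n: matches[n].end())

print(choose_by_length("$'abcd'"))  # UNQUOTED: it consumes 7 chars vs 1
```

Here UNQUOTED wins even though DOLLAR also matches at the start, because its match is longer; this is exactly why an unprioritized unquoted terminal can swallow input meant for another terminal.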
Coming from tools like lex/yacc, I wasn't expecting that behaviour.
Anyway, I tried using the following prioritized terminals:
lquoted.4 : /\$'[^']*'/
squoted.3 : /'[^']*'/
dquoted.2 : /"(?:[^"]+|\\.)*"/
unquoted.1 : /(?:[^\s()&;|<>]+|\\.)+/
Now, $'abcd' is recognized correctly, but unquoted tokens are not:
> $'abcd'
input
word
lquoted $'abcd'
> abcd
input
word
unquoted abc
word
unquoted d
How would lex/yacc behave in this situation?
Yes, this is a bug with the Earley parser that we're currently trying to fix. I suggest using Earley with lexer=standard, or trying the LALR(1) parser (with contextual lexer), if that's possible.
If you need full Earley support, for example for an ambiguous grammar, let me know and I'll suggest other solutions.
How would lex/yacc behave in this situation?
It prefers the longest match, breaking ties in favor of the rule listed first, which means $'abcd' would be correctly detected as lquoted and not unquoted.
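That lex-style disambiguation (maximal munch, ties broken by rule order) can be sketched in a few lines of plain Python. This is a sketch of the behavior, not flex's actual implementation:

```python
import re

# Sketch of lex-style disambiguation: take the longest match, and on
# a tie keep the rule that was written first. Both patterns below
# match "$'abcd'" with length 7, so rule order decides.
rules = [
    ("lquoted", re.compile(r"\$'[^']*'")),
    ("unquoted", re.compile(r"(?:[^\s()&;|<>]+|\\.)+")),
]

def pick(text):
    best_name, best_len = None, -1
    for name, pattern in rules:
        m = pattern.match(text)
        if m and m.end() > best_len:  # strict >: ties keep the earlier rule
            best_name, best_len = name, m.end()
    return best_name

print(pick("$'abcd'"))  # lquoted (tie on length, earlier rule wins)
print(pick("abcd"))     # unquoted (only rule that matches)
```

The strict `>` comparison is what encodes the tiebreak: a later rule only displaces an earlier one if its match is strictly longer.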
I suggest using Earley with lexer=standard, or trying the LALR(1) parser (with contextual lexer)
Neither parser=earley,lexer=standard nor parser=lalr,lexer=contextual works for my case; both detect lquoted tokens as unquoted. parser=earley,lexer=standard suffers from the bug you're trying to fix.
I tried adapting your grammar to using prioritized terminals (which are different from prioritized rules). It seems to work, so let me know if I misunderstood you.
>>> parser = Lark(r'''
... input : word+
... word : unquoted | lquoted | squoted | dquoted
...
... lquoted : LQUOTED
... squoted : /'[^']*'/
... dquoted : /"(?:[^"]+|\\.)*"/
... unquoted : UNQUOTED
...
... LQUOTED.2: /\$'[^']*'/
... UNQUOTED: /(?:[^\s()&;|<>]+|\\.)+/
...
... %import common.WS
... %ignore WS
... ''', start='input', parser='lalr')
>>> print(parser.parse("$'abcd'").pretty())
input
word
lquoted $'abcd'
>>> print(parser.parse("abcd").pretty())
input
word
unquoted abcd
Btw, it slipped my mind to suggest it, but you can use \b to force word boundaries on Earley, to prevent words being cut off in the middle.
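How \b constrains a match can be seen with plain re (a small illustration, unrelated to Lark itself): without it, a pattern can match a fragment in the middle of a word.

```python
import re

# \b anchors a pattern to word boundaries: positions where a word
# character (\w) meets a non-word character or the string edge.
plain = re.compile(r"abc")
bounded = re.compile(r"\babc\b")

print(plain.search("xabcy"))    # matches "abc" inside a longer word
print(bounded.search("xabcy"))  # None: no word boundary around "abc"
print(bounded.search("abc d"))  # matches the standalone "abc"
```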
For context: I tried using $ on // comments (pattern /\/\/.*$/) with Earley and that did not work (the last word was still grabbed and considered a token by itself). I'd expect the same problem with \b. I had to use /\/\/.*\n/ to actually make sure that the comments grabbed the rest of the line.
Well, that's not surprising. $ only matches at the end of the input, unless you use the multiline flag.
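Python's re module shows the same behavior: $ matches at the end of the string (or just before a trailing newline), and only with re.MULTILINE does it also match at the end of every line. The // comment pattern from above illustrates it:

```python
import re

# Without re.MULTILINE, $ anchors to the end of the whole string,
# so only the last comment is found; with it, $ matches at every
# line end and both comments are found.
text = "a = 1  // first\nb = 2  // second\n"

print(re.findall(r"//.*$", text))                # ['// second']
print(re.findall(r"//.*$", text, re.MULTILINE))  # ['// first', '// second']
```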
Prioritized terminals do work in my case, but having to separate terminals in this way is a little annoying :(
Is there a way to use Lark's parsing component only, and handle the tokenization using some custom code?
@supriyo-biswas
I just pushed a new commit to master that allows you to use a custom lexer.
See this example: https://github.com/lark-parser/lark/blob/master/examples/custom_lexer.py
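The shape of a hand-rolled lexer you could adapt to that interface looks roughly like the following. This is a standalone sketch: the rule names are illustrative, the DQUOTED pattern is adjusted for the sketch, and a real Lark custom lexer must yield lark.Token objects as shown in the linked example.

```python
import re

# Standalone tokenizer sketch. Rules are tried in order at each
# position, so LQUOTED wins over UNQUOTED on "$'abcd'". A real Lark
# custom lexer wraps this loop and yields lark.Token objects.
RULES = [
    ("WS", re.compile(r"\s+")),
    ("LQUOTED", re.compile(r"\$'[^']*'")),
    ("SQUOTED", re.compile(r"'[^']*'")),
    ("DQUOTED", re.compile(r'"(?:[^"\\]+|\\.)*"')),
    ("UNQUOTED", re.compile(r"(?:[^\s()&;|<>]+|\\.)+")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            if m:
                if name != "WS":  # skip whitespace tokens
                    yield name, m.group(0)
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at position {pos}")

print(list(tokenize("$'abcd' abcd")))
# [('LQUOTED', "$'abcd'"), ('UNQUOTED', 'abcd')]
```

Because the rules are tried in a fixed order, there is no need for priority annotations: lquoted-style input can never be swallowed by the unquoted rule.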
Thanks!
I have the following grammar:
When I try to parse a string like $'abcd', this is what I get:
However, on the basis of the rules, shouldn't the result be an lquoted instead of an unquoted?