Closed gerases closed 6 years ago
That regex doesn't seem to work with LALR or with Earley?
Regarding LALR: I can see you have different terminals matching the same string. LALR relies on a lexer, which creates tokens before they reach the parser, and so every terminal has to be chosen without awareness to the structure of the grammar.
I did write a slight improvement which makes the lexer a bit more contextually aware. You can try it with parser="lalr", lexer="contextual"
, but I doubt it will work for your case.
The error you got happens because the token COMMAND matches everything, and it happens to take priority over some of the other terminals.
If the contextual lexer doesn't solve it, you can solve this specific error by giving it priority zero, which will make it match last:
COMMAND.0: "whatever"
However, this issue repeats with a few other terminals too. You will probably have to disentangle them.
Thank you for the quick response, Erez! I do appreciate that.
That particular error happens only when using LALR. I will try to implement your suggestions in short order. The though of LALR relying on token uniqueness for context did cross my mind, but I quickly dismissed it along with other 10 theories in my head at that moment :) I don't have much experience with parsing, but it's super interesting.
Yes, it's super interesting, and also super confusing :)
Let me know how it works out.
Will do. Quick question, if I'm not using LALR and given my grammar above, do you know why the parser is having trouble with:
DBA ALL = (oracle) ALL, !SU : ALL = (postgres) ALL, !SU
It looks like /[^,:\n]+/
for the COMMAND token consumes the space before the ':' and that causes an issue. Modifying the rule to/[^,:\n]+(?!:)/
(negative lookahead preventing the consumption of the space) does help, but grinds the parser to a halt on even a few rules.
Well, it sounds like a bug. But why would consuming the space be an issue?
It's probably not the regexp that grinds it to a halt, but rather the parser. I have recently found a performance issue in the Earley parser when dealing with ambiguous grammars. I have written a fix for it, but I haven't merged it to master yet, because I want to run a few more tests on it.
In the meanwhile, check-out the branch earley_fix2
(https://github.com/erezsh/lark/tree/earley_fix2) , and see if it fixes your issue.
In the meanwhile, check-out the branch earley_fix2
Awesome (🥇 ) Will try by tonight. Tomorrow morning (American morning :))
So you think %ignore /[\\\\]/
is the correct syntax to instruct the parser to ignore a backslash at the end of the line?
Also, if I say /^/
and /$
in a regex, does it mean beginning and end of the whole input string or the line being parsed?
Whereas it took the current master version almost 2 minutes to parse 10 lines, your dev branch got a 300 line file parsed in less than 3 secs. I'm hoping not much else changed. I had to modify some code in my transformer, but it seems fine otherwise. Woot woot! Thanks again.
Glad to hear it!
%ignore /[\\\\]/
can ignore any backslash, not just at the end of the line. By default, ^ and $ refer to the whole file. But if you use latest master, you can try:
%ignore /\\\\$/m
The 'm' flag tells the regexp to match $ to newlines. I also merged the Earley optimization to master, so you shouldn't have any problem with that.
But, I must say, it's a pretty strange language, if backslashes do nothing in it. Usually it has some structural role, such as continuing the line. In which case you will have to ignore the newline as well. Perhaps like this?
%ignore /\\\\$\n/m
I must say I never tried something like this, so let me know if it doesn't work, and I'll try to figure out what to do next.
%ignore /[\\]/ can ignore any backslash, not just at the end of the line. By default, ^ and $ refer to the whole file. But if you use latest master, you can try:
%ignore /\\$/m
Excellent
The 'm' flag tells the regexp to match $ to newlines. I also merged the Earley optimization to master, so you shouldn't have any problem with that.
Cool
But, I must say, it's a pretty strange language, if backslashes do nothing in it. Usually it has some structural role, such as continuing the line. In which case you will have to ignore the newline as well. Perhaps like this?
%ignore /\\$\n/m
Indeed the backslash is a line continuation. I will try to modify and watch out for any problems.
I must say I never tried something like this, so let me know if it doesn't work, and I'll try to figure out what to do next.
Will do.
There's one small thing I discovered yesterday that I wanted to ask you about. Given this input:
DBA ALL = (oracle) ALL
the tree looks like this:
user_list
user DBA
host_list
host ALL
cmnd_spec
runas_list
user oracle
cmnd_alias_name ALL
After I apply my transformations, which are very simple, I get back a tuple that looks like this:
('user_spec', [['DBA'], ['ALL'], [['oracle'], 'ALL']])
So far so good.. If the source is two lines like so:
DBA ALL = (oracle) ALL
DBA ALL2 = (oracle2) ALL
three looks like this:
user_spec
user_list
user DBA
host_list
host ALL
cmnd_spec
runas_list
user oracle
cmnd_alias_name ALL
user_spec
user_list
user DBA
host_list
host ALL2
cmnd_spec
runas_list
user oracle2
cmnd_alias_name ALL
or any single rule, the output comes out as an array of two tuples:
[
('user_spec', [['DBA'], ['ALL'], [['oracle'], 'ALL']]),
('user_spec', [['DBA'], ['ALL2'], [['oracle2'], 'ALL']])
]
As a result, I have to first check if the result is a tuple or an array. If it's a tuple, I create a 1-element array from the tuple.
Is that expected that a single rule might return a tuple while two will return an array? Is it something about my grammar? This can be a tough question and since it's not critical, you can ignore it :)
Well, I assume user_spec
always returns a tuple as it should. But you match it in ?sudo_item
, and the ?
sign literally means: If there is only one item, return it instead of appending it as a branch.
Does that answer your question?
yep, it certainly does answer my question.
thank you!
On Nov 16, 2017, at 1:03 PM, Erez Shinan notifications@github.com wrote:
Well, I assume user_spec always returns a tuple as it should. But you match it in ?sudo_item, and the ? sign literally means: If there is only one item, return it instead of appending it as a branch.
Does that answer your question?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.
With the default parser, is there a way to see which line of input caused the problem?
I get the error below when trying to ignore the backslash with %ignore /[\\\\]$/m
lark.common.GrammarError: Rules aren't allowed inside tokens (m in __IGNORE_0)
You should use the latest master branch for now, instead of the latest pypi release. I plan to make a release soon, once I see all the recent breaking changes are stable.
Regarding line and column for the default parser, I haven't implemented it yet, but I know it's a crucial feature. Feel free to open a new issue for it if you like, and I'll try to get to it soon.
Last question: is there a built in function to read the grammar from a file?
Any reason why the default open(filename).read() isn't good enough?
No reason really. Just thought there might be a convenience function. All good. Thanks ago. I think we can close this ticket. Thanks again!
Well, not sure if it makes any difference to convenience, but you can pass file objects to Lark, so something like Lark(open(filename))
will work just as well.
Thanks for the bug reports and suggestions!
Yep, that's quite good enough.
Hello Erez,
I'm having trouble coming up with the correct rule when parsing the sudo grammar. Here's my grammar:
A sudo rule can look like this:
DBA ALL = (oracle) ALL, !SU
or like this:
DBA ALL = (oracle) ALL, !SU : ALL = (postgres) ALL, !SU
The grammar handles the first case fine. It has trouble parsing the second variant. It boils down to the COMMAND token, which is anything following the
runas
bit ((postgres)
,(oracle)
, etc) and should include anything but[:,\n]
. That regex doesn't seem to work.Also, sudo lines may have the
\
line continuation at the end. Is%ignore /[\\\\]/
the correct way to handle that situation? It does seem to operate correctly.Last question, when i specify
parser='lalr'
, I get this:MY INPUT:
DBA ALL = (oracle) ALL
ERROR:Does that mean my grammar is not LALR compatible?