lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.8k stars 409 forks source link

Question (possibly bug?) #41

Closed gerases closed 6 years ago

gerases commented 6 years ago

Hello Erez,

I'm having trouble coming up with the correct rule when parsing the sudo grammar. Here's my grammar:

      ?sudo_item : (alias | user_spec)*

      alias : "User_Alias"    user_alias  (":" user_alias)*
              | "Runas_Alias" runas_alias (":" runas_alias)*
              | "Host_Alias"  host_alias  (":" host_alias)*
              | "Cmnd_Alias"  cmnd_alias  (":" cmnd_alias)*

      user_alias  : ALIAS_NAME "=" user_list
      host_alias  : ALIAS_NAME "=" host_list
      runas_alias : ALIAS_NAME "=" runas_list
      cmnd_alias  : ALIAS_NAME "=" cmnd_list

      user_spec      : user_list host_list "=" cmnd_spec_list (":" host_list "=" cmnd_spec_list)*
      cmnd_spec_list : cmnd_spec ("," cmnd_spec)*
      cmnd_spec      : runas_spec? tag_spec* command
      runas_spec     : "(" runas_list? (":" runas_list)? ")"
      command        : "!"* command_name | "!"* cmnd_alias
      command_name   : COMMAND
      file_name      : /[-_.a-z0-9A-Z\/]\/]+/

      host_list  : host      ("," host)*
      user_list  : user      ("," user)*
      runas_list : user      ("," user)*
      cmnd_list  : command   ("," command)*

      host : HOST_NAME

      user :    "!"*       USER_NAME
              | "!"* "%"   GROUP_NAME
              | "!"* "#"   UID
              | "!"* "%#"  GID
              | "!"* "+"   NETGROUP_NAME
              | "!"* "%:"  NONUNIX_GROUP_NAME
              | "!"* "%:#" NONUNIX_GID
              | "!"* user_alias

      tag_spec : (tag_nopwd
                  | tag_pwd
                  | tag_noexec
                  | tag_exec
                  | tag_setenv
                  | tag_nosetenv
                  | tag_log_output
                  | tag_nolog_output) ":"

      tag_pwd          : "PASSWD"
      tag_nopwd        : "NOPASSWD"

      tag_exec         : "EXEC"
      tag_noexec       : "NOEXEC"
      tag_setenv       : "SETENV"
      tag_nosetenv     : "NOSETENV"
      tag_log_output   : "LOG_OUTPUT"
      tag_nolog_output : "NOLOG_OUTPUT"

      UID                : /[0-9]+/
      GID                : /[0-9]+/
      NONUNIX_GID        : /[0-9]+/
      USER_NAME          : /[-_.a-z0-9A-Z]+/
      GROUP_NAME         : CNAME
      NETGROUP_NAME      : CNAME
      NONUNIX_GROUP_NAME : CNAME
      ALIAS_NAME         : CNAME
      HOST_NAME          : /[-_.a-z0-9A-Z\[\]*]+/
      COMMAND            : /[^,:\n]+/

      %import common.CNAME
      %import common.WS
      %ignore /[\\\\]/
      %ignore WS

A sudo rule can look like this:

DBA ALL = (oracle) ALL, !SU

or like this:

DBA ALL = (oracle) ALL, !SU : ALL = (postgres) ALL, !SU

The grammar handles the first case fine. It has trouble parsing the second variant. It boils down to the COMMAND token, which is anything following the runas bit ((postgres), (oracle), etc) and should include anything but [:,\n]. That regex doesn't seem to work.

Also, sudo lines may have the \ line continuation at the end. Is %ignore /[\\\\]/ the correct way to handle that situation? It does seem to operate correctly.

Last question, when i specify parser='lalr', I get this:

MY INPUT: DBA ALL = (oracle) ALL ERROR:

lark.common.UnexpectedToken: Unexpected token Token(COMMAND, ' ALL = (oracle) ALL') at line 1, column 3.

Does that mean my grammar is not LALR compatible?

erezsh commented 6 years ago

That regex doesn't seem to work with LALR or with Earley?

Regarding LALR: I can see you have different terminals matching the same string. LALR relies on a lexer, which creates tokens before they reach the parser, and so every terminal has to be chosen without awareness to the structure of the grammar.

I did write a slight improvement which makes the lexer a bit more contextually aware. You can try it with parser="lalr", lexer="contextual", but I doubt it will work for your case.

The error you got happens because the token COMMAND matches everything, and it happens to take priority over some of the other terminals.

If the contextual lexer doesn't solve it, you can solve this specific error by giving it priority zero, which will make it match last:

COMMAND.0: "whatever"

However, this issue repeats with a few other terminals too. You will probably have to disentangle them.

gerases commented 6 years ago

Thank you for the quick response, Erez! I do appreciate that.

That particular error happens only when using LALR. I will try to implement your suggestions in short order. The though of LALR relying on token uniqueness for context did cross my mind, but I quickly dismissed it along with other 10 theories in my head at that moment :) I don't have much experience with parsing, but it's super interesting.

erezsh commented 6 years ago

Yes, it's super interesting, and also super confusing :)

Let me know how it works out.

gerases commented 6 years ago

Will do. Quick question, if I'm not using LALR and given my grammar above, do you know why the parser is having trouble with:

DBA ALL = (oracle) ALL, !SU : ALL = (postgres) ALL, !SU

It looks like /[^,:\n]+/ for the COMMAND token consumes the space before the ':' and that causes an issue. Modifying the rule to/[^,:\n]+(?!:)/ (negative lookahead preventing the consumption of the space) does help, but grinds the parser to a halt on even a few rules.

erezsh commented 6 years ago

Well, it sounds like a bug. But why would consuming the space be an issue?

It's probably not the regexp that grinds it to a halt, but rather the parser. I have recently found a performance issue in the Earley parser when dealing with ambiguous grammars. I have written a fix for it, but I haven't merged it to master yet, because I want to run a few more tests on it.

In the meanwhile, check-out the branch earley_fix2 (https://github.com/erezsh/lark/tree/earley_fix2) , and see if it fixes your issue.

gerases commented 6 years ago

In the meanwhile, check-out the branch earley_fix2

Awesome (🥇 ) Will try by tonight. Tomorrow morning (American morning :))

So you think %ignore /[\\\\]/ is the correct syntax to instruct the parser to ignore a backslash at the end of the line?

Also, if I say /^/ and /$ in a regex, does it mean beginning and end of the whole input string or the line being parsed?

gerases commented 6 years ago

Whereas it took the current master version almost 2 minutes to parse 10 lines, your dev branch got a 300 line file parsed in less than 3 secs. I'm hoping not much else changed. I had to modify some code in my transformer, but it seems fine otherwise. Woot woot! Thanks again.

erezsh commented 6 years ago

Glad to hear it!

%ignore /[\\\\]/ can ignore any backslash, not just at the end of the line. By default, ^ and $ refer to the whole file. But if you use latest master, you can try:

%ignore /\\\\$/m

The 'm' flag tells the regexp to match $ to newlines. I also merged the Earley optimization to master, so you shouldn't have any problem with that.

But, I must say, it's a pretty strange language, if backslashes do nothing in it. Usually it has some structural role, such as continuing the line. In which case you will have to ignore the newline as well. Perhaps like this?

%ignore /\\\\$\n/m

I must say I never tried something like this, so let me know if it doesn't work, and I'll try to figure out what to do next.

gerases commented 6 years ago

%ignore /[\\]/ can ignore any backslash, not just at the end of the line. By default, ^ and $ refer to the whole file. But if you use latest master, you can try:

%ignore /\\$/m

Excellent

The 'm' flag tells the regexp to match $ to newlines. I also merged the Earley optimization to master, so you shouldn't have any problem with that.

Cool

But, I must say, it's a pretty strange language, if backslashes do nothing in it. Usually it has some structural role, such as continuing the line. In which case you will have to ignore the newline as well. Perhaps like this?

%ignore /\\$\n/m

Indeed the backslash is a line continuation. I will try to modify and watch out for any problems.

I must say I never tried something like this, so let me know if it doesn't work, and I'll try to figure out what to do next.

Will do.

There's one small thing I discovered yesterday that I wanted to ask you about. Given this input:

DBA ALL = (oracle) ALL

the tree looks like this:

  user_list
    user        DBA
  host_list
    host        ALL
  cmnd_spec
    runas_list
      user      oracle
    cmnd_alias_name     ALL

After I apply my transformations, which are very simple, I get back a tuple that looks like this:

('user_spec', [['DBA'], ['ALL'], [['oracle'], 'ALL']])

So far so good.. If the source is two lines like so:

DBA ALL = (oracle) ALL
DBA ALL2 = (oracle2) ALL

three looks like this:

user_spec
    user_list
      user      DBA
    host_list
      host      ALL
    cmnd_spec
      runas_list
        user    oracle
      cmnd_alias_name   ALL
  user_spec
    user_list
      user      DBA
    host_list
      host      ALL2
    cmnd_spec
      runas_list
        user    oracle2
      cmnd_alias_name   ALL

or any single rule, the output comes out as an array of two tuples:

[
   ('user_spec', [['DBA'], ['ALL'], [['oracle'], 'ALL']]),
    ('user_spec', [['DBA'], ['ALL2'], [['oracle2'], 'ALL']])
]

As a result, I have to first check if the result is a tuple or an array. If it's a tuple, I create a 1-element array from the tuple.

Is that expected that a single rule might return a tuple while two will return an array? Is it something about my grammar? This can be a tough question and since it's not critical, you can ignore it :)

erezsh commented 6 years ago

Well, I assume user_spec always returns a tuple as it should. But you match it in ?sudo_item, and the ? sign literally means: If there is only one item, return it instead of appending it as a branch.

Does that answer your question?

gerases commented 6 years ago

yep, it certainly does answer my question.

thank you!

On Nov 16, 2017, at 1:03 PM, Erez Shinan notifications@github.com wrote:

Well, I assume user_spec always returns a tuple as it should. But you match it in ?sudo_item, and the ? sign literally means: If there is only one item, return it instead of appending it as a branch.

Does that answer your question?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

gerases commented 6 years ago

With the default parser, is there a way to see which line of input caused the problem?

gerases commented 6 years ago

I get the error below when trying to ignore the backslash with %ignore /[\\\\]$/m

lark.common.GrammarError: Rules aren't allowed inside tokens (m in __IGNORE_0)

erezsh commented 6 years ago

You should use the latest master branch for now, instead of the latest pypi release. I plan to make a release soon, once I see all the recent breaking changes are stable.

erezsh commented 6 years ago

Regarding line and column for the default parser, I haven't implemented it yet, but I know it's a crucial feature. Feel free to open a new issue for it if you like, and I'll try to get to it soon.

gerases commented 6 years ago

Last question: is there a built in function to read the grammar from a file?

erezsh commented 6 years ago

Any reason why the default open(filename).read() isn't good enough?

gerases commented 6 years ago

No reason really. Just thought there might be a convenience function. All good. Thanks ago. I think we can close this ticket. Thanks again!

erezsh commented 6 years ago

Well, not sure if it makes any difference to convenience, but you can pass file objects to Lark, so something like Lark(open(filename)) will work just as well.

Thanks for the bug reports and suggestions!

gerases commented 6 years ago

Yep, that's quite good enough.