lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Improve IMAP ID parser #1372

Closed P-EB closed 10 months ago

P-EB commented 10 months ago

What is your question?

Hello, I am trying to make a parser to parse IMAP client ID sent to a webapp API.

The ID is contained in a json and the specific entry is of the form:

"ID": "(\"name\" \"Client name\" \"version\" \"bleh\" ...)

So, to be clear, it's a string of the form "(" (NAME VALUE)* ")". NAME is one of ("name", "version", "os", "os-version", "vendor", "support-url", "address", "date", "command", "arguments", "environment"), including the surrounding double quotes, and VALUE is an arbitrary string, so an ESCAPED_STRING, since it also has double quotes around it.

So far what I did is:

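import lark

# Field names allowed in the ID parameter list.
rfc_id_fields = ("name", "version", "os", "os-version", "vendor",
                 "support-url", "address", "date", "command", "arguments",
                 "environment")
# Builds the alternation "\"name\""|"\"version\""|... for the NAME terminal.
joined_fields = '\\\""|"\\\"'.join(rfc_id_fields)

grammar = f"""
start: "(" named_fields* ")"

named_fields: NAME ESCAPED_STRING
NAME: "\\\"{joined_fields}\\\""

%import common.ESCAPED_STRING
// Not in the original post: Lark does not skip whitespace unless told to,
// and the fields are separated by spaces.
%import common.WS
%ignore WS
"""

imap_id_parser = lark.Lark(grammar, parser='lalr')

# id_string is the decoded value, e.g. ("name" "Thunderbird" "version" "115.4.1")
parsed_id_string = imap_id_parser.parse(id_string)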

It does work, even if the value contains escaped double quotes.

That being said, it's a bit ugly: afterwards, when I iterate over my tree, I need to remove the double quotes to work with what I parsed. Also, since the value is arbitrary, it could contain escaped double quotes, which won't get properly unescaped if I just drop the surrounding double quotes by slicing the string.

Is there a way to change the "named_fields" rule to implicitly drop the double quotes in the NAME terminal, and how can I properly unescape the ESCAPED_STRING?

Thanks in advance! :)

MegaIng commented 10 months ago

Can you provide a few examples? I don't really understand the format, and how it's "contained in json". The example (?) you provided just looks very invalid.

P-EB commented 10 months ago

@MegaIng it looks invalid because it's an excerpt, but the right-hand side is exactly what I gave. GitHub just filtered some backslashes, so I have now escaped the excerpt.

Here is a json :-)

{
  "event": "imap_command_finished",
  "hostname": "host.example.com",
  "start_time": "2023-11-23T10:59:53.033463Z",
  "end_time": "2023-11-23T10:59:53.033673Z",
  "categories": [
    "imap",
    "service:imap"
  ],
  "fields": {
    "user": "dovecot_user_login",
    "local_ip": "XXXXXXX",
    "local_port": 993,
    "remote_ip": "YYYYYYY",
    "remote_port": 10904,
    "session": "T2w7vM8KmCrZRrUC",
    "duration": 121,
    "cmd_tag": "41",
    "cmd_name": "ID",
    "cmd_input_name": "ID",
    "cmd_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")",
    "cmd_human_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")",
    "tagged_reply_state": "OK",
    "tagged_reply": "OK ID completed.",
    "last_run_time": "2023-11-23T10:59:53.033432Z",
    "running_usecs": 93,
    "lock_wait_usecs": 0,
    "bytes_in": 42,
    "bytes_out": 67,
    "reason_code": [
      "imap:cmd_id"
    ]
  }
}

The content is in cmd_args or cmd_human_args

It follows this RFC.
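
As a minimal sketch of how the ID string comes out of such an event, the snippet below decodes a trimmed-down copy of the JSON above with the standard library. The variable names `event_json` and `id_string` are only illustrative; the point is that the backslashes belong to the JSON encoding and are gone once the document is decoded.

```python
import json

# Trimmed-down copy of the event shown above (raw string, so the
# backslashes reach the JSON decoder unchanged).
event_json = r'''
{
  "event": "imap_command_finished",
  "fields": {
    "cmd_name": "ID",
    "cmd_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")"
  }
}
'''

event = json.loads(event_json)
id_string = event["fields"]["cmd_args"]

# The JSON-level escaping is removed by the decoder.
print(id_string)  # ("name" "Thunderbird" "version" "115.4.1")
```

Any backslash that is still present in `id_string` after this step is part of the IMAP-level quoting, not of the JSON.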

erezsh commented 10 months ago

From a short glance, it doesn't really look like Lark is necessary here. You could probably get this done with just a regexp.

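MegaIng commented 10 months ago

  • You should just use the stdlib json parser for the entire json. You will then get a string of the form ("name" "Thunderbird" ...) i.e. without the extra outside quotes or the extra backslashes in the middle.
  • You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.
  • to convert an ESCAPED_STRING to an actual python string, you can just use eval. Since you know that it matches a specific regex, this isn't a dangerous use of eval.

erezsh commented 10 months ago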

this isn't a dangerous use of eval

That is what https://docs.python.org/3/library/ast.html#ast.literal_eval is for.

MegaIng commented 10 months ago

That is what https://docs.python.org/3/library/ast.html#ast.literal_eval is for.

Which also just calls eval in this situation, look at the source code.
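
For the unescaping step discussed here, a minimal sketch using `ast.literal_eval` on the text of a token that matched ESCAPED_STRING. The helper name `unquote` is hypothetical, and the sketch assumes the token text is a single double-quoted string. Python's string-literal rules interpret more escapes than just `\"` and `\\` (for example `\n`), so this is only appropriate if that matches the semantics you want.

```python
import ast


def unquote(token_text: str) -> str:
    """Turn the text of a double-quoted token into a plain Python string."""
    # Guard: only accept text that looks like "..." before evaluating it.
    if len(token_text) < 2 or token_text[0] != '"' or token_text[-1] != '"':
        raise ValueError(f"not a double-quoted string: {token_text!r}")
    value = ast.literal_eval(token_text)
    if not isinstance(value, str):
        raise ValueError(f"not a string literal: {token_text!r}")
    return value


print(unquote('"Thunderbird"'))     # Thunderbird
print(unquote('"Blah \\" blah"'))   # Blah " blah
```

With Lark, the text of a token is the token itself, since `Token` is a `str` subclass, so a function like this could be applied to the ESCAPED_STRING children of the tree.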

P-EB commented 10 months ago

From a short glance, it doesn't really look like Lark is necessary here. You could probably get this done with just a regexp.

You're totally right, it was my first move, but I think using Lark is more failproof and also far more elegant. Also, theoretically, the values may contain double quotes, which wouldn't be handled properly by classic regex logic, would they?

  • You should just use the stdlib json parser for the entire json. You will then get a string of the form ("name" "Thunderbird" ...) i.e. without the extra outside quotes or the extra backslashes in the middle.

Already done, the backslashes you see is just because it was extracted from a json representation, but of course the string I give to Lark is ("name" "xxx" "version" "xxx")

  • You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.

I don't see how, this doesn't work when I try to drop the escaped quotes I put.

  • to convert an ESCAPED_STRING to an actual python string, you can just use eval. Since you know that it matches a specific regex, this isn't a dangerous use of eval.

Ack, thanks!

MegaIng commented 10 months ago

  • You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.

I don't see how, this doesn't work when I try to drop the escaped quotes I put.

Right, your current tripled up backslashes are because you aren't using raw strings, missed that part.
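
The remark about raw strings can be illustrated without Lark itself. The sketch below builds the alternation for the NAME terminal twice, once with ordinary string literals as in the original post and once with raw f-strings, and checks that both produce the same grammar text. The variable names `plain` and `raw` are only illustrative.

```python
rfc_id_fields = ("name", "version", "os", "os-version", "vendor",
                 "support-url", "address", "date", "command", "arguments",
                 "environment")

# Approach from the original post: ordinary string literals, so every
# backslash meant for the grammar has to be escaped again for Python.
plain = '"\\\"' + '\\\""|"\\\"'.join(rfc_id_fields) + '\\\""'

# Same grammar text built from raw f-strings: a backslash stays a backslash.
raw = "|".join(rf'"\"{field}\""' for field in rfc_id_fields)

assert plain == raw
print(raw.split("|")[0])  # "\"name\""
```

Both variants still hand Lark string literals that contain escaped double quotes; the raw form only removes the extra layer of Python escaping.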

P-EB commented 10 months ago

  • You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.

I don't see how, this doesn't work when I try to drop the escaped quotes I put.

Right, your current tripled up backslashes are because you aren't using raw strings, missed that part.

I'm open to implement something more elegant if you think there is.

erezsh commented 10 months ago

@P-EB At the end of the day, ESCAPED_STRING is parsed as a regexp

If you use a regexp directly, you can also use capture groups to remove the double quotes, and not have to call eval, which is arguably a bit more efficient. (though that doesn't matter much)

P-EB commented 10 months ago

@P-EB At the end of the day, ESCAPED_STRING is parsed as a regexp

If you use a regexp directly, you can also use capture groups to remove the double quotes, and not have to call eval, which is arguably a bit more efficient. (though that doesn't matter much)

I read you, but as stated the tuple might contain double quotes on the values (eg: ("name" "Blah \" blah")) and while tokenization with lark seems to handle this properly, I am not aware of a way to do that efficiently with regexp. Do you have a solution I'm not aware of? I'm totally open to the idea that I don't know a feature of re.

erezsh commented 10 months ago

Again, Lark does that with a regexp... Anyway, this isn't exactly an issue with Lark. Next time, such discussions are better placed at the discussions tab: https://github.com/lark-parser/lark/discussions
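
A minimal sketch of the regexp route with capture groups, for values that may contain escaped double quotes. This is not Lark's own ESCAPED_STRING pattern; it is a commonly used equivalent, and the names `QUOTED`, `WHOLE`, `unescape` and `parse_id_args` are hypothetical. It assumes a backslash escapes exactly the next character and does not restrict the field names to the RFC list.

```python
import re

# One quoted string: opening quote, then any mix of ordinary characters
# and backslash-escaped characters, then the closing quote.
# The capture group holds the contents without the surrounding quotes.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

# The whole argument: parentheses around whitespace-separated quoted strings.
WHOLE = re.compile(r'\(\s*(?:"(?:[^"\\]|\\.)*"\s*)*\)')


def unescape(value: str) -> str:
    # Drop the backslash in front of each escaped character (\" -> ").
    return re.sub(r'\\(.)', r'\1', value)


def parse_id_args(text: str) -> dict[str, str]:
    text = text.strip()
    if WHOLE.fullmatch(text) is None:
        raise ValueError(f"not a parenthesized list of quoted strings: {text!r}")
    strings = [unescape(s) for s in QUOTED.findall(text)]
    if len(strings) % 2:
        raise ValueError("expected NAME VALUE pairs, got an odd number of strings")
    # Pair up names (even positions) with values (odd positions).
    return dict(zip(strings[0::2], strings[1::2]))


print(parse_id_args('("name" "Blah \\" blah" "version" "115.4.1")'))
# {'name': 'Blah " blah', 'version': '115.4.1'}
```

The alternation `[^"\\]|\\.` is what lets the pattern step over an escaped quote instead of stopping at it, which is the case raised in the comment above.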

P-EB commented 10 months ago

Sorry, I thought that the question tag was designed for this purpose.