Closed P-EB closed 10 months ago
Can you provide a few examples? I don't really understand the format, and how it's "contained in json". The example (?) you provided just looks very invalid.
@MegaIng it looks invalid because it's an excerpt, but the right-hand side is exactly what I gave. GitHub filtered out some backslashes, so I have now escaped the excerpt.
Here is a json :-)
{
  "event": "imap_command_finished",
  "hostname": "host.example.com",
  "start_time": "2023-11-23T10:59:53.033463Z",
  "end_time": "2023-11-23T10:59:53.033673Z",
  "categories": [
    "imap",
    "service:imap"
  ],
  "fields": {
    "user": "dovecot_user_login",
    "local_ip": "XXXXXXX",
    "local_port": 993,
    "remote_ip": "YYYYYYY",
    "remote_port": 10904,
    "session": "T2w7vM8KmCrZRrUC",
    "duration": 121,
    "cmd_tag": "41",
    "cmd_name": "ID",
    "cmd_input_name": "ID",
    "cmd_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")",
    "cmd_human_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")",
    "tagged_reply_state": "OK",
    "tagged_reply": "OK ID completed.",
    "last_run_time": "2023-11-23T10:59:53.033432Z",
    "running_usecs": 93,
    "lock_wait_usecs": 0,
    "bytes_in": 42,
    "bytes_out": 67,
    "reason_code": [
      "imap:cmd_id"
    ]
  }
}
The content is in cmd_args or cmd_human_args. It follows this RFC.
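To make concrete what cmd_args looks like once the JSON is decoded, here is a minimal sketch using the stdlib json module. The event is trimmed down to the relevant keys from the example above; the variable names are only illustrative.

```python
import json

# Trimmed-down version of the event above; a raw string keeps the JSON
# backslash escapes exactly as they appear on the wire.
raw_event = r'''
{
  "event": "imap_command_finished",
  "fields": {
    "cmd_name": "ID",
    "cmd_args": "(\"name\" \"Thunderbird\" \"version\" \"115.4.1\")"
  }
}
'''

event = json.loads(raw_event)
cmd_args = event["fields"]["cmd_args"]

# The JSON-level escaping is gone after decoding.
print(cmd_args)  # ("name" "Thunderbird" "version" "115.4.1")
```

Any backslash that remains after this step belongs to the IMAP quoted string itself, not to the JSON encoding.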
From a short glance, it doesn't really look like Lark is necessary here. You could probably get this done with just a regexp.
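- You should just use the stdlib json parser for the entire json. You will then get a string of the form ("name" "Thunderbird" ...), i.e. without the extra outside quotes or the extra backslashes in the middle.
- You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.
- To convert an ESCAPED_STRING to an actual python string, you can just use eval. Since you know that it matches a specific regex, this isn't a dangerous use of eval.
this isn't a dangerous use of eval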
That is what https://docs.python.org/3/library/ast.html#ast.literal_eval is for.
Which also just calls eval in this situation, look at the source code.
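For reference, a minimal sketch of the ast.literal_eval route for turning a quoted token into a Python string. The helper name unquote is hypothetical, and this assumes the token has already been matched as a double-quoted string. Python's string-literal escapes are broader than what an IMAP quoted string allows, so treat it as an approximation rather than an exact decoder.

```python
import ast

def unquote(token: str) -> str:
    """Convert a double-quoted, backslash-escaped token to a plain str.

    Hypothetical helper: relies on Python's own string-literal rules.
    """
    value = ast.literal_eval(token)
    if not isinstance(value, str):
        # literal_eval accepts any literal (numbers, tuples, ...); reject those.
        raise ValueError(f"not a string literal: {token!r}")
    return value

print(unquote(r'"Blah \" blah"'))  # Blah " blah
```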
From a short glance, it doesn't really look like Lark is necessary here. You could probably get this done with just a regexp.
You're totally right, it was my first move, but I think it's more failproof to use lark, and also far more elegant. Also, theoretically, the values may contain double quotes, which wouldn't get handled properly by classic regex logic, would they?
- You should just use the stdlib json parser for the entire json. You will then get a string of the form ("name" "Thunderbird" ...), i.e. without the extra outside quotes or the extra backslashes in the middle.
Already done. The backslashes you see are only there because it was extracted from a json representation, but of course the string I give to Lark is ("name" "xxx" "version" "xxx").
- You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.
I don't see how, this doesn't work when I try to drop the escaped quotes I put.
- To convert an ESCAPED_STRING to an actual python string, you can just use eval. Since you know that it matches a specific regex, this isn't a dangerous use of eval.
Ack, thanks!
- You can then use a slightly simpler grammar than what you have right now since you don't have to deal with the backslashes on the outside.
I don't see how, this doesn't work when I try to drop the escaped quotes I put.
Right, your current tripled up backslashes are because you aren't using raw strings, missed that part.
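As a small illustration of the raw-string point (the actual grammar is not reproduced here, so this only shows the Python side): the same two characters, a backslash followed by a double quote, need three backslashes in a normal double-quoted literal but only one in a raw string.

```python
# Backslash + double quote, written as a normal string literal:
# "\\" yields one backslash, "\"" yields the quote.
normal = "\\\""

# The same two characters as a raw string literal.
raw = r'\"'

print(normal == raw)  # True
print(len(raw))       # 2
```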
I'm open to implementing something more elegant if you think there is a way.
@P-EB At the end of the day, ESCAPED_STRING is parsed as a regexp
If you use a regexp directly, you can also use capture groups to remove the double quotes, and not have to call eval, which is arguably a bit more efficient. (though that doesn't matter much)
I read you, but as stated the tuple might contain double quotes in the values (e.g. ("name" "Blah \" blah")), and while tokenization with lark seems to handle this properly, I am not aware of a way to do that efficiently with regexp. Do you have a solution I'm not aware of? I'm totally open to the idea that I don't know a feature of re.
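One possible regexp-only approach, sketched with the stdlib re module: the usual pattern for a quoted string with backslash escapes is "anything that is not a quote or a backslash, or a backslash followed by any character", and a capture group drops the surrounding quotes. The names QUOTED, PAIR_RE and parse_id are hypothetical, and this sketch does not validate the field names or reject stray text between pairs.

```python
import re

# One quoted string; the capture group holds the content without the quotes.
QUOTED = r'"((?:[^"\\]|\\.)*)"'
# A NAME VALUE pair: two quoted strings separated by whitespace.
PAIR_RE = re.compile(QUOTED + r'\s+' + QUOTED)

def parse_id(text: str) -> dict:
    """Parse '("name" "value" ...)' into a dict (illustrative sketch)."""
    text = text.strip()
    if not (text.startswith("(") and text.endswith(")")):
        raise ValueError("expected a parenthesized list")
    result = {}
    for match in PAIR_RE.finditer(text[1:-1]):
        # Unescape: replace each backslash-escaped character by the character.
        name, value = (re.sub(r'\\(.)', r'\1', g) for g in match.groups())
        result[name] = value
    return result

print(parse_id(r'("name" "Blah \" blah" "version" "115.4.1")'))
# {'name': 'Blah " blah', 'version': '115.4.1'}
```

Because the alternation consumes a backslash together with the character after it, an escaped double quote inside a value never terminates the match.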
Again, Lark does that with a regexp... Anyway, this isn't exactly an issue with Lark. Next time, such discussions are better placed at the discussions tab: https://github.com/lark-parser/lark/discussions
Sorry I thought that the question tag was designed for this purpose.
What is your question?
Hello, I am trying to make a parser to parse IMAP client ID sent to a webapp API.
The ID is contained in a json and the specific entry is of the form:
"ID": "(\"name\" \"Client name\" \"version\" \"bleh\" ...)
So, to be clear, it's a string of the form "(" (NAME VALUE)* ")", NAME being one of ("name", "version", "os", "os-version", "vendor", "support-url", "address", "date", "command", "arguments", "environment"), with the double quotes around it, and VALUE being an arbitrary string, so an ESCAPED_STRING, as it has double quotes around it.
So far what I did is:
It does work, even if the value is containing escaped double quotes.
That being said it's a bit ugly, because afterwards when I iterate on my tree, I need to remove the double quotes to work with what I parsed, and also, the value being arbitrary, it could contain double quotes which won't get properly unescaped if I drop the double quotes via sub-stringing.
Is there a way to change the "named_fields" rule to implicitly drop the double quotes in the NAME terminal, and how can I properly unescape the ESCAPED_STRING?
Thanks in advance! :)