Escaping in option value breaks parsing

Gladskih commented 4 years ago

When I try to parse a rule with an escaped semicolon, it fails.

alert tcp $HOME_NET any -> $EXTERNAL_NET any (content:"qwerty\;";)

lark.exceptions.UnexpectedCharacters: No terminal defined for '"' at line 128 col 55
E_NET any -> $EXTERNAL_NET any (content:"qwerty\;";)
                                        ^
Expecting: {'LITERAL', 'BANG', '__ANON_8'}

About escaping in docs:

The characters ; and " have special meaning in the Suricata rule language and must be escaped when used in a rule option value. For example: msg:"Message with semicolon\;"; As a consequence, you must also escape the backslash, as it functions as an escape character.

But maybe my content value example is malformed, as section 6.7.1 (content) of the docs also states:

There are characters you can not use in the content because they are already important in the signature. For matching on these characters you should use the heximal notation. These are:

"    |22|
;    |3B|
:    |3A|
|    |7C|

BTW, the rule is accepted by Suricata itself.

theY4Kman commented 3 years ago

Mmm, yeah, it appears the original string regex attempted to allow escaped double-quotes, semicolons, and backslashes using a negative lookahead: (?!\\)\\[;\\"]. This pattern tried to match \", \;, and \\, without matching \\; (an escaped backslash before a contraband character). But the negative lookahead is the wrong thing: the pattern (?!\\)\\ means "match any \ character which isn't a \ character", so it can never match :P

What was needed was a negative lookbehind: (?<!\\)\\. This means "match any \ character not preceded by a \ character".
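
The difference between the two assertions can be checked with Python's standard re module. This is only a minimal sketch: the two patterns are copied from the comment above, while the variable names and the sample value are chosen for illustration and are not taken from the library.

```python
import re

# Negative lookahead: (?!\\) requires that the next character is NOT a
# backslash, and the following \\ requires that it IS one, so this
# pattern can never match anything.
lookahead = re.compile(r'(?!\\)\\[;\\"]')

# Negative lookbehind: match a backslash that is not itself preceded by
# a backslash, followed by one of ; \ "
lookbehind = re.compile(r'(?<!\\)\\[;\\"]')

# The option value from the failing rule (illustrative variable name).
sample = r'qwerty\;'

lookahead_hits = lookahead.findall(sample)
lookbehind_hits = lookbehind.findall(sample)

print(lookahead_hits)   # []
print(lookbehind_hits)  # ['\\;']
```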

I'll have this fixed up shortly, and include the colon, as well.

Also, in reading the docs, I'm realizing hex notation is not interpreted at all, and even though this fixed regex picks up escaped characters, no interpretation is being performed on them, either. Which is to say, the resulting strings from the parser do not match the actual content the rule source describes; they merely reflect the literal characters written in the rule. I'm not sure how you or anyone else uses this library, but I would like to offer at least the option to retrieve the actual content as a Python str/bytes, e.g. Setting('|00|butt').parsed == '\x00butt' or something.
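
To make the idea of "actual content" concrete, here is a rough sketch of such an interpretation step. The function decode_content is hypothetical and is not part of parsuricata. It assumes that |..| blocks hold hex byte pairs, that a backslash escapes the single character after it, and that every other character is encoded as UTF-8.

```python
import re

# One token per alternative: a |..| hex block, a backslash escape,
# or any other single character.
_TOKEN = re.compile(r'\|([0-9A-Fa-f\s]*)\||\\(.)|(.)', re.DOTALL)


def decode_content(raw: str) -> bytes:
    """Hypothetical helper: turn the text between the quotes into bytes."""
    out = bytearray()
    for match in _TOKEN.finditer(raw):
        hex_block, escaped, plain = match.groups()
        if hex_block is not None:
            # bytes.fromhex ignores whitespace between byte pairs.
            out += bytes.fromhex(hex_block)
        elif escaped is not None:
            # Drop the backslash and keep the escaped character.
            out += escaped.encode('utf-8')
        else:
            out += plain.encode('utf-8')
    return bytes(out)


print(decode_content('|00|butt'))   # b'\x00butt'
print(decode_content(r'qwerty\;'))  # b'qwerty;'
```

Returning bytes rather than str avoids having to pick a text encoding for arbitrary hex values, which is one reason a bytes result may be the safer default.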

theY4Kman commented 3 years ago

I thought I would have this out shortly, but I'm fuckin something up with the regex. Fortunately, I just discovered Lark provides an escaped string terminal :P Now it ought to be out shortly.
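
The thread does not name the terminal, though Lark's common grammar does ship one called ESCAPED_STRING. As a rough illustration of the same idea without lookarounds, the sketch below uses a common consuming idiom with Python's re module. It is not Lark's definition, and the names QUOTED and option are made up for this example.

```python
import re

# A quoted string: an opening quote, then any run of ordinary characters
# or backslash-plus-any-character pairs, then the closing quote.
# Because each backslash is consumed together with the character it
# escapes, no lookahead or lookbehind is needed.
QUOTED = re.compile(r'"(?:[^"\\]|\\.)*"')

# The option from the failing rule (illustrative variable name).
option = r'content:"qwerty\;";'

match = QUOTED.search(option)
value = match.group(0)
print(value)  # "qwerty\;"
```

Since backslashes are always consumed in pairs with the next character, an escaped backslash just before the closing quote, as in "a\\", still ends the string at the right place.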

theY4Kman commented 3 years ago

Okie dokes, fixed in #7, and uploaded to PyPI as version 0.2.3

theY4Kman / parsuricata

Escaping in option value breaks parsing #3