Two thoughts on parsing (string matching and contains modifier)

newodahs commented 2 years ago

Hey there,

While I was fixing for https://github.com/markuskont/go-sigma-rule-engine/pull/5, I took a deeper look at two items:

The sigma spec states that values are treated as case-insensitive and wildcards may be used (which is supported by looking for the * glob character), so it made me ask the question if the library should actually do strict case-insensitive comparisons by default and rely on wildcards for non-regex variations in strings?

I put a branch (https://github.com/newodahs/go-sigma-rule-engine/commit/7a20901da45d375c52c428ca77df53280cea0e68) that does this and wanted to see what you thought. I think going down this path also might change this logic a bit more where modifiers take the higher priority, then we fall back to detecting glob and old-style regex (~not done as of writing, just a bit of reshuffling needed~ completed in https://github.com/newodahs/go-sigma-rule-engine/commit/f22eb7eb915c66ff1f63f5e6fe4f946ce92d57bc and https://github.com/newodahs/go-sigma-rule-engine/commit/3c15a52a14e655b1df9078b77aebbade787f6250).

This change is also in line with the spec'd support for the modifier contains where it should be the rule with * on either side of it.

I think there is a potential unintended behavior with the current glob library (I put a comment to this the above branch as well) - basically it doesn't respect escaping wildcards, which while subtle, could cause more false positives than desired. I did find glob.go (github.com/gobwas/glob) which seems to support escaping and is MIT licensed; I may try replacing the existing one glob with it to see if it works (I haven't really dug deep into this either, so...).

EDIT: I did prototype out the gobwas/glob library (not pushed yet), which does seem to work fairly well as near as I can tell. It wouldn't fully replace the other version of glob used in extractWildcardIdents as I don't think the new library supports that use case very well.

EDIT 2: I finished the gobwas/glob work (here: https://github.com/newodahs/go-sigma-rule-engine/commit/d6609a0fa442336fdb68f350ee5328f787cae043 -- based on value-rework branch); still in a branch on my fork, I think the modifications make sense though.

markuskont commented 2 years ago

Hi. Yes, I agree that case insensitive default makes sense. I think I've actually ran into rule regressions due to how some log cases were changed in filebeat.
If I remember correctly I chose the glob library because it was very minimal and did exactly what I would have done myself. I don't mind using another one if it works better.

Thanks a lot! I will try to find time over weekend to look into your work.

newodahs commented 2 years ago

No problem; happy to help a project that is helpful to me!

With all of the above, I did spend some time last night merging all of my changes noted above into my fork's master (https://github.com/newodahs/go-sigma-rule-engine/compare/v0.2.4-alpha...newodahs:master should give you the right compare), so it may be easier to see the final results there. Note, if you're looking at my fork, I tagged 0.2.4-alpha as just my re modifier work checkpoint while 0.2.4-beta/master has everything together.

The one other thing of note is 'Keywords' - I think I noted it in comments or a commit, basically I made it so Selections default was strict case-insensitive, however, due to how keywords work, I think it makes sense to carve out a special case for them to treat them as 'contains' always, unless they're setup as an old style regex.

I looked through a smattering of Sigma rules with Keywords being used and most are either setup for strict compare OR glob comparison; however, there were a handful that were one or two words that I couldn't see them being a strict message comparison by themselves (no modifiers). With that, I made a special TextPatternKeyword specifier that basically, if no modifiers are found and that specifier is attached we:

Look for the old style regex (starts and ends with /
Treat it as a contains modifier (insert * at the start and end of string) and glob it

So if not the first, do the second for keywords - we'd never strict case-insensitive compare them; I think this still leaves some room for false positives, but I'm not sure how common that case would be.

markuskont commented 2 years ago

Yes, I noticed your comment about Keywords before reading this even. In our use case I believe vast majority of rules were still selections, which we had to transform anyway to match our event structure.

A popular use-case for keywords, if I remember correctly, was match this list of known bad commands. Such as commandline in sysmon, full command in snoopy, audit logs, etc. In those cases we'd need to do more generic matching I think to also deal with possible red team bypass methods. I don't think case-insensitive matching would really cause much FP here. But it would help against bypass methods and software changes.

However, this makes me think actually about whitespace handling. A single extra whitespace in malicious command could totally negate detection rule. I wonder if we could improve the handling here to match any number of whitespaces between CLI flag and argument for example. What do you think?

newodahs commented 2 years ago

For whitespace, we could try to collapse all whitespace into a single WS per each string:

collapseWS := regexp.MustCompile(`\s+`)
input = collapseWS.ReplaceAllString(input, " ")

You could leave this up to the caller too (make them responsible for normalizing the data they provide into the library).

That said, there is something appealing to handling this transparently for the caller as well, so exploring the thought further... We have two sides to consider here:

Input from the caller
The sigma rule itself

I think we could safely collapse the caller's input in many cases and not impact much; where we run into potential trouble is if a sigma rule is written expecting uneven or extra white space.

I think the simple rule set is (for a piece of data hitting a given rule):

If the sigma rule is a regex, don't collapse anything (always bounce data off of regex directly); regex can be complex in their whitespace handling/expectations
If the sigma rule is a glob or direct compare, we could collapse the whitespace in both the rule and the data for best chance at matching as both sides are collapsed equally (we can collapse rules when we read them in and the caller's data in the StringMatcher, JIT style).

Maybe make this an option too for the caller? A config flag that turns on or off the automatic collapsing of whitespace?

See https://github.com/newodahs/go-sigma-rule-engine/commit/c2b3b1a0e4f54a1dcf1c867aa1215b4bebc44c9d for a POC of collapsing whitespace I threw together in a branch; seems to work OK

markuskont commented 2 years ago

Yep, I agree with your points.

Explicit regexp patterns should not do that. There's already a way to express whitespace there.
making it optional is a must for sure, as regexp replacement will have a performance overhead

I think the PoC is good. I realize there's a lot of repeat code with that, but I cant think of a better way to do it with current interface approach.

newodahs commented 2 years ago

I'll probably take another swing at cleanup today/tomorrow to make it a bit more production ready then move to master and open a pull request.

newodahs commented 2 years ago

Another pass at collapsing Whitespace, different branch; refactored to remove the global and pass around the config similar to some of the other options; probably needs a close review, especially the changes to detection (which may break JSON parsing, though I'm not sure if that's a real thing)?

https://github.com/newodahs/go-sigma-rule-engine/commit/c617f21d98947088cde7b64f89a2f83136ce4f90

markuskont commented 2 years ago

Hmm, I'm not really a big fan of this approach. The Rule object was meant to be a data model to reflect sigma rule structure. So adding our own fields there and hacking around json / yaml decode feels hacky. Not worried about breaking JSON decode if YAML can handle this, but the idea was to document our understanding of Sigma rule structure in code. So new struct would make it confusing.

In fact, since the WS collapse is our own thing and not part of Sigma rule specification, then I don't think this is the correct place to pass the parameter. Rather, it should be a parameter that can be passed from Config struct to NewTree, possibly via intermediate field in RuleHandle. That's because NewRuleList simply decodes YAML rules into a slice of structs while NewTree is actually the constructor to create our ruleset. Ultimately we need the whitespace parameter when parsing the rule tokens into actual matcher list. We don't really need it already when parsing the rule.

This makes me think that maybe I made too many public constructors as at the time I was unsure how the constructor should look like and I wanted to be able to call them with granularity in benchmarking and testing commands. Perhaps I should refactor that and make many of the intermediate constructors private. What are your thoughts? I don't know what constructors you actually find useful? Should YAML decode and rule parsing be separate logical steps in application, or should we handle both under the hood? We should open a separate issue on that.

Another separate issue I think we should open is restructuring the project as a whole. Currently it's structured like a CLI application in go-project-layout standard, with cmd CLI section and pkg library section. Yet it was always meant to be used as library, and not provide a CLI tool. Looking at some of the code I wrote into cmd, I can understand why you took this approach. But I think my examples put focus into the wrong place and that can make developments like this more complicated than needs to be. Moving all of the library code into the root and making examples folder for CLI guidelines would be much better I think. Thought that would be a breaking change for anyone who already imports this project. This needs a bit of a discussion.

newodahs commented 2 years ago

From my perspective, the only actual constructor I use today is NewRuleset, though honestly I've kicked around the idea of an alternate version of NewRuleList that instead of reading the files for me, takes a list of byte[] that is the yaml as read from somewhere. I considered doing my own unmarshalling/processing and calling NewTree from my application, but that left me with app code that knew a bit too much about the inner working of the library.

I have a use case where I'd rather store some of the sigma rule data in another data source (e.g. database); so a clean interface that just Unmarshalls and error handles what I give it would be idea (let me deal with getting the rule data for the library).

So I guess from my perspective I'm somewhere in between the two options? I think the library handling the yaml decode and parsing under the hood is ideal with the dividing line being letting the calling application handle reading/getting the data to us (we can leave the option to let the library read from file to not break backward compatibility).

As for the rest, let me take a swing at pulling it out of Detection and passing it around as suggested.

As for the project restructure, I think you're right; I've never been a huge fan of the go-project-layout standard generally (I can see where it has some advantages), though as you point out it'd be a pretty nasty breaking change for anyone consuming what is there today. I need to think on this more...

markuskont commented 2 years ago

Thank you for the input and help. Really appreciate it!

I opened separate issues to the two comments. Perhaps the issue is not too many constructors but simply we need better organization. And I need add documentation to explain original meaning behind them. Really like the idea of building rules from non-yaml sources though as well. Many people have databases for that sort of thing.

newodahs commented 2 years ago

No problem, I fell a bit behind this weekend on some personal items so once I get my head back above water I'll post up my progress on the rest and add some comments to the new issues above.

newodahs commented 2 years ago

OK, updated the noglobal branch (https://github.com/markuskont/go-sigma-rule-engine/compare/master...newodahs:ws-collapse-noglobal)

Basically, the path now is Config -> RuleHandle -> Parser for the option; at the end of the day we need to know as we're setting up our Keyword and Selection structs if we need to both collapse the rule setting them up and collapse the data as it comes through (or not). This also keeps it out of the Sigma rule structure definition and leads to (IMHO) a better interface for testing as well.

Let me know, if it looks good, I'll open a pull request.

markuskont commented 2 years ago

The config param also needs to be plugged into NewRuleList constructor on line 68 of rule.go. Coupled with a corresponding argument on line 44. Otherwise looks good to me.

Thanks!

newodahs commented 2 years ago

Thanks for the heads up on that; after all those changes I managed to forget to expose the setting via config... should be all set now and pull request opened.

I have a couple of other ideas I'm kicking around too; I'm going to try some POC code and I'll open other discussions with you as those come together more substantially.

Let me know if there is anything else I can help with!

markuskont commented 2 years ago

Awesome. Merged. Thanks for the contribution!

Looking forward to proposals. Also, let's keep discussing branched out issues on project and constructor reorganization in new issues. And I will be sure to ping whenever I get new ideas.

Meanwhile, I will close this issue, as it became a wider discussion and it's difficult to keep track of all points. If anything was left dangling, please open a dedicated issue and we'll take it from there.

Cheers

markuskont / go-sigma-rule-engine

Two thoughts on parsing (string matching and contains modifier) #6