Bugfix compound query_key and mixed mappings

jmacdone commented 9 months ago

Description

Fixes to allow top_count_keys to still function when using a list of query_key values.

Checklist

[x] I have reviewed the contributing guidelines.
[x] I have included unit tests for my changes or additions.
[x] I have successfully run make test-docker with my changes.
[x] I have manually tested all relevant modes of the change in this PR.
[ ] I have updated the documentation.
[x] I have updated the changelog.

Questions or Comments

tl;dr fc583f77fb6c3eb134ddd3164c908441c3086898 and e0795fe556a8b9b72aaf305237f8f5c72823ab84 are bug fixes. 9d1f291e24c9ef8d5ae246a281eae5a06cd770f2 is a guess.

I didn't attempt unit tests as a. get_hits_terms looks strongly coupled with elasticsearch results and b. there wasn't an existing get_hits_terms unit test (that I could find)

Confirmed through manual testing that a single query_key still behaves as before and the list of query_key values now populates topevents* as expected.

I have low confidence 9d1f291e24c9ef8d5ae246a281eae5a06cd770f2 is the correct way to handle raw_count_keys=false but it's working (finally) as expected, when dealing with a mix of text terms and keyword, ip, int, etc. terms. As it stands now, with this PR, a rule with this config works as I expect.

query_key:
  - serialnumber.keyword
  - src_computer.keyword

top_count_keys:
  - src_displayname.keyword
  - src_computer.keyword
  - src_uid    # mapping type: keyword
  - src        # mapping of type: ip

raw_count_keys: false

Manual testing done on a python 3.11 image (similar to the Dockerfile) with vscode debugger checkpoints around the changes

"dev.containers.source": "https://github.com/devcontainers/images",
"dev.containers.variant": "3.11-bookworm"

make test-docker

py311: OK (585.32=setup[491.51]+cmd[91.95,1.86] seconds) docs: OK (492.05=setup[476.75]+cmd[15.30] seconds) congratulations :) (1077.43 seconds)

nsano-rururu commented 9 months ago

I forgot to update CHANGELOG.md

jertel commented 9 months ago

Does this related at all to #982 ?

Also some unit test coverage are needed for these changes, even if there hadn't been coverage previously. The new .keyword logic should be a straightforward unit test since there are already lookup tests in util_test.py. For the get_hits_terms changes if you have to split out the logic into its own helper function so that you can isolate the unit testing to just that logic, that is fine, and preferable anyway.

Regarding @nsano-rururu's comment I think he meant "You forgot to update the changelog".

jmacdone commented 8 months ago

A status update: I've been grinding on this for a while. While trying to write unit tests the current PR, (it feels like) I keep finding things like https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/ruletypes.py#L948-L951 and https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/elastalert.py#L1326 that assume query_key will be a string representing a single fieldname, but https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/loaders.py#L416 will create a comma-joined string of fieldnames (multiple).

So, I'm scratching my head as to what would be the right fix. I'm leaning towards adding guards to check for compound_query_key before using query_key for those two examples (and any others found).

But, I'm also considering "should query_key just always be normalized to a list?" and then remove the "hidden" compound_query_key created as a side-effect of loading rule options and chase after all the broken that'd create. Probably way more work, but this somehow feels "more correct" at the moment.

jmacdone commented 8 months ago

Does this related at all to #982 ?

They could very well be related. One of the symptoms of this PR's bug is "man, why does top_events never seem to work for me?!"

jertel commented 8 months ago

From what I recall query_key is intentionally not supposed to be a list, since Elasticsearch's aggregations will respond using the comma delimited string as the bucket key for a compound query key, and we need to repeatedly fetch that bucket using the comma delimited form. I believe this is the case, it's been a while since I dove into that area.

Now that I'm looking more closely at these code changes it appears your new split(", ") will not work correctly in some cases. The rule loader uses join(",") which will create that compound string without spaces. When I first reviewed this I was thinking split(delim) treated any character found in delim as the delimiter string, but that's not the case. Instead it treats the full delim string as the delimiter. A safer approach would be to check if qk.contains(",") to determine if this is a compound key or not on line 471 in elastalert.py. Then later where qk_list[i] is used, add a .strip() to that reference to remove any extra whitespace for instances when the query_key value is specified with spaces in the comma delimited format.

That said, I'm now wondering what scenario it was where you found whitespace characters in the query_key string value in the first place, which led to your change toward split(", "). Were you manually specifying a "query_key: key1, key2" in the rule YAML file, instead of using the list format?

jmacdone commented 8 months ago

what scenario it was where you found whitespace characters in the query_key string value in the first place

I've been wondering the same. After more debugger time, I see the get_hits() method calls the process_hits() method which adds rules options (like rule['query_key']) to the _source dict returned from elastic.

https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/elastalert.py#L337

remove any extra whitespace for instances when the query_key value is specified with spaces in the comma delimited format

It looks it ~used to~ tried to work that way ~5 years ago https://github.com/jertel/elastalert2/commit/76ab593aeb57355d7084314804d7e89eb2449157

jmacdone commented 8 months ago

~I'm not sure how to how find the blame for a deleted line, but I'm curious as to when and why this one was removed~ Ah, this was meant to handle the whitespace stripping, but didn't, because it never used the qk_tmp value https://github.com/jertel/elastalert2/blob/76ab593aeb57355d7084314804d7e89eb2449157/elastalert/elastalert.py#L442-L443

and because the qk_tmp was never used, it was stripped in https://github.com/jertel/elastalert2/commit/8abff23afcd75234d2d4c95f99cc630cc57b1c31

leaving us with the old commaspace-split replaced with a comma-split. i.e. https://github.com/jertel/elastalert2/commit/76ab593aeb57355d7084314804d7e89eb2449157 introduced buggy behavior.

jertel commented 8 months ago

Isn't this line https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/elastalert.py#L337 comma-space joining the values of the returned query keys? It looks like it is setting that new string as the value for the query_key key, rather than the key itself.

jmacdone commented 8 months ago

Confusing, right? Sometimes qk is a key, and sometimes holds the value of a key. In one of my example hits, https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/elastalert.py#L337 results in

hit['_source']['serialnumber.keyword,src_computer.keyword'] = 'NXHFPAA001234567890, LAPTOP-M16RFTYS'

and when it comes time to alert, the alert calls get_hits_terms(..., qk='NXHFPAA001234567890, LAPTOP-M16RFTYS', ...) which can't be split properly by https://github.com/jertel/elastalert2/blob/0716fd3211a5400397d4bd2ab02b71bc0eb4d501/elastalert/elastalert.py#L466-L469

And this is in the context of a cardinality rule configured with

query_key: 
  - serialnumber.keyword
  - src_computer.keyword
raw_count_keys: false

jertel commented 8 months ago

Yes, that qk variable needs to be renamed qk_value to eliminate confusion. Then there's the issue with a pre-joined value itself containing a comma, and later the code thinks it's splitting apart the query key into their respective values, but now the extra embedded commas are going to trip up that split. It could use a better design. Perhaps a quick fix is to join those values with a more explicit delimiter, such as __EA2_DELIM__ and then the join and split statements can share a single constant var. It would help avoid future maintainers and show how those two join and split statements are connected.

jertel commented 8 months ago

Thinking more on this... The drawback to using that explicit delimiter is that is any other code or existing users relying on that field are now going to see that ugly delimiter in their alert message.

If we leave it to the comma delimited approach we could at least do a sanity check after the split and see if the length of the split value array matches the compound_query_key array. If it doesn't then at least log an error for visibility, or even append a message to the value that the top counts couldn't be accurately determined.

But this isn't even the problem you're trying to solve so apologies for muddying up the issue.

Back to the original code review, there may still be some benefit leaving the split as-is and using a .strip() on the qk_list[i] reference.

jmacdone commented 8 months ago

added a requirement for >=py3.9 (it seems the project wants to be on 3.12 regardless)
rebased from master
used re.split() to handle variations of comma and commaspace
added the suggested waring if query_key values just-so-happen to include commas
snark block comments
re-tested with test-docker

I left the commits "organic". But if the project prefers consolidated commits, I can do that.

jertel commented 8 months ago

I'm not seeing any new commits.

jmacdone commented 8 months ago

:man_facepalming: git did what I asked rather than what I wanted. Better now?

jertel / elastalert2