guilhemmarchand / trackme

TrackMe - Data tracking system for Splunk admins
Apache License 2.0
49 stars 11 forks source link

TrackMe bug report - The long term tracker can impact some data sources unexpectly in some specific conditions #364

Closed bdcrandall closed 2 years ago

bdcrandall commented 3 years ago

Describe the bug The dispatch.latest_time in the long term tracker searches in savedsearches.conf is incorrect. The FAQ says it should be +4h for both searches but it is set to -4h by default.

Splunk version and deployment:

Additional context The way this manifests in my environment is that I have a bunch of data sources that flip red at 1 minute past the hour and then go back green at 5 minutes past the hour. It doesn't happen every hour but it happens enough that email alerting is basically worthless.

guilhemmarchand commented 3 years ago

Hi @bdcrandall !

Thanks for opening this issue.

Right, so my bad these sentences in the FAQ are outdated, I indeed see it too:

For reference:

Short term trackers run every 5 minutes, earliest=-4h latest=+4h
Long term trackers run once per hour, earliest=-7d latest=+4h

This is not correct, but the app itself is correct, the short term runs at:

dispatch.earliest_time = -4h
dispatch.latest_time = +4h

The long term runs at:

dispatch.earliest_time = -24h
dispatch.latest_time = -4h

This to do such that the long term does not look at sources which are to be handled by the short term, which in addition does:

| where data_last_time_seen<relative_time(now(), "-4h-5m")\

The full code for context:

[TrackMe - hosts availability long term tracker]
cron_schedule = 1 * * * *
description = This scheduled report tracks and updates the data source availability KVstore based lookup
dispatch.earliest_time = -24h
dispatch.latest_time = -4h
enableSched = 1
request.ui_dispatch_app = trackme
request.ui_dispatch_view = trackme
dispatch.ttl = 900 # 15m ttl for this artefact
search  = | savedsearch "TrackMe - Data hosts abstract root tracker"\
\
`comment("#### collects latest collection state into the summary index ####")`\
| `trackme_collect_state("current_state_tracking:data_host", "data_host")`\
\
| where data_last_time_seen<relative_time(now(), "-4h-5m")\
`comment("#### output flipping change status if changes ####")`\
| `trackme_get_flip(data_host_state, data_previous_host_state, data_host, trackme_audit_flip_temp_data_host)`\
| `trackme_outputlookup(trackme_host_monitoring, key)`\
| stats c

So, in therory, the long term should not be handling a data source that is handled by the short term at all (because it is in the scope of the short term only and excluded de facto)

I've not been warmed about any issue on this side (and there's a lot of active deployment) nor in the several PROD deployment I am seing right now, so if there's a thing there must an explanation I'll be happy to get with you.

Can we:

index="trackme_summary" source="current_state_tracking:data_source"

You will note that there is a field named search_name which corresponds to the name of the tracker, so you can identify if a same data source was handled by both trackers and what were the statuses rendered via the summary data at this point in time.

Technically, if a data source is reported by the long term, this should be because its data was out of the window of the short term.

I would suspect this might have to see with some TZ related settings, given that the relative time commands in SPL are linked to the user context too, but that's just an idea potentially.

First let's get facts and we can start from there

guilhemmarchand commented 3 years ago

To complete, the field "data_last_time_seen" is infact the latest event found in the data source, from the _time point of view:

max(_time) as data_last_time_seen

(performed at the very beginning of the search in the tstats).

Then within the long term search, it excludes any data source which has data that was seen in the scope of the short term search.

It seems to me to be consistent so likely more an issue on your side I do believe, potentially you could increase this value to get more value, I mean updating the long term search and increase:

| where data_last_time_seen<relative_time(now(), "-4h-5m")\

Something like having 15 minutes more margin, things like that

bdcrandall commented 3 years ago

Unfortunately I can't upload a screenshot but searching yesterday (before I changed the latest time parameter to +4 hours) I have 112 data_names that are searched by both the short term and long term searches. Here's the search I ran to find that count:

index="trackme_summary" source="current_state_tracking:data_source" | stats dc(search_name) as searches by data_name | search searches > 1 | stats count

I definitely have time zone issues with a bunch of my feeds but not that many.

guilhemmarchand commented 3 years ago

@bdcrandall

So you've changed the latest time from the long term search like the following?:

dispatch.earliest_time = -24h
dispatch.latest_time = +4h
bdcrandall commented 3 years ago

Correct, I made that change yesterday morning. I haven't noticed any excess status flipping since then, only 5 email alerts and all highlighted legitimate issues.

guilhemmarchand commented 3 years ago

Thanks @bdcrandall

I am reviewing this carefully, this might require some additional safeties to avoid any collision between the two, or allow indeed as you did the long term to handle any available data in the time range.

Do you have data appearing as seen in the future? the reason I ask is that the long term is using the _time (so the event time stamp as parsed and interpreted by Splunk) which if inaccurate could lead to this use case.

In anyway, on it ;-)

guilhemmarchand commented 2 years ago

Coming back on this one @bdcrandall as I was carefully reviewing it and you are right.

In some specific cases, the long term tracker can look at some data sources which should have been handled by the short term tracker only, and in such a case this leads to impact its status as the long term does latest -4h.

There are 2 solutions:

| where ( data_last_time_seen<relative_time(now(), "-4h-5m") AND data_last_ingest<relative_time(now(), "-4h-5m") )

Instead of the current:

| where data_last_time_seen<relative_time(now(), "-4h-5m")

Will be added to 1.2.58 very shortly.