trackme-limited / trackme-report-issues

The purpose of this repository is to allow the Splunk community to report issues and enhancement requests

TrackMe bug report/question - Entity flipping although no anomaly is obvious #673

Open sebwurl opened 1 month ago

sebwurl commented 1 month ago

With version 2.0.94 I'm experiencing more entity flipping than expected.

A DSM entity stopped sending data (thresholds are 3600/3600). (screenshot)

The delay breach at ~19:45 was detected and the entity flipped to red. At ~00:00 the entity started sending data again. The events arrive with latency, so it flips to red again; that is what I expect. What I don't understand is why it keeps flipping to red afterwards, until 04:00. (screenshot) The next screenshot just shows the flips to red by anomaly reason, to verify that it is always the latency. (screenshot)
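
For reference, a view of flips by anomaly reason like that one can be built from the flip events in the summary index. This is a minimal sketch, assuming the trackme:flip sourcetype and the anomaly_reason / object_state fields used later in this thread, with placeholder tenant and object values:

index=trackme_summary sourcetype=trackme:flip tenant_id="<my_tenant>" object_category="splk-dsm" object="<myobject>" object_state="red"
| timechart span=5m count by anomaly_reason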

The chart from the first screenshot does not show any long lasting latency. Even when I drill down into the underlying search and adjust the timespan to 1m, I don't see a reason to flip to red:

| mstats avg(trackme.splk.feeds.lag_event_sec) as lag_event_sec, avg(trackme.splk.feeds.latest_eventcount_5m) as latest_eventcount_5m, avg(trackme.splk.feeds.latest_dcount_host_5m) as latest_dcount_host_5m, avg(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m where `trackme_metrics_idx()` tenant_id="<my_tenant>" object_category="splk-dsm" object="<myobject>" by object span=5m
| timechart span=1m avg(avg_latency_5m) as avg_latency_5m, avg(latest_eventcount_5m) as latest_eventcount_5m, avg(latest_dcount_host_5m) as latest_dcount_host_5m, avg(lag_event_sec) as lag_event_sec
| fields _time, avg_latency_5m

(screenshot)

Do you have an idea why this could happen?

guilhemmarchand commented 1 month ago

Hi @sebwurl

Right, that is interesting.

The chart from the first screenshot does not show any long lasting latency. Even when I drill down the underlying search and adjust time timespan to 1m, I don't see a reason to flip to red:

| mstats avg(trackme.splk.feeds.lag_event_sec) as lag_event_sec, avg(trackme.splk.feeds.latest_eventcount_5m) as latest_eventcount_5m, avg(trackme.splk.feeds.latest_dcount_host_5m) as latest_dcount_host_5m, avg(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m where `trackme_metrics_idx()` tenant_id="<my_tenant>" object_category="splk-dsm" object="<myobject>" by object span=5m
| timechart span=1m avg(avg_latency_5m) as avg_latency_5m, avg(latest_eventcount_5m) as latest_eventcount_5m, avg(latest_dcount_host_5m) as latest_dcount_host_5m, avg(lag_event_sec) as lag_event_sec
| fields _time, avg_latency_5m

If the reason for the flipping was latency breaches, these would relate more to the latency-related KPIs than to the delay. One guess would be that while backfilling, the source is receiving a mix of real-time / quasi-real-time events and older produced events. Can you run this against that period of time:

| mstats 
avg(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m, avg(trackme.splk.feeds.latest_latency_5m) as latest_latency_5m, avg(trackme.splk.feeds.perc95_latency_5m) as perc95_latency_5m, avg(trackme.splk.feeds.stdev_latency_5m) as stdev_latency_5m,
max(trackme.splk.feeds.avg_latency_5m) as max_avg_latency_5m, max(trackme.splk.feeds.latest_latency_5m) as max_latest_latency_5m, max(trackme.splk.feeds.perc95_latency_5m) as max_perc95_latency_5m, max(trackme.splk.feeds.stdev_latency_5m) as max_stdev_latency_5m
where `trackme_metrics_idx(feeds-not-sec)` tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>" by object span=5m

| timechart span=5m first(*) as "*"

This would show the peaks of latency detected during the period.

Also, it could be interesting to check the summary event:

index=trackme_summary tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>"
| table _time object anomaly_reason object_state _raw
| sort 0 _time
guilhemmarchand commented 1 month ago

Regarding the last search:

index=trackme_summary tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>"
| table _time object anomaly_reason object_state _raw
| sort 0 _time

In fact we would look at both sourcetypes:

index=trackme_summary sourcetype=trackme:state tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>"
| table _time object anomaly_reason *state _raw
| sort 0 _time
index=trackme_summary sourcetype=trackme:flip tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>"
| table _time object anomaly_reason *state _raw
| sort 0 _time
sebwurl commented 1 month ago

Search 1:

| mstats avg(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m, avg(trackme.splk.feeds.latest_latency_5m) as latest_latency_5m, avg(trackme.splk.feeds.perc95_latency_5m) as perc95_latency_5m, avg(trackme.splk.feeds.stdev_latency_5m) as stdev_latency_5m, max(trackme.splk.feeds.avg_latency_5m) as max_avg_latency_5m, max(trackme.splk.feeds.latest_latency_5m) as max_latest_latency_5m, max(trackme.splk.feeds.perc95_latency_5m) as max_perc95_latency_5m, max(trackme.splk.feeds.stdev_latency_5m) as max_stdev_latency_5m where `trackme_metrics_idx(feeds-not-sec)` tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>" by object span=5m
| timechart span=5m first(*) as "*"

It shows only two short points in time with a high latency, the last one at 23:50. Flipping to red because of latency after that point looks suspicious. (screenshot)

Search 2, sourcetype="trackme:state", by reason: (screenshot)

It starts with a delay breach. For the initial backfill there are then two anomaly_reasons at the same time. When current data arrives again, only the latency reason remains, although the KPIs in Search 1 don't show long-term breaches. Nevertheless, the raw events of sourcetype=trackme:state include latency numbers inside the status message field that indicate a latency breach. It is questionable where these values come from.

Search 3, sourcetype="trackme:flip", showing all flips to red by reason: (screenshot)

guilhemmarchand commented 1 month ago

Thank you @sebwurl, this is being reviewed.

sebwurl commented 1 month ago

I have another suspicious example, this time about the delay. I am not sure whether this really correlates with the previous case.

An entity flipped to red 3 times during the last 12h (sourcetype=trackme:flip). (screenshot)

sourcetype=trackme:state does not show any anomaly_reason during this timerange: (screenshot)

Looking at the raw events of trackme:flip, one flip was tracked at 02/07/2024 02:56:23.807. Correlating with trackme:state, an event at 02/07/2024 02:51:22.962 had a delay of 3458s (while the threshold is 3600). The last state event before the flip, at 02/07/2024 02:56:23.319, has a delay of -2s (so the delay "recovered"). Nevertheless, a flip was tracked right afterwards.

Looking at the performance metrics, no threshold breach is visible (3600s): (screenshot)

Are flips somehow tracked independent of collected metrics and states?

guilhemmarchand commented 1 month ago

@sebwurl

Are flips somehow tracked independent of collected metrics and states?

Flipping events are correlated with metrics and states; however, the difference is that the decision maker backend calculates states in real time when it is called (either by the hybrid trackers or by the health tracker every 5 minutes), therefore it will be more sensitive to real-time variations.

For instance, regarding the delay breach, the decision maker calculates the current delay (the difference between now and the latest event seen) and compares it against the current threshold.

In short, flipping events are created by the hybrid tracker or the health tracker when the status coming from the decision maker does not equal the status in the KVstore (the decision maker says it's red while the current state is green).

It seems that in this case it gets hit because we are at the limit of, or just above, the max delay threshold.
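
As a rough illustration (a sketch only, not TrackMe's actual backend code), the real-time delay evaluation described above behaves conceptually like the following, using placeholder index/sourcetype values and the 3600s threshold from this thread:

| tstats max(_time) as data_last_time_seen where index="<myindex>" sourcetype="<mysourcetype>" by index, sourcetype
``` current delay as seen by the decision maker at the moment it is called ```
| eval current_delay=now()-data_last_time_seen
``` assumed 3600s max delay threshold; a flip event is only generated when this state differs from the state stored in the KVstore ```
| eval decision_state=if(current_delay>3600, "red", "green")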

In release 2.0.97, I have noticed some inconsistency in the events we generate for flip and state, where we should use the same convention:

https://github.com/trackme-limited/trackme-report-issues/issues/674 https://github.com/trackme-limited/trackme-report-issues/issues/675

Also, I understand you are running 2.0.94; there definitely were fixes and enhancements ported to the decision maker and other components which may be related. It would be easier if you got the environment upgraded so we can troubleshoot this more easily.

However, I am looking at what conditions could lead to generating too many flip events, although in both cases you have raised these remain linked to justified anomalous use cases.

sebwurl commented 2 days ago

Hi, I'm still experiencing this with the current v2.0.99. Notable events for an entity are generated from 14:00 to 18:00, reportedly because of latency breaches: (screenshot)

I can't find any indicator of a breach after 15:00. Looking at the max latency over time, the breaches end before 15:00. (screenshot)

Btw: the threshold overlay has its own scale; it should share the scale for a correct view.
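
One way to compare latency and threshold on a single scale, as a workaround sketch rather than a built-in TrackMe view (reusing the macro and placeholders from the earlier searches, and assuming the 3600s threshold), is to plot the threshold as a regular series:

| mstats max(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m where `trackme_metrics_idx(feeds-not-sec)` tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>" by object span=5m
| timechart span=5m max(avg_latency_5m) as max_latency_5m
``` assumed 3600s latency threshold plotted as a normal series so both lines share the same Y axis ```
| eval threshold=3600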

sebwurl commented 2 days ago

I'm also wondering about multiple notable events between 14:00 and 14:30. It flips red>green>red>green...

From the Performance Metrics tab: | mstats ... | timechart span=5m max(avg_latency_5m) as avg_latency_5m (screenshot)

-> this shows a metric only every 10 minutes, but the entity flips to green when there is no metric, for example at 14:00 and 14:10.
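
To make those gaps visible, one option is to flag the 5-minute buckets that have no metric value at all. This is a sketch reusing the same mstats source with placeholder values:

| mstats max(trackme.splk.feeds.avg_latency_5m) as avg_latency_5m where `trackme_metrics_idx(feeds-not-sec)` tenant_id="<mytenant>" object_category="splk-dsm" object="<myobject>" by object span=5m
| timechart span=5m max(avg_latency_5m) as avg_latency_5m
``` 1 when the bucket has a metric value, 0 when the bucket is empty ```
| eval has_metric=if(isnotnull(avg_latency_5m), 1, 0)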

guilhemmarchand commented 2 days ago

Hi @sebwurl

All right, let's get to the bottom of that ;-)

Questions:

(screenshot)

index=trackme_notable (tenant_id="feeds-tracking" AND object="latgen:linux_secure" AND object_category="splk-dsm") 
| table object status_message

Also, and maybe even better, we can get these from the summary state messages:

index=trackme_summary (tenant_id="feeds-tracking" AND object="latgen:linux_secure" AND object_category="splk-dsm")
| table object status_message

Let's start from the status_message, which will for sure describe the issue.

sebwurl commented 2 days ago

I combined both searches for a chronological view: (screenshot)
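
For reference, such a combined chronological view could look like this sketch (assuming the tenant and object values from the searches above):

(index=trackme_notable OR index=trackme_summary) tenant_id="feeds-tracking" object="latgen:linux_secure" object_category="splk-dsm"
| table _time index sourcetype object status_message
| sort 0 _time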

The alert is unsticky, so flipping back to green is expected when the green state is correct.

guilhemmarchand commented 2 days ago

Thank you @sebwurl, reviewing this.

guilhemmarchand commented 1 day ago

@sebwurl

To me, these show a justified detection of latency anomalies. What is confusing here is that the performance metrics might, due to calculation and aggregation, miss the peaks TrackMe detects.

What I mean is that TrackMe's trackers use tstats to get the max latency on a per-second basis, then aggregate per minute, similarly to:

| tstats max(_indextime) as data_last_ingest, min(_time) as data_first_time_seen, max(_time) as data_last_time_seen, count as data_eventcount, dc(host) as host where index=* _index_earliest="-4h" _index_latest="+4h" by _time,index,sourcetype span=1s
| eval data_last_ingestion_lag_seen=data_last_ingest-data_last_time_seen
``` intermediate calculation ```
| bucket _time span=1m
| stats avg(data_last_ingestion_lag_seen) as data_last_ingestion_lag_seen by _time,index,sourcetype
| eval spantime=data_last_ingest
| eventstats avg(data_last_ingestion_lag_seen) as latency_5m, avg(dcount_host) as dcount_host_5m by spantime,index,sourcetype
| stats avg(data_last_ingestion_lag_seen) as data_last_ingestion_lag_seen by index,sourcetype
| eval dcount_host=round(global_dcount_host, 0)
| eval data_last_ingestion_lag_seen=round(data_last_ingestion_lag_seen, 0)
| rename index as data_index, sourcetype as data_sourcetype
| eval object=data_index . ":" . data_sourcetype
``` tenant_id ```
| eval tenant_id="feeds-tracking"
``` extracted from the abstract macro ```
``` Use TrackMe avg_latency_5m metric here ```
| eval data_last_ingestion_lag_seen=if(isnum(avg_latency_5m), avg_latency_5m, data_last_ingestion_lag_seen)

I am 100% sure that if you ran this logic over the time scope and entity scope in question, you would see the latency outliers.
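
For instance, a scoped variant of the per-second part of that logic, as a sketch assuming the latgen:linux_secure entity and the 08/17/2024 14:00-19:00 window discussed in this thread, would be:

| tstats max(_indextime) as indextime where index="latgen" sourcetype="linux_secure" _index_earliest="08/17/2024:14:00:00" _index_latest="08/17/2024:19:00:00" by _time, index, sourcetype span=1s
| eval latency=indextime-_time
``` aggregate to 1-minute buckets, as the tracker does, to see whether the peaks survive the aggregation ```
| bucket _time span=1m
| stats avg(latency) as avg_latency, max(latency) as max_latency by _time
| sort 0 _time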

I am reviewing whether we can improve / review the aggregation, which might reduce these peaks once in the eventstats part.

sebwurl commented 1 day ago

I just tried to investigate based on raw events. Shouldn't all breaches show up in a search like the one below? We narrow down the relevant time range with _index_earliest & _index_latest, calculate the latency per event and then plot the max latency per index-time span into a chart. This looks like the performance metrics tab shown above. There is no indicator for the alerts that last until 18:30. I can't think of a more granular check than this one, as this is event-based using a maximum.

index=myindex sourcetype=mysourcetype _index_earliest=08/17/2024:00:00:00 _index_latest=08/17/2024:19:00:00 | eval latency=_indextime-_time | eval _time=_indextime | timechart span=5m max(latency)

(screenshot)

guilhemmarchand commented 6 hours ago

@sebwurl

This is correct, a raw search like this one is the 100% accurate search. Doesn't it show events with high latency? (From your chart it seems to be the case.)

To make things simple and when looking at latency in a raw search, I like to do:

index="latgen" sourcetype="linux_secure"
| eval latency=_indextime-_time
| eval indextime=strftime(_indextime, "%c")
| eval latency_duration=tostring(latency, "duration")

| sort 0 - latency
| table _time, indextime, latency, latency_duration, _raw

What TrackMe does via tstats is less accurate but accurate enough to capture anomalies, like:

| tstats max(_indextime) as indextime where index="latgen" sourcetype="linux_secure" by _time, index, sourcetype span=1s
| eval latency=indextime-_time
``` above is the per second table ```

| bucket _time span=1m
| stats avg(latency) as latency, max(indextime) as indextime by _time
``` above is an aggregation ```

| eval indextime=strftime(indextime, "%c")
| eval latency_duration=tostring(latency, "duration")

| sort 0 - latency
| table _time, indextime, latency, latency_duration
sebwurl commented 6 hours ago

This is correct, the raw search like this one is the 100% accurate search, doesn't it show up events with high latency? (from your chart it seems to be the case)

Yes, there are high latency events until 14:30. But TrackMe Alerts are fired until 18:30. That is what I did not understand. There was the entity flipping in between as well.

Slightly changing the final commands of your second search for visibility, it looks the same: breaches only until 14:30. (screenshot)
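
Such a change to the final commands could, for example, look like the following sketch (the exact commands used are not shown in the thread):

| tstats max(_indextime) as indextime where index="latgen" sourcetype="linux_secure" by _time, index, sourcetype span=1s
| eval latency=indextime-_time
| bucket _time span=1m
| stats avg(latency) as latency by _time
``` chart the aggregated latency instead of sorting and tabling it ```
| timechart span=5m max(latency) as max_latency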

guilhemmarchand commented 5 hours ago

One thing you can try too is inverting the _time/_indextime definition in your search:

index="latgen" sourcetype="linux_secure" | eval _time=_indextime

Then run the latency timechart against it.
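
Put together, and as a sketch that essentially mirrors the earlier raw search but keyed on index time, that would be:

index="latgen" sourcetype="linux_secure"
| eval latency=_indextime-_time
``` key the events on index time rather than event time ```
| eval _time=_indextime
| timechart span=5m max(latency) as max_latency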

I get your points; however, it's difficult to answer properly as is. One thing I can propose is to meet next week in a live meeting and review this together.