influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Only 1 OK event is fired after reloading a task that forces N active events to OK #2061

Open sbengo opened 6 years ago

sbengo commented 6 years ago

Hi,

We have been using Kapacitor to generate alerts based on simple metric thresholds, in the following environment:

OS: RHEL 7.4
Kapacitor: Kapacitor OSS 1.5.0 (git: HEAD 4f10efc41b4dcac070495cf95ba2c41cfcc2aa3a)

Overview

We have some TICKscripts that fire N events, one per group handled by the alert node, so each of the N events can change its own state independently based on the thresholds.

The problem appears when we change the TICKscript and reload the task in a way that forces all N events to OK.

Actual behaviour

After reloading the task with new thresholds that force the N events to OK, only 1 OK event is fired. The other N-1 events seem to be 'lost': they are considered OK, but no OK event is fired for them.

Expected behaviour

After reloading the task with new thresholds that force the N events to OK, an OK event is fired for each of the N events.

Detailed case

To help you reproduce the case, I have written a TICKscript and a brief table with the actions and the events fired:

TICKSCRIPT

var ID = 'ticks_cpu'
var FIELD = 'usage-idle'
var FIELD_DEFAULT = 0.0
var TH_CRIT_DEF = 0.0
var TH_WARN_DEF = 0.0
var TH_INFO_DEF = 0.0

// TICKSCRIPT:
// ================
// var data = stream
stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        .groupBy(*)
    |default()
        .field(FIELD, FIELD_DEFAULT)
    |eval()
        .keep(FIELD)
    |window()
        .period(1m)
        .every(10s)
        .align()
    |mean(FIELD)
        .as('value')
    |alert()
        .crit(lambda: float("value") < TH_CRIT_DEF)
        .warn(lambda: float("value") < TH_WARN_DEF)
        .info(lambda: float("value") < TH_INFO_DEF)
        .id(ID)
        .log('/tmp/test-cpu.log')
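
The report does not state which threshold values were used to force each state in steps 3 and 4 of the table below. As a rough sketch only, assuming the averaged value is a percentage between 0 and 100, the two edits could look like this (the specific numbers are illustrative assumptions, not taken from the original setup):

```
// Step 3 (assumed values): force CRIT on every series.
// Any percentage is below 101.0, so the crit condition always matches.
var TH_CRIT_DEF = 101.0
var TH_WARN_DEF = 0.0
var TH_INFO_DEF = 0.0

// Step 4 (assumed values): force OK on every series.
// No non-negative value is below 0.0, so no level matches and the alert returns to OK.
// var TH_CRIT_DEF = 0.0
// var TH_WARN_DEF = 0.0
// var TH_INFO_DEF = 0.0
```

Only one set of declarations is active at a time; the rest of the script stays unchanged between reloads.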

Actions and results

The following table shows the actions and the resulting events.

As shown, after forcing an OK on N events that are already CRIT, only a single OK event is fired.

| Step | Action | #Cores | #Actual Events | Expected result | Example |
|------|--------|--------|----------------|-----------------|---------|
| 1 | Start cpu stress on host. TICKscript is not modified | 2+1 (cpu-total) | 3 | OK | CRIT: Series – cpu0/myhost; CRIT: Series – cpu1/myhost; CRIT: Series – cpu-total/myhost |
| 2 | Stop cpu stress on host. TICKscript is not modified | 2+1 (cpu-total) | 3 | OK | OK: Series – cpu0/myhost; OK: Series – cpu1/myhost; OK: Series – cpu-total/myhost |
| 3 | Modify TICKscript, set thresholds to fire CRITs | 2+1 (cpu-total) | 3 | OK | CRIT: Series – cpu0/myhost; CRIT: Series – cpu1/myhost; CRIT: Series – cpu-total/myhost |
| 4 | Modify TICKscript, set thresholds to fire OK after the CRITs | 2+1 (cpu-total) | 1 | NOOK | OK: Series – (RANDOM?)/myhost |

sbengo commented 6 years ago

To add more info: we want to send alerts in our production environment and, as I explained in the first comment, Kapacitor generates an alert for each event.

If the TICKscript is changed, those alerts persist in our monitoring system at their PreviousLevel, because they never receive their OK event.
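
One detail that may be worth checking when reproducing this: the script above sets a constant `.id(ID)`, so all N series share the single alert ID `ticks_cpu`. Whether that is related to the missing OK events is an assumption, not something confirmed here. A variant that gives each series its own ID through the `{{ .Group }}` ID template could help narrow it down. This is an untested sketch, not a confirmed fix:

```
// Same pipeline as the original script; only the alert ID differs.
var FIELD = 'usage-idle'
var TH_CRIT_DEF = 0.0

stream
    |from()
        .database('telegraf')
        .retentionPolicy('autogen')
        .measurement('cpu')
        .groupBy(*)
    |default()
        .field(FIELD, 0.0)
    |window()
        .period(1m)
        .every(10s)
        .align()
    |mean(FIELD)
        .as('value')
    |alert()
        .crit(lambda: float("value") < TH_CRIT_DEF)
        // {{ .Group }} expands to the group's tag set, so each series gets a distinct ID
        .id('ticks_cpu/{{ .Group }}')
        .log('/tmp/test-cpu.log')
```

If step 4 then produces N OK events, the shared ID would be a likely factor; if it still produces only one, the ID can be ruled out.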

@nathanielc, @desa, can you review it please?

Thanks, Greetings!