influxdata / chronograf

Open source monitoring and visualization UI for the TICK stack
https://www.influxdata.com/time-series-platform/chronograf/

When adding an email handler to a Kapacitor rule the TICK script is wrong #4331

Closed stefanhorning closed 5 years ago

stefanhorning commented 6 years ago

When adding an email handler to an alert rule using Chronograf, the resulting TICK script is buggy and causes the alert message to be sent out every second or so.

Upon closer inspection I noticed the line 0s in the TICK script, which is only added when an email handler is added through the Chronograf GUI.

So with more context, the beginning of the TICK script looks somewhat like this:

var db = 'telegraf'
var rp = 'metrics'
var measurement = 'disk'
var groupBy = ['host_role']
var whereFilter = lambda: TRUE
var period = 5m
0s
var every = 30s
var name = 'Test'
var idVar = name + ':{{.Group}}'
var message = 'Test alert'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'
var crit = 70

Somehow I didn't manage to repair the TICK script by just removing the 0s line, as Kapacitor would just continue to spit out alerts (at least to the Slack channel we also had as a second handler). Even disabling/enabling the rule and restarting Kapacitor didn't help. But when creating a fresh rule without an email handler, everything seems fine again.

Let me know if I should rather report this problem to the Kapacitor project, but to me it looks like the bug is in the way Chronograf generates the TICK script.

dataviruset commented 6 years ago

I also want to know whether these "0s" lines are a problem or not. After creating alert rules in Chronograf, they show up here and there in the TICKscripts.

stefanhorning commented 6 years ago

@russorat As you are usually quite quick to respond to tickets, I believe this one fell through the cracks. Could you please leave a comment on how to further deal with this issue, as it currently blocks us from adding email alerts? Thanks!

russorat commented 6 years ago

@stefanhorning Sorry for missing this. I can't seem to recreate your issue. Could you describe the steps to reproduce?

stefanhorning commented 6 years ago

Thanks for your quick reply.

OK, so it took me a bit myself this time. It seems a combination of various things is leading to this.

First my preconditions / environment:

To reproduce, go to the Alerts/Tasks page and create a new alert rule (editing an existing one should also work) using the alert rule builder:

  1. In the alert rule, make sure you group metrics by some tag (otherwise it is a normal threshold alert)
  2. Add a Slack alert handler (not sure if this is necessary)
  3. Add an email alert handler
  4. Save

Open the same rule in the TICK editor and you should now find the 0s line in there, which will cause the alerts to go crazy (once the alert threshold is crossed).

For easier debugging, here is the entire TICK script I created today (through the GUI) following the above steps:

var db = 'telegraf'
var rp = 'metrics'
var measurement = 'disk'
var groupBy = ['host_role']
var whereFilter = lambda: TRUE
var period = 1m
0s
var every = 30s
var name = 'Test bug'
var idVar = name + ':{{.Group}}'
var message = 'Test alerting bug.'
var idTag = 'alertID'
var levelTag = 'level'
var messageField = 'message'
var durationField = 'duration'
var outputDB = 'chronograf'
var outputRP = 'autogen'
var outputMeasurement = 'alerts'
var triggerType = 'threshold'
var crit = 90

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |window()
        .period(period)
        .every(every)
        .align()
    |min('used_percent')
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .email()
        .to('foo@bar.com')
        .slack()
        .channel('#operations')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

Hope this helps!

russorat commented 6 years ago

@stefanhorning I think the alert going crazy might be more related to the fact that we are not adding a "stateChangesOnly" option to the alert trigger. We've fixed an issue related to that before, but I wonder if it has been re-introduced.

The 0s should probably be on the line above, which is also strange, but it shouldn't be detrimental to the script execution IMO, although I haven't verified that.
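
For context, a minimal sketch of what stateChangesOnly changes on an alert node (the measurement and threshold here are just placeholders, not taken from your rule):

stream
    |from()
        .measurement('disk')
    |alert()
        .crit(lambda: "used_percent" > 90)
        // without this, an alert event is sent to the handlers on every
        // evaluation while the level stays CRIT; with it, only on level changes
        .stateChangesOnly()
        .slack()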

stefanhorning commented 6 years ago

OK, I will try out whether adding stateChangesOnly fixes the issue. Will get back to you once I have some results.

stefanhorning commented 6 years ago

Yes, you are right. I compared with alert rules that have only one alert handler, and they all have the .stateChangesOnly() method right before the handler. After adding it manually to the rule with the two handlers, the issue with too many alerts seems to be resolved.
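
For reference, here is the relevant part of the trigger node after patching it by hand (same script as above, only the .stateChangesOnly() line is added, right before the handlers):

var trigger = data
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        // added manually; the generated script was missing this with both handlers configured
        .stateChangesOnly()
        .email()
        .to('foo@bar.com')
        .slack()
        .channel('#operations')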

So I guess we can close this ticket and derive two new ones from it for the Chronograf TICK-generating logic:

  1. The stray 0s line that is added to the generated script when an email handler is configured
  2. The missing .stateChangesOnly() call when more than one alert handler is configured

So feel free to close this issue if those issues have been addressed.

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 5 years ago

This issue has been automatically closed because it has not had recent activity. Feel free to reopen if this issue is still important to you. Thank you for your contributions.