influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Understanding Kapacitor’s deadman Act #1433

Open zargex opened 7 years ago

zargex commented 7 years ago

Hi, I'm using Telegraf to supervise some processes, so I thought it would be useful if Kapacitor could notify me when any process dies.

I think this can be done with Kapacitor's deadman switch (if a process dies, Telegraf's procstat input can't send data to InfluxDB). But I'm having trouble with this: if I kill the process, Kapacitor sends the alert, but when I restart the process, Kapacitor keeps sending the alert.

I'm trying something like this:

|deadman(1.0, 10s)
.slack()
.channel('#alerts-staging')

If no point has arrived within 10s, send the alert (I think that's how it works).

But when I run kapacitor show on the task, I see graph [throughput="0.00 points/s"]; I don't know if this throughput matters.
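For context, the Kapacitor docs describe deadman(threshold, interval) as shorthand for a stats/derivative/alert chain, so deadman(1.0, 10s) behaves roughly like the sketch below (the real expansion also sets an id and message; those are omitted here for brevity):

data
  |stats(10s)
    .align()
  |derivative('emitted')
    .unit(10s)
    .nonNegative()
  |alert()
    // CRITICAL whenever 1.0 or fewer points arrived in the last 10s
    .crit(lambda: "emitted" <= 1.0)

This is why the throughput shown by kapacitor show matters: the deadman alerts on the rate of points flowing through that edge of the graph.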

I'm using kapacitor 1.3.1-1 on Debian 8.

Thanks.

nathanielc commented 7 years ago

@zargex Can you share the first part of the TICKscript as well, the part before the deadman?

zargex commented 7 years ago
stream
  |from()
    .measurement('procstat')
    .where(lambda: "pidfile" == '/tmp/svc-sendgrid-subscriber.pid')
    .where(lambda: "host" == 'localhost')
  |eval(lambda: float("memory_rss") / (1024.0*1024.0))
    .as('memory_rss')
  |alert()
    .id('sendgrid-subscriber MEM USAGE Alert')
    .message('The sendgrid-subscriber mem usage is {{.Level}} on host: {{ index .Tags "host" }}, the sendgrid-subscriber is using this amount of ram : {{ index .Fields "memory_rss" }} MB')
    .crit(lambda: "memory_rss" >  512)
    .warn(lambda: "memory_rss" >  256)
    .info(lambda: "memory_rss" >  100)
    .slack()
    .channel('#alerts-staging')
    .stateChangesOnly(10m)

I put the deadman after the from section and before the eval section.

adityacs commented 7 years ago

@zargex Could you please try the script below?

stream
  |from()
    .database('telegraf')
    .retentionPolicy('autogen')
    .measurement('procstat')
    .where(lambda: "pidfile" == '/tmp/svc-sendgrid-subscriber.pid')
    .where(lambda: "host" == 'localhost')
  |deadman(1.0, 10s)
    .slack()
    .channel('#alerts-staging')
  |eval(lambda: float("memory_rss") / (1024.0*1024.0))
    .as('memory_rss')
  |alert()
    .id('sendgrid-subscriber MEM USAGE Alert')
    .message('The sendgrid-subscriber mem usage is {{.Level}} on host: {{ index .Tags "host" }}, the sendgrid-subscriber is using this amount of ram : {{ index .Fields "memory_rss" }} MB')
    .crit(lambda: "memory_rss" >  512)
    .warn(lambda: "memory_rss" >  256)
    .info(lambda: "memory_rss" >  100)
    .slack()
    .channel('#alerts-staging')
    .stateChangesOnly(10m)
zargex commented 7 years ago

@adityacs I tried what you proposed, but I only get notifications saying the alert is dead. What I understand is that my throughput is very low, so the deadman switch is triggered.

If I use Kapacitor's show command on that alert, I get graph [throughput="0.00 points/s"]; but sometimes I get graph [throughput="18.00 points/s"];

Telegraf is using the default interval of 10 seconds for all plugins. Maybe if I reduce this interval, Kapacitor will work as I expect.
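A possible explanation (a sketch, with illustrative values): with Telegraf emitting roughly one point per 10s and the deadman window also set to 10s, the emitted count per window sits right at the 1.0 threshold, so the alert can flap between dead and alive depending on point timing. Widening the window relative to the collection interval, and alerting only on zero points, gives some margin:

// Telegraf sends ~1 point per 10s, so a 30s window normally sees ~3 points;
// alert only when no points at all arrived (process really gone)
|deadman(0.0, 30s)
  .stateChangesOnly()
  .slack()
  .channel('#alerts-staging')

Since deadman produces a regular alert node, properties like .stateChangesOnly() can be chained onto it to suppress repeated notifications.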

Marbaf commented 4 years ago

The proposition above didn't work with Kapacitor 1.5, but this does:

// threshold and interval must be defined; example values, tune to your data rate
var threshold = 0.0
var interval = 10s

var data = stream
  |from()
    .measurement('cpu')
    .groupBy(*)

data
  |alert()
    .crit(lambda: "usage_idle" < 10)
    .topic('cpu')

data
  |deadman(threshold, interval)

From https://stackoverflow.com/questions/45556226/how-to-add-a-deadmans-switch-to-an-existing-alert
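The key point is that assigning the stream to a var lets the deadman branch off the raw data instead of sitting inline in the alert pipeline. Applied to the procstat script from earlier in this thread, the same pattern might look like this (a sketch; the threshold and interval values are illustrative):

var data = stream
  |from()
    .measurement('procstat')
    .where(lambda: "pidfile" == '/tmp/svc-sendgrid-subscriber.pid')
    .where(lambda: "host" == 'localhost')

// branch 1: the memory-usage alert
data
  |eval(lambda: float("memory_rss") / (1024.0*1024.0))
    .as('memory_rss')
  |alert()
    .crit(lambda: "memory_rss" > 512)
    .slack()
    .channel('#alerts-staging')

// branch 2: the deadman, watching the raw stream
data
  |deadman(0.0, 30s)
    .stateChangesOnly()
    .slack()
    .channel('#alerts-staging')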