influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Procstat alerts on all nodes when one influx enterprise node goes down #1532

Open cl0udgeek opened 7 years ago

cl0udgeek commented 7 years ago

I have a tick script that looks like this:

batch
    |query('''
            SELECT count(pid) as count
            from "sdp_monitoring"."13m"."procstat"
            where "pattern" = 'kafka'
            ''')
        .groupBy('cluster', 'host')
        .period(45s)
        .every(10s)
    |deadman(0.0, 10s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')

and telegraf input that has this:

[agent]
  interval = "10s"

[global_tags]
  cluster = "{{ kafka_cluster }}"
  broker = "{{ ansible_hostname }}"

[[inputs.procstat]]
  pattern = "kafka"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.disk]]

[[inputs.diskio]]

[[inputs.processes]]

[[inputs.mem]]

[[inputs.system]]

[[outputs.influxdb]]
  urls = [ "{{ influx_cluster }}" ]
  database = "monitoring"
  retention_policy = "13m"
  precision = "s"

I'm also running a 7-node influx cluster (3 meta and 4 data nodes) and have a replication factor of 2 on the monitoring db. I've simulated this: when I take one influx node down, Kapacitor throws a process-down alert and then, right after, a process-up alert on the state change. The Kafka processes themselves are fine; the alert looks like it's being triggered just by taking one influxdb node down.
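For context, the retention policy described would look something like the following InfluxQL; the 30d duration is only an illustrative placeholder (it isn't stated in this thread), and the database and policy names are taken from the query above:

-- duration here is hypothetical; only the REPLICATION 2 part is known
CREATE RETENTION POLICY "13m" ON "sdp_monitoring" DURATION 30d REPLICATION 2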

desa commented 7 years ago

@k1ng87 Thanks for the report. I've got a couple of asks:

  1. Does this happen if any one of the data nodes is taken down? If not, is there a specific one that causes the issue?
  2. Would it be possible to see your Kapacitor configuration? The section on InfluxDB in particular.

My current thinking is that Kapacitor is configured against a specific InfluxDB data node in the cluster. And so when that node goes down and is no longer reachable, it triggers the deadman.

One way to verify that this is the case would be to check the Kapacitor logs. If you see logs like the following

[<task_name>:query1] 2017/08/23 10:33:34 E! Post ....

That would be a good indicator.

If this is indeed the case, a quick fix would be to point Kapacitor to the load balancer in front of the data nodes.

The longer term fix would be to have Kapacitor do some kind of service discovery for all of the data nodes in the cluster.
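For reference, the piece of kapacitor.conf in question is the [[influxdb]] section. A minimal sketch of the quick fix above, with the cluster name and load balancer hostname as placeholders:

[[influxdb]]
  enabled = true
  default = true
  name = "sdp-cluster"
  # Point at the LB / Route 53 record rather than an individual data node.
  urls = ["http://influx-lb.example.com:8086"]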

cl0udgeek commented 7 years ago

@desa it's happening when any node goes down. I have kapacitor pointed to a Route 53 record that sits in front of an ELB.

desa commented 7 years ago

@k1ng87 That rules that out. When you take down the data node, do you remove that node from the LB?

cl0udgeek commented 7 years ago

nope...just do a service stop

cl0udgeek commented 7 years ago

I'm trying to simulate a failure...the original issue happened when one of my data nodes (an EC2 instance) went corrupt.

desa commented 7 years ago

I see. I think I know what's happening.

  1. The process is stopped.
  2. Eventually Kapacitor queries the node that went down since it was still in the LB.
  3. That query fails, and the deadman triggers a CRIT.
  4. The next query succeeds, and the alert returns to OK.

There are a couple of things that I can think of that will somewhat remedy the issue in the short term.

  1. Increase the deadman's interval to double the query's every interval. This way a single failed query (e.g. when one node goes down) won't trip the deadman.
  2. Have an external service that pings each of the data nodes in the cluster to see if they still respond (see the sketch below). If a node doesn't respond, trigger an alert and remove it from the LB.

Both of these are temporary fixes. We'll need a better solution long term, but they should mend things for now.
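A minimal sketch of option 2, assuming the data nodes expose InfluxDB's standard /ping endpoint on their HTTP port (8086 by default); the node list and the alerting hook are placeholders, not part of the original setup:

#!/usr/bin/env python3
# Hypothetical health check: poll each data node's /ping endpoint and
# report any node that stops answering so it can be pulled from the LB.
import urllib.error
import urllib.request

# Placeholder node list; InfluxDB answers GET /ping with a 204 when healthy.
DATA_NODES = [
    "http://influx-data-1.example.com:8086",
    "http://influx-data-2.example.com:8086",
]

def is_healthy(base_url, timeout=5):
    try:
        with urllib.request.urlopen(base_url + "/ping", timeout=timeout) as resp:
            return resp.status in (200, 204)
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for node in DATA_NODES:
        if not is_healthy(node):
            # Replace with a real alert and an LB API call in practice.
            print("ALERT: %s is not responding; remove it from the LB" % node)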

cl0udgeek commented 7 years ago

so right now it's this:

batch
    |query('''
            SELECT count(pid) as count
            from "sdp_monitoring"."13m"."procstat"
            where "pattern" = 'kafka'
            ''')
        .groupBy('cluster', 'host')
        .period(45s)
        .every(10s)
    |deadman(0.0, 10s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')

so change the deadman or the query?

also, telegraf is sending the procstat input every 10s

desa commented 7 years ago

I'd try changing the deadman

batch
    |query('''
            SELECT count(pid) as count
            from "sdp_monitoring"."13m"."procstat"
            where "pattern" = 'kafka'
            ''')
        .groupBy('cluster', 'host')
        .period(45s)
        .every(10s)
    |deadman(0.0, 20s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')

cl0udgeek commented 7 years ago

So I gave that a shot, but it happened again last night in one of our QA environments...

desa commented 7 years ago

@k1ng87 hmm. My reasoning was a bit off previously. I was mistakenly assuming that only Kapacitor was interacting with the cluster.

I'm still pretty sure the issue is related to the LB. I'll do some work today to try to confirm my theory. I'm curious to know if the problem still exists if I remove the downed node from the LB.

My plan is the following

  1. Start a 3 data node cluster
  2. Point the LB at two of the data nodes
  3. Set up the Kapacitor task
  4. Kill the data node that doesn't have the LB pointing at it
  5. See if deadman is triggered

My guess is that the deadman will not be triggered. And the reason we're seeing the deadman trigger in your environment is that the batch queries keep hitting the same downed node, which is still in the LB.

Did you notice if there were any logs like

[<task_name>:query<n>] 2017/08/23 10:33:34 E! Post ....

around the time of the deadman trigger?

Looking at the task a bit more, I think it might be possible to convert this task to a stream task

stream
    |from()
      .database('sdp_monitoring')
      .retentionPolicy('13m')
      .measurement('procstat')
      .where(lambda: "pattern" == 'kafka')
      .groupBy('cluster', 'host')
    |window()
      .period(45s)
      .every(10s)
    |count('pid')
      .as('pid')
    |deadman(0.0, 20s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')

Using a stream instead of a batch will remove the LB from the equation.

cl0udgeek commented 7 years ago

so I did upgrade to the current release of influx, and taking a node down did not trigger the deadman.....but for some reason, the deadman got triggered in that env last night...

I did search for that E! Post message in the kapacitor logs but did not see anything in there...

will try to change it to a stream too and see what happens...

desa commented 7 years ago

I'd definitely recommend trying the stream task.

cl0udgeek commented 7 years ago

going to try it, but can you help me understand why the logic would run differently in a stream vs. a batch?

cl0udgeek commented 7 years ago

so I just tried it and it seems the stream one is missing the tags output in the data when it alerts...no bueno :-(

in batch, the data object looks like this:

"data": {
    "series": [
      {
        "values": [
          [
            "2017-06-20T15:05:06.708243035Z", 
            18.38074168678113
          ]
        ], 
        "name": "cpu", 
        "columns": [
          "time", 
          "mean"
        ], 
        "tags": {
          "cluster": "influx", 
          "host": "ip-5.dqa.domain.com"
        }
      }
    ]
  }

in the stream it looks like this:

"data": {
    "series": [
      {
        "values": [
          [
            "2017-08-25T15:22:00Z", 
            0
          ]
        ], 
        "name": "stats", 
        "columns": [
          "time", 
          "emitted"
        ]
      }
    ]
  }

seems to be missing the tags part...

desa commented 7 years ago

Here's a reference with a bit more detail https://community.influxdata.com/t/batch-versus-stream-processing-combine/792/2

The difference between a stream and a batch is that with a batch task, Kapacitor issues queries to InfluxDB, and with a stream task, all writes to the InfluxDB cluster are mirrored to Kapacitor.

My running theory is that since the downed node is still a part of the LB, it's possible for Kapacitor to issue queries to the downed node, which would result in the deadman firing.

A stream task would remove the LB from the equation since any node in the cluster that receives a write will mirror that write to Kapacitor.
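One quick way to confirm that the Kapacitor subscription actually exists on the cluster (not shown in this thread, but standard InfluxQL) is to run the following against a data node and look for a kapacitor entry under the database the task reads from:

-- lists the subscriptions each database is mirroring writes to
SHOW SUBSCRIPTIONS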

desa commented 7 years ago

so I just tried it and it seems the stream one is missing the tags output in the data when it alerts...no bueno :-(

Not sure I understand what this means. Can you show me the resulting alert data?

cl0udgeek commented 7 years ago

yeah sorry...just made an edit to the post showing it...also, I'm on this version: Kapacitor 1.3.1 (git: master 3b5512f7276483326577907803167e4bb213c613)

desa commented 7 years ago

Was this alert data

"data": {
    "series": [
      {
        "values": [
          [
            "2017-06-20T15:05:06.708243035Z", 
            18.38074168678113
          ]
        ], 
        "name": "cpu", 
        "columns": [
          "time", 
          "mean"
        ], 
        "tags": {
          "cluster": "influx", 
          "host": "ip-5.dqa.domain.com"
        }
      }
    ]
  }

generated from a deadman?

cl0udgeek commented 7 years ago

so it just happened again and this is what I see in the kapacitor logs when it happened:

[uncleanelection_alert:query1] 2017/08/25 15:38:40 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[connectproc_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'connect' AND time >= '2017-08-25T15:37:59.949575339Z' AND time < '2017-08-25T15:38:44.949575339Z' GROUP BY cluster, host
[connectproc_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[cpu_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT mean(usage_user) FROM sdp_monitoring."13m".cpu WHERE time >= '2017-08-25T15:38:14.955109092Z' AND time < '2017-08-25T15:38:44.955109092Z' GROUP BY cluster, host
[cpu_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[disk_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT max(used_percent) FROM sdp_monitoring."13m".disk WHERE time >= '2017-08-25T15:33:44.966570636Z' AND time < '2017-08-25T15:38:44.966570636Z' GROUP BY cluster, host
[disk_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[mem_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT mean(used_percent) FROM sdp_monitoring."13m".mem WHERE time >= '2017-08-25T15:37:44.993644144Z' AND time < '2017-08-25T15:38:44.993644144Z' GROUP BY cluster, host
[mem_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[schemaproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:00.004506299Z' AND time < '2017-08-25T15:38:45.004506299Z' GROUP BY cluster, host
[schemaproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[schemarproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:00.010261082Z' AND time < '2017-08-25T15:38:45.010261082Z' GROUP BY cluster, host
[schemarproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[zookeeperproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'zookeeper' AND time >= '2017-08-25T15:38:00.038542172Z' AND time < '2017-08-25T15:38:45.038542172Z' GROUP BY cluster, host
[zookeeperproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[kafkaproc_alert:query1] 2017/08/25 15:38:48 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'kafka' AND time >= '2017-08-25T15:38:03.070066125Z' AND time < '2017-08-25T15:38:48.070066125Z' GROUP BY cluster, host
[kafkaproc_alert:query1] 2017/08/25 15:38:48 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[kafka_upr_alert:query1] 2017/08/25 15:38:48 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UnderReplicatedPartitions' AND time >= '2017-08-25T15:38:08.585839971Z' AND time < '2017-08-25T15:38:48.585839971Z' GROUP BY cluster, host
[kafka_upr_alert:query1] 2017/08/25 15:38:48 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[offline_partition_alert:query1] 2017/08/25 15:38:49 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'OfflinePartitionsCount' AND time >= '2017-08-25T15:38:09.701383591Z' AND time < '2017-08-25T15:38:49.701383591Z' GROUP BY cluster, host
[offline_partition_alert:query1] 2017/08/25 15:38:49 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[uncleanelection_alert:query1] 2017/08/25 15:38:50 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UncleanLeaderElectionsPerSec' AND time >= '2017-08-25T15:38:10.593466722Z' AND time < '2017-08-25T15:38:50.593466722Z' GROUP BY cluster, host
[uncleanelection_alert:query1] 2017/08/25 15:38:50 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[connectproc_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'connect' AND time >= '2017-08-25T15:38:09.949580921Z' AND time < '2017-08-25T15:38:54.949580921Z' GROUP BY cluster, host
[connectproc_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[cpu_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT mean(usage_user) FROM sdp_monitoring."13m".cpu WHERE time >= '2017-08-25T15:38:24.955099869Z' AND time < '2017-08-25T15:38:54.955099869Z' GROUP BY cluster, host
[cpu_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[disk_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT max(used_percent) FROM sdp_monitoring."13m".disk WHERE time >= '2017-08-25T15:33:54.966570028Z' AND time < '2017-08-25T15:38:54.966570028Z' GROUP BY cluster, host
[disk_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[mem_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT mean(used_percent) FROM sdp_monitoring."13m".mem WHERE time >= '2017-08-25T15:37:54.993640223Z' AND time < '2017-08-25T15:38:54.993640223Z' GROUP BY cluster, host
[mem_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[schemaproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:10.004497407Z' AND time < '2017-08-25T15:38:55.004497407Z' GROUP BY cluster, host
[schemarproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:10.010253585Z' AND time < '2017-08-25T15:38:55.010253585Z' GROUP BY cluster, host
[schemaproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[schemarproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[zookeeperproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'zookeeper' AND time >= '2017-08-25T15:38:10.03853013Z' AND time < '2017-08-25T15:38:55.03853013Z' GROUP BY cluster, host
[zookeeperproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[kafkaproc_alert:query1] 2017/08/25 15:38:58 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'kafka' AND time >= '2017-08-25T15:38:13.070055764Z' AND time < '2017-08-25T15:38:58.070055764Z' GROUP BY cluster, host
[kafkaproc_alert:query1] 2017/08/25 15:38:58 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[kafka_upr_alert:query1] 2017/08/25 15:38:58 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UnderReplicatedPartitions' AND time >= '2017-08-25T15:38:18.585845744Z' AND time < '2017-08-25T15:38:58.585845744Z' GROUP BY cluster, host
[kafka_upr_alert:query1] 2017/08/25 15:38:58 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[offline_partition_alert:query1] 2017/08/25 15:38:59 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'OfflinePartitionsCount' AND time >= '2017-08-25T15:38:19.701395415Z' AND time < '2017-08-25T15:38:59.701395415Z' GROUP BY cluster, host
[offline_partition_alert:query1] 2017/08/25 15:38:59 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused

all of those IPs are influx nodes...

naw...that was just an example...that one was a CPU alert, but the deadman in batch includes the tags part too.

desa commented 7 years ago

What's the replication factor on sdp_monitoring."13m"?
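This can be checked with standard InfluxQL (not shown in the thread); the replication factor appears in the replicaN column:

SHOW RETENTION POLICIES ON "sdp_monitoring"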

cl0udgeek commented 7 years ago

have a replication factor of 2....running 4 data nodes and 3 meta nodes...

desa commented 7 years ago

@k1ng87 Just tested the stream version myself, and there's an initial trigger of the deadman that doesn't have tags, but that goes away after the deadman starts receiving data. Can you confirm that this is not the behavior you're seeing?

Before data:

"data":
{
  "series": [
    {
      "name": "stats",
      "columns": [
        "time",
        "emitted"
      ],
      "values": [
        [
          "2017-08-25T16:18:10Z",
          0
        ]
      ]
    }
  ]
}

After data

"data":
{
  "series": [
    {
      "name": "stats",
      "tags": {
        "cluster_id": "michaels-example-cluster",
        "cpu": "cpu-total",
        "host": "Michaels-MBP-2.router.edm"
      },
      "columns": [
        "time",
        "emitted"
      ],
      "values": [
        [
          "2017-08-25T16:03:30Z",
          1
        ]
      ]
    }
  ]
}

desa commented 7 years ago

Were those logs from the time when a data node was downed? Or just during normal runtime?

cl0udgeek commented 7 years ago

no, the data nodes were up and running...that happened when I started to see kapacitor throw deadman alerts...that was also when the tick scripts were running in batch mode, but I'm now redeploying them to run in stream mode

----EDIT------

correction...those two nodes were down...I just didn't notice it until a coworker pointed out the health check...

cl0udgeek commented 7 years ago

so I switched the tick scripts over to streams, but now I'm not getting any alerts...I see that kapacitor is subscribed to influx and the tasks are running, but I don't see any alerts come through.

desa commented 7 years ago

What's the output of

kapacitor show <task name>

This should show us where things are getting hung up.

cl0udgeek commented 7 years ago

nvm....configured the hostname wrong in the kapacitor config...working now

cl0udgeek commented 7 years ago

I'd like to open this back up as it still keeps happening...I did notice that kapacitor was maxing out the CPU, so we bumped the EC2 instance to a c4.4xlarge, but the alerts still happen sporadically, with all of the deadman alerts going off at once.

The EC2 instance runs in an ASG, and if I terminate it and let the ASG build a new one, it works fine for a while, but then the behavior comes back. I've looked at disk space and memory and those are fine too, and influx is healthy as well, but I'm still unsure why this keeps happening.