cl0udgeek opened this issue 7 years ago
@k1ng87 Thanks for the report. I've got a couple of asks.
My current thinking is that Kapacitor is configured against a specific InfluxDB data node in the cluster. And so when that node goes down and is no longer reachable, it triggers the deadman.
One way to verify that this is the case would be to check the Kapacitor logs. If you see logs like the following
[<task_name>:query1] 2017/08/23 10:33:34 E! Post ....
That would be a good indicator.
If this is indeed the case, a quick fix would be to point Kapacitor to the load balancer in front of the data nodes.
The longer term fix, would be to have Kapacitor do some kind of service discovery for all of the data nodes in the cluster.
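For context, the endpoint Kapacitor queries is set in the [[influxdb]] section of kapacitor.conf. A minimal sketch, assuming a hypothetical LB hostname rather than this environment's actual address:
# kapacitor.conf -- sketch only; the URL below is a placeholder for the
# load balancer / DNS name in front of the data nodes, not a real address.
[[influxdb]]
  enabled = true
  name = "influxdb"
  default = true
  # Pointing at the LB instead of a single data node means one downed node
  # doesn't take the batch queries down with it.
  urls = ["http://influxdb-lb.example.com:8086"]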
@desa it's happening when any node goes down. I have kapacitor pointed to a Route 53 record that sits in front of an ELB.
@k1ng87 That rules that out. When you take down the data node, do you remove that node from the LB?
nope...just do a service stop
trying to simulate a failure...it happened when one of my data nodes (ec2 instance) went corrupt.
I see. I think I know what's happening.
There are a couple of things that I can think of that will somewhat remedy the issue in the short term: remove the downed node from the LB, and increase the deadman interval so that it's larger than the every interval on the query. This way we'll be resilient to the failure of one node going down. Both of these are temporary fixes. We'll need a better solution long term, but this should mend things temporarily.
so right now it's this:
batch
    |query('''
        SELECT count(pid) as count
        from "sdp_monitoring"."13m"."procstat"
        where "pattern" = 'kafka'
    ''')
        .groupBy('cluster', 'host')
        .period(45s)
        .every(10s)
    |deadman(0.0, 10s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')
so change the deadman or the query?
also, telegraf is sending the procstat input every 10s
I'd try changing the deadman
batch
    |query('''
        SELECT count(pid) as count
        from "sdp_monitoring"."13m"."procstat"
        where "pattern" = 'kafka'
    ''')
        .groupBy('cluster', 'host')
        .period(45s)
        .every(10s)
    |deadman(0.0, 20s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')
So I gave that a shot, but it happened again last night in one of our QA environments...
@k1ng87 hmm. My reasoning was a bit off previously. I was assuming, mistakenly, that Kapacitor was the only thing interacting with the cluster.
I'm still pretty sure the issue is related to the LB. I'll do some work today to try to confirm my theory. I'm curious to know if the problem still exists if I remove the downed node from the LB.
My plan is the following: reproduce the setup, take a data node down, remove it from the LB, and then see whether the deadman still fires.
My guess is that the deadman will not be triggered. And the reason why we're seeing the deadman trigger is that the batch queries keep hitting the same node.
Did you notice if there were any logs like
[<task_name>:query<n>] 2017/08/23 10:33:34 E! Post ....
around the time of the deadman trigger?
Looking at the task a bit more, I think it might be possible to convert this task to a stream task
stream
    |from()
        .database('sdp_monitoring')
        .retentionPolicy('13m')
        .measurement('procstat')
        .where(lambda: "pattern" == 'kafka')
        .groupBy('cluster', 'host')
    |window()
        .period(45s)
        .every(10s)
    |count('pid')
        .as('pid')
    |deadman(0.0, 20s)
        .stateChangesOnly()
        .exec('/kapacitor/tick-scripts/hpomi.py', 'Kafka Process Down Alert', 'Kafka Process Down', 'Kafka Process Down')
Using a stream instead of a batch will remove the LB from the equation.
so I did upgrade to the current release of influx, and taking a node down did not trigger the deadman.....but for some reason, the deadman got triggered on the env last night...
I did search for this E! Post in the kapacitor logs but did not see anything in there...
will try to change it to a stream too and see what happens...
I'd definitely recommend trying the stream task.
going to try it but can you help me understand why the logic would run differently in a stream vs batch?
so I just tried it and it seems the stream one is missing the tags output in the data when it alerts...no bueno :-(
in batch...the data object looks like this:
"data": {
"series": [
{
"values": [
[
"2017-06-20T15:05:06.708243035Z",
18.38074168678113
]
],
"name": "cpu",
"columns": [
"time",
"mean"
],
"tags": {
"cluster": "influx",
"host": "ip-5.dqa.domain.com"
}
}
]
}
in the stream it looks like this:
"data": {
"series": [
{
"values": [
[
"2017-08-25T15:22:00Z",
0
]
],
"name": "stats",
"columns": [
"time",
"emitted"
]
}
]
}
seems to be missing the tags part...
Here's a reference with a bit more detail https://community.influxdata.com/t/batch-versus-stream-processing-combine/792/2
The difference between a stream and a batch is that with a batch task, Kapacitor issues queries to InfluxDB, and with a stream task, all writes to the InfluxDB cluster are mirrored to Kapacitor.
My running theory is that since the downed node is still a part of the LB, it's possible for Kapacitor to issue queries to the downed node, which would result in the deadman firing.
A stream task would remove the LB from the equation since any node in the cluster that receives a write will mirror that write to Kapacitor.
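For reference, the mirroring is configured through the subscription settings in the same [[influxdb]] section of kapacitor.conf. A rough sketch, with the database/retention policy taken from the task above and everything else illustrative:
# kapacitor.conf -- illustrative values only
[[influxdb]]
  enabled = true
  urls = ["http://influxdb-lb.example.com:8086"]
  # Kapacitor asks InfluxDB to create subscriptions so that writes get
  # mirrored back to it over the chosen protocol.
  subscription-protocol = "http"
  subscriptions-sync-interval = "1m0s"
  # Restrict the mirrored data to what the stream task actually reads.
  [influxdb.subscriptions]
    sdp_monitoring = ["13m"]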
so I just tried it and it seems the stream one is missing the tags output in the data when it alerts...no bueno :-(
Not sure I understand what this means. Can you show me the resulting alert data?
yeah sorry...just made an edit to the post showing it...also, I'm on this version: Kapacitor 1.3.1 (git: master 3b5512f7276483326577907803167e4bb213c613)
Was this alert data
"data": {
"series": [
{
"values": [
[
"2017-06-20T15:05:06.708243035Z",
18.38074168678113
]
],
"name": "cpu",
"columns": [
"time",
"mean"
],
"tags": {
"cluster": "influx",
"host": "ip-5.dqa.domain.com"
}
}
]
}
generated from a deadman?
so it just happened again and this is what I see in the kapacitor logs when it happened:
[uncleanelection_alert:query1] 2017/08/25 15:38:40 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[connectproc_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'connect' AND time >= '2017-08-25T15:37:59.949575339Z' AND time < '2017-08-25T15:38:44.949575339Z' GROUP BY cluster, host
[connectproc_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[cpu_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT mean(usage_user) FROM sdp_monitoring."13m".cpu WHERE time >= '2017-08-25T15:38:14.955109092Z' AND time < '2017-08-25T15:38:44.955109092Z' GROUP BY cluster, host
[cpu_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[disk_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT max(used_percent) FROM sdp_monitoring."13m".disk WHERE time >= '2017-08-25T15:33:44.966570636Z' AND time < '2017-08-25T15:38:44.966570636Z' GROUP BY cluster, host
[disk_alert:query1] 2017/08/25 15:38:44 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[mem_alert:query1] 2017/08/25 15:38:44 D! starting next batch query: SELECT mean(used_percent) FROM sdp_monitoring."13m".mem WHERE time >= '2017-08-25T15:37:44.993644144Z' AND time < '2017-08-25T15:38:44.993644144Z' GROUP BY cluster, host
[mem_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[schemaproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:00.004506299Z' AND time < '2017-08-25T15:38:45.004506299Z' GROUP BY cluster, host
[schemaproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[schemarproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:00.010261082Z' AND time < '2017-08-25T15:38:45.010261082Z' GROUP BY cluster, host
[schemarproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[zookeeperproc_alert:query1] 2017/08/25 15:38:45 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'zookeeper' AND time >= '2017-08-25T15:38:00.038542172Z' AND time < '2017-08-25T15:38:45.038542172Z' GROUP BY cluster, host
[zookeeperproc_alert:query1] 2017/08/25 15:38:45 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[kafkaproc_alert:query1] 2017/08/25 15:38:48 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'kafka' AND time >= '2017-08-25T15:38:03.070066125Z' AND time < '2017-08-25T15:38:48.070066125Z' GROUP BY cluster, host
[kafkaproc_alert:query1] 2017/08/25 15:38:48 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[kafka_upr_alert:query1] 2017/08/25 15:38:48 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UnderReplicatedPartitions' AND time >= '2017-08-25T15:38:08.585839971Z' AND time < '2017-08-25T15:38:48.585839971Z' GROUP BY cluster, host
[kafka_upr_alert:query1] 2017/08/25 15:38:48 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[offline_partition_alert:query1] 2017/08/25 15:38:49 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'OfflinePartitionsCount' AND time >= '2017-08-25T15:38:09.701383591Z' AND time < '2017-08-25T15:38:49.701383591Z' GROUP BY cluster, host
[offline_partition_alert:query1] 2017/08/25 15:38:49 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[uncleanelection_alert:query1] 2017/08/25 15:38:50 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UncleanLeaderElectionsPerSec' AND time >= '2017-08-25T15:38:10.593466722Z' AND time < '2017-08-25T15:38:50.593466722Z' GROUP BY cluster, host
[uncleanelection_alert:query1] 2017/08/25 15:38:50 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[connectproc_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'connect' AND time >= '2017-08-25T15:38:09.949580921Z' AND time < '2017-08-25T15:38:54.949580921Z' GROUP BY cluster, host
[connectproc_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[cpu_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT mean(usage_user) FROM sdp_monitoring."13m".cpu WHERE time >= '2017-08-25T15:38:24.955099869Z' AND time < '2017-08-25T15:38:54.955099869Z' GROUP BY cluster, host
[cpu_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[disk_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT max(used_percent) FROM sdp_monitoring."13m".disk WHERE time >= '2017-08-25T15:33:54.966570028Z' AND time < '2017-08-25T15:38:54.966570028Z' GROUP BY cluster, host
[disk_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[mem_alert:query1] 2017/08/25 15:38:54 D! starting next batch query: SELECT mean(used_percent) FROM sdp_monitoring."13m".mem WHERE time >= '2017-08-25T15:37:54.993640223Z' AND time < '2017-08-25T15:38:54.993640223Z' GROUP BY cluster, host
[mem_alert:query1] 2017/08/25 15:38:54 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[schemaproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:10.004497407Z' AND time < '2017-08-25T15:38:55.004497407Z' GROUP BY cluster, host
[schemarproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'schemar' AND time >= '2017-08-25T15:38:10.010253585Z' AND time < '2017-08-25T15:38:55.010253585Z' GROUP BY cluster, host
[schemaproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[schemarproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[zookeeperproc_alert:query1] 2017/08/25 15:38:55 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'zookeeper' AND time >= '2017-08-25T15:38:10.03853013Z' AND time < '2017-08-25T15:38:55.03853013Z' GROUP BY cluster, host
[zookeeperproc_alert:query1] 2017/08/25 15:38:55 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[kafkaproc_alert:query1] 2017/08/25 15:38:58 D! starting next batch query: SELECT count(pid) AS count FROM sdp_monitoring."13m".procstat WHERE pattern = 'kafka' AND time >= '2017-08-25T15:38:13.070055764Z' AND time < '2017-08-25T15:38:58.070055764Z' GROUP BY cluster, host
[kafkaproc_alert:query1] 2017/08/25 15:38:58 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
[kafka_upr_alert:query1] 2017/08/25 15:38:58 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'UnderReplicatedPartitions' AND time >= '2017-08-25T15:38:18.585845744Z' AND time < '2017-08-25T15:38:58.585845744Z' GROUP BY cluster, host
[kafka_upr_alert:query1] 2017/08/25 15:38:58 E! failed to get conn: dial tcp 10.47.7.252:8088: getsockopt: connection refused
[offline_partition_alert:query1] 2017/08/25 15:38:59 D! starting next batch query: SELECT mean(metric_value_number) FROM sdp_monitoring."13m".BrokerMetrics WHERE kafka_metric_name = 'OfflinePartitionsCount' AND time >= '2017-08-25T15:38:19.701395415Z' AND time < '2017-08-25T15:38:59.701395415Z' GROUP BY cluster, host
[offline_partition_alert:query1] 2017/08/25 15:38:59 E! failed to get conn: dial tcp 10.47.1.42:8088: getsockopt: connection refused
all of those IPs are influx nodes...
naw...that was just an example (a CPU one), but the deadman in batch includes the tags part too.
What's the replication factor on sdp_monitoring."13m"?
have a replication factor of 2....running 4 data nodes and 3 meta nodes...
@k1ng87 Just tested the stream version myself, and there's an initial trigger of the deadman that doesn't have tags, but that goes away after the deadman starts receiving data. Can you confirm that this is not the behavior you're seeing?
Before data:
"data":
{
"series": [
{
"name": "stats",
"columns": [
"time",
"emitted"
],
"values": [
[
"2017-08-25T16:18:10Z",
0
]
]
}
]
}
After data:
"data": {
    "series": [
        {
            "name": "stats",
            "tags": {
                "cluster_id": "michaels-example-cluster",
                "cpu": "cpu-total",
                "host": "Michaels-MBP-2.router.edm"
            },
            "columns": [
                "time",
                "emitted"
            ],
            "values": [
                [
                    "2017-08-25T16:03:30Z",
                    1
                ]
            ]
        }
    ]
}
Were those logs from the time when a data node was downed? Or just during normal runtime?
no, the data nodes were up and running...that happened when I started to see kapacitor throw alerts on deadman...that was also when the tick scripts were running in batch mode but now redeploying to run in stream mode
----EDIT------
correction...those two nodes were down...just didn't notice it before until a coworker pointed out the health check...
so I switched the tick scripts to streams, but now I'm not getting alerts...I see that it is subscribed to influx and the tasks are running, but I don't see any alerts come through.
What's the output of
kapacitor show <task name>
This should show us where things are getting hung up.
nvm....configured the hostname wrong in the kapacitor config...working now
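(The setting in question is the top-level hostname in kapacitor.conf; it has to be an address the InfluxDB data nodes can resolve and reach, since they use it to deliver the mirrored subscription writes. Placeholder value below:)
# kapacitor.conf -- top-level option; the value here is only an example.
# InfluxDB connects back to this address to push subscription (stream) data,
# so a wrong or unreachable hostname means stream tasks receive no points.
hostname = "kapacitor.internal.example.com"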
I'd like to open this back up as this still keeps happening...I did notice that kapacitor was maxing out the CPU, so we bumped the ec2 to a c4.4xlarge, but the alerts keep happening sporadically, where all of the deadman alerts go off at once.
The ec2 runs in an ASG, and if I terminate the ec2 and let the ASG build a new one, it works fine for some time, but then the behavior comes back. Looked at disk space and memory and those are fine too, and influx is healthy as well, but I'm still unsure why this is still happening.
I have a tick script that looks like this:
and telegraf input that has this:
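(A sketch of what that procstat input would look like, reconstructed from the details in this thread -- pattern 'kafka', reporting every 10s -- with all values illustrative rather than the exact config used here:)
# telegraf.conf -- illustrative sketch, not the actual config from this setup
[[inputs.procstat]]
  # Match processes whose command line contains this pattern (pgrep-style).
  pattern = "kafka"
  # Per-input collection interval, matching the 10s cadence mentioned above.
  interval = "10s"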
I'm also running a 7 node influx cluster (3 meta and 4 data) and have a replication factor of 2 on the monitoring db. I've simulated that when I take one influx node down, Kapacitor will throw a process down alert and then right after throw a process up alert on state change. The processes are fine; it looks like it's being triggered by just taking one influxdb node down.