Open LatticeEngineering opened 7 years ago
@LatticeEngineering Would it be possible to get the output of
kapacitor stats ingress
kapacitor stats general
Additionally, it would be helpful to get the output of kapacitor show <task> for one of the two types of tasks you have.
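For reference, a minimal shell sketch for collecting all of that output into one file to attach here (the <task_id> placeholder is hypothetical; substitute one of your own task IDs, e.g. from kapacitor list tasks):

{
  kapacitor stats ingress
  kapacitor stats general
  kapacitor show <task_id>
} > kapacitor_diagnostics.txt 2>&1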
[root@kapacitor2 ~]# kapacitor stats ingress
Database   Retention Policy   Measurement                      Points Received
_internal  monitor            cq                               1
_internal  monitor            database                         5
_internal  monitor            httpd                            1
_internal  monitor            queryExecutor                    1
_internal  monitor            runtime                          1
_internal  monitor            shard                            48
_internal  monitor            subscriber                       7
_internal  monitor            tsm1_cache                       48
_internal  monitor            tsm1_engine                      48
_internal  monitor            tsm1_filestore                   48
_internal  monitor            tsm1_wal                         48
_internal  monitor            write                            1
telegraf   one_year           cloudwatch_aws_application_elb   124
telegraf   one_year           cloudwatch_aws_auto_scaling      19
telegraf   one_year           cloudwatch_aws_ebs               16
telegraf   one_year           cloudwatch_aws_ec2               19
telegraf   one_year           cloudwatch_aws_ecs               25
telegraf   one_year           cloudwatch_aws_elb               101
telegraf   one_year           cloudwatch_aws_ops_works         318
telegraf   one_year           cloudwatch_aws_rds               234
telegraf   one_year           cpu                              1987
telegraf   one_year           disk                             8383
telegraf   one_year           diskio                           934
telegraf   one_year           docker                           1557
telegraf   one_year           docker_container_blkio           6664
telegraf   one_year           docker_container_cpu             24119
telegraf   one_year           docker_container_mem             1544
telegraf   one_year           docker_container_net             1037
telegraf   one_year           docker_data                      519
telegraf   one_year           docker_metadata                  519
telegraf   one_year           http_response_grafana            5
telegraf   one_year           http_response_kafka_conn         12
telegraf   one_year           http_response_kafka_sreg         12
telegraf   one_year           http_response_namenode           4
telegraf   one_year           kernel                           792
telegraf   one_year           mem                              792
telegraf   one_year           net_response_RDS                 16
telegraf   one_year           net_response_SQL                 3
telegraf   one_year           net_response_influx              8
telegraf   one_year           net_response_kafka               12
telegraf   one_year           net_response_resourcemanager     4
telegraf   one_year           net_response_splunk              4
telegraf   one_year           netstat                          515
telegraf   one_year           processes                        787
telegraf   one_year           swap                             549
telegraf   one_year           system                           1584
telegraf   one_year           win_cpu                          10
telegraf   one_year           win_disk                         10
telegraf   one_year           win_mem                          8
telegraf   one_year           win_system                       3
telegraf   one_year           zookeeper                        3
telegraf   one_year           zookeeper_kafka_zoo              12
telegraf   one_year           zookeeper_zkelb                  4
[root@kapacitor2 ~]# kapacitor stats general
ClusterID: 93ada927-097c-41b1-954f-56a4c9ddf997
ServerID: 16047fb4-c530-448b-b9cd-555b61664519
Host: kapacitor2
Tasks: 24
Enabled Tasks: 22
Subscriptions: 6
Version: 1.3.0
The log file kaplog.txt
A couple of show tasks:

[root@kapacitor2 ~]# kapacitor show kafka_sreg_dm
ID: kafka_sreg_dm
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 21 Apr 17 19:53 UTC
Modified: 03 Aug 17 00:04 UTC
LastEnabled: 03 Aug 17 00:04 UTC
Databases Retention Policies: ["telegraf"."one_year"]
TICKscript:
// Dataframe
var dm_period = 60s
var dm_points = 1.0

stream
    |from()
        .database('telegraf')
        .retentionPolicy('one_year')
        .measurement('http_response_kafka_sreg')
        .groupBy('server')
    |deadman(dm_points, dm_period)
        .id('{{ index .Tags "server"}}/Schema Registry/{{ index .Tags "port" }}')
        .slack()
        .pagerDuty()
DOT:
digraph kafka_sreg_dm {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="1140"];

from1 [avg_exec_time_ns="4.058µs" errors="0" working_cardinality="0" ];
from1 -> noop3 [processed="1140"];

noop3 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];

stats2 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stats2 -> derivative4 [processed="22"];

derivative4 [avg_exec_time_ns="1.986µs" errors="0" working_cardinality="4" ];
derivative4 -> alert5 [processed="18"];

alert5 [alerts_triggered="0" avg_exec_time_ns="11.249µs" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="3" ];
}

[root@kapacitor2 ~]# kapacitor show kafka_sreg_ping
ID: kafka_sreg_ping
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 21 Apr 17 19:53 UTC
Modified: 03 Aug 17 00:04 UTC
LastEnabled: 03 Aug 17 00:04 UTC
Databases Retention Policies: ["telegraf"."one_year"]
TICKscript:
// Dataframe
var period = 5m
var every = 5m
var maxresponse = 1.5
var crit = 5
var warn = 2
var data = stream
    |from()
        .database('telegraf')
        .retentionPolicy('one_year')
        .measurement('http_response_kafka_sreg')
        .groupBy('server')
        .where(lambda: "response_time" > maxresponse)
    |window()
        .period(period)
        .every(every)
    |count('response_time')
        .as('num_slow')

var alert = data
    |alert()
        .id('{{ index .Tags "server"}}/Schema Registry Ping/{{ index .Tags "port" }}')
        .message('{{ .ID }}:possible Schema Registry connectivity issues')
        .crit(lambda: "num_slow" > crit)
        .slack()
DOT:
digraph kafka_sreg_ping {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="1140"];

from1 [avg_exec_time_ns="3.015µs" errors="0" working_cardinality="0" ];
from1 -> window2 [processed="90"];

window2 [avg_exec_time_ns="2.721µs" errors="0" working_cardinality="3" ];
window2 -> count3 [processed="1"];

count3 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
count3 -> alert4 [processed="1"];

alert4 [alerts_triggered="1" avg_exec_time_ns="0s" crits_triggered="1" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="1" ];
}
I turned off the two biggest frequent flyers, hoping our engineers fix the underlying issue one day. If needed, I'll turn them back on.
Thank you.
Hi there, we also encountered an out-of-memory error with Kapacitor.
The behaviour was actually quite strange: it was not a smooth, gradual growth but a very large and abrupt jump.
Kapacitor had been running for three days at roughly the same level of RAM usage, then within 5 minutes it completely filled the host's 32 GB of RAM and died with an out-of-memory panic.
Here is the output of kapacitor stats (myMetricX is just an obfuscated name for our internal custom metrics, for privacy purposes):
kapacitor_stats_ingress.txt kapacitor_stats_general.txt
Syslog doesn't seem to have anything relevant around the beginning of the big jump, apart from InfluxDB being unable to post anything to Kapacitor:
Aug 20 18:50:40 myHost influxd[18449]: [I] 2017-08-20T16:50:40Z Post http://127.0.0.1:9092/write?consistency=&db=telegraf&precision=ns&rp=realtime: net/http: request canceled (Client.Timeout exceeded while awaiting headers) service=subscriber
Aug 20 18:50:40 myHost influxd[18449]: message repeated 5 times: [ [I] 2017-08-20T16:50:40Z Post http://127.0.0.1:9092/write?consistency=&db=telegraf&precision=ns&rp=realtime: net/http: request canceled (Client.Timeout exceeded while awaiting headers) service=subscriber]
...
...
...
Aug 20 19:12:13 myHost kapacitord[18500]: fatal error: runtime: out of memory
Aug 20 19:12:13 myHost kapacitord[18500]: runtime stack:
Aug 20 19:12:13 myHost kapacitord[18500]: runtime.throw(0x1fd2e7b, 0x16)
Aug 20 19:12:13 myHost kapacitord[18500]: #011/usr/local/go/src/runtime/panic.go:566 +0x95
Aug 20 19:12:13 myHost kapacitord[18500]: runtime.sysMap(0xcc175e0000, 0x100000, 0x456200, 0x2eaa838)
Aug 20 19:12:13 myHost kapacitord[18500]: #011/usr/local/go/src/runtime/mem_linux.go:219 +0x1d0
...
We will investigate whether a TICK script is completely crashing and burning Kapacitor, but Kapacitor should be robust against a failing TICK script.
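One way we might see where that memory goes, assuming the default HTTP port 9092 and that the build exposes the Go pprof handlers under /kapacitor/v1/debug/pprof, is to grab a heap profile and goroutine dump while the usage is climbing:

# Heap profile (live allocations) and goroutine dump, captured during the spike.
curl -o kapacitor_heap.pprof "http://localhost:9092/kapacitor/v1/debug/pprof/heap"
curl -o kapacitor_goroutines.txt "http://localhost:9092/kapacitor/v1/debug/pprof/goroutine?debug=1"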
Thanks a lot for your work and have a nice day, Albin.
Kapacitor 1.3.0
This happens about once an hour. It's new; the same TICK scripts have run for months. I moved from an AWS t2.small to a t2.medium, hoping that going from 2 GB to 4 GB of RAM would help. What can I post as helpful diagnostics?
I generate about 100 events an hour and have 24 TICK scripts.
About half do 60-second averaging, the other half 5-minute.
Approximately 100 systems feed into Telegraf.
Is there a metric for memory requirements?
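As a hedged aside, assuming the default HTTP port 9092, Kapacitor publishes the Go runtime's memory statistics through its debug vars, which is one place to watch its own memory usage:

# Dump Kapacitor's internal stats, including Go memstats (Alloc, Sys, HeapInuse, ...).
kapacitor vars
# or over HTTP:
curl "http://localhost:9092/kapacitor/v1/debug/vars"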