influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Stream nodes not receiving data on kapacitor restart #770

Closed ejether closed 8 years ago

ejether commented 8 years ago

I'm having some difficulty keeping kapacitor happy and receiving data from influx.

I have two ticks, one reporting all data from a measurement to the logs

var qsize = stream
    |from()
        .measurement('activemq.queue')
    |log().level('INFO')

and another reporting to alerta with some conditions:

var qsize = stream
    |from()
        .measurement('activemq.queue')
        .where(lambda: "attributeName" == 'QueueSize')
    |alert()
        .crit(lambda: "QueueSize" > 0)
        .alerta()
        .resource('VCOR DLQ')
        .event('Too many dead letters')
        .value('{{ index .Fields "QueueSize"}}')
        .environment('Development')
        .details('Clear DLQ')
        .message('Stuff')

I was manually setting the log level to DEBUG in the kapacitor.conf, and our configuration management (CM) tooling was running every hour and setting it back to INFO. When this happened, Kapacitor would stop acknowledging data from Influx.

Using netcat I verified that the UDP port was open and listening on the Kapacitor server and reachable from the Influx server. Using tcpdump I could see that the packets were being received on the Kapacitor server.
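For reference, the netcat-style reachability check can be approximated in a few lines of Python. This is only a sketch of the technique; the payload is a placeholder, not Kapacitor's actual subscription wire format, and the port should be replaced with whatever Kapacitor logs on startup:

```python
import socket

def udp_probe(host="127.0.0.1", port=0, timeout=2.0):
    """Bind a UDP socket, fire a test datagram at it, and return what arrives.

    If this round-trip works on the host but Kapacitor's own listener shows
    no ingress, the problem is inside Kapacitor rather than the network.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    listener.settimeout(timeout)
    listener.bind((host, port))  # port=0 picks a free ephemeral port
    addr = listener.getsockname()

    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender.sendto(b"activemq.queue QueueSize=1", addr)  # placeholder payload

    data, _ = listener.recvfrom(65535)
    listener.close()
    sender.close()
    return data

if __name__ == "__main__":
    print(udp_probe())
```

To test against a live Kapacitor listener instead, point the sender at the host/port pair from the `Started listening on UDP` log lines.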

After this interruption, restarting the Kapacitor service and/or the Influx service had no apparent effect. The only way I was able to get Kapacitor to accept streaming data again was by deleting the contents of /var/lib/kapacitor, restarting the service, and re-defining and enabling the tasks.

I followed this procedure after configuration management set the logging level back to INFO and restarted, and it has been running as expected for a while now.

Hopefully I'm just missing something and you can tell me what I am doing wrong. Let me know if you need any more information. Thanks

ejether commented 8 years ago

Possibly the logging level thing is a red herring. Just doing a `service kapacitor restart` makes the ticks stop catching data.

nathanielc commented 8 years ago

Possibly the logging level thing is a red herring.

This sounds reasonable since the dynamic switching of log levels is not well tested.

Can you share the output of `kapacitor show TASK_NAME` for the two tasks before and after the issue?

The output of `kapacitor stats ingress` would be useful as well.

Which version are you using?

ejether commented 8 years ago

Thanks for responding!

`kapacitor stats ingress` isn't an available command for me.

Version Kapacitor 0.13.1 (git: master ae7dde3aa467c46c06a5a2058a80c6125203f599)

Alerta Tick

[root@qa-alert-01 kapacitor]# kapacitor show qa-vcor-amq-dlq
ID: qa-vcor-amq-dlq
Error:
Type: stream
Status: enabled
Executing: true
Created: 02 Aug 16 18:29 EDT
Modified: 02 Aug 16 18:54 EDT
LastEnabled: 02 Aug 16 18:54 EDT
Databases Retention Policies: ["jmxtrans"."default"]
TICKscript:
var qsize = stream
    |from()
        .measurement('activemq.queue')
        .where(lambda: "attributeName" == 'QueueSize')
    |alert()
        .crit(lambda: "QueueSize" > 0)
        .alerta()
        .resource('VCOR DLQ')
        .event('Too many dead letters')
        .value('{{ index .Fields "QueueSize"}}')
        .environment('Development')
        .details('Clear DLQ')
        .message('Stuff')

DOT:
digraph qa-vcor-amq-dlq {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="0"];

from1 [avg_exec_time_ns="0" ];
from1 -> alert2 [processed="0"];

alert2 [alerts_triggered="0" avg_exec_time_ns="0" crits_triggered="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" ];
}

Log tick

[root@qa-alert-01 kapacitor]# kapacitor show log
ID: log
Error:
Type: stream
Status: enabled
Executing: true
Created: 02 Aug 16 18:29 EDT
Modified: 02 Aug 16 18:53 EDT
LastEnabled: 02 Aug 16 18:53 EDT
Databases Retention Policies: ["jmxtrans"."default"]
TICKscript:
var qsize = stream
    |from()
        .measurement('activemq.queue')
    |log()
        .level('INFO')

DOT:
digraph log {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0" ];
stream0 -> from1 [processed="0"];

from1 [avg_exec_time_ns="0" ];
from1 -> log2 [processed="0"];

log2 [avg_exec_time_ns="0" ];
}

Log file contents on startup

[root@qa-alert-01 kapacitor]# cat /var/log/kapacitor/kapacitor.log
[run] 2016/08/02 19:03:45 I! Kapacitor starting, version 0.13.1, branch master, commit ae7dde3aa467c46c06a5a2058a80c6125203f599
[run] 2016/08/02 19:03:45 I! Go version go1.6.2, GOMAXPROCS set to 1
[srv] 2016/08/02 19:03:45 I! Kapacitor hostname: qa-alert-01.ost.smarshqa.com
[srv] 2016/08/02 19:03:45 I! ClusterID: 9393239a-962a-4723-a4b6-8fef99c78149 ServerID: a76d4d51-5c8a-4ee5-8f8c-58dc2e3fcff1
[task_master] 2016/08/02 19:03:45 I! opened
[srv] 2016/08/02 19:03:45 D! opening service: *udf.Service
[srv] 2016/08/02 19:03:45 D! opened service: *udf.Service
[srv] 2016/08/02 19:03:45 D! opening service: *deadman.Service
[srv] 2016/08/02 19:03:45 D! opened service: *deadman.Service
[srv] 2016/08/02 19:03:45 D! opening service: *influxdb.Service
[influxdb] 2016/08/02 19:03:45 I! linking subscriptions
[udp:jmxtrans.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:49897
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for jmxtrans default
[udp:_internal.monitor] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:44320
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for _internal monitor
[udp:collectd.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:48197
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for collectd default
[udp:http.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:44614
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for http default
[udp:web_archiving.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:43578
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for web_archiving default
[udp:graphite.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:43276
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for graphite default
[udp:jmxtrans_prr.default] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:43669
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for jmxtrans_prr default
[udp:jmxtrans.autogen] 2016/08/02 19:03:45 I! Started listening on UDP: [::1]:60522
[influxdb] 2016/08/02 19:03:45 I! started UDP listener for jmxtrans autogen
[srv] 2016/08/02 19:03:45 D! opened service: *influxdb.Service
[srv] 2016/08/02 19:03:45 D! opening service: *storage.Service
[srv] 2016/08/02 19:03:45 D! opened service: *storage.Service
[srv] 2016/08/02 19:03:45 D! opening service: *task_store.Service
[task_store] 2016/08/02 19:03:45 W! could not open old boltd for task_store. Not performing migration. Remove the `task_store.dir` configuration to disable migration.
[task_store] 2016/08/02 19:03:45 D! starting enabled task on startup log
[task_master] 2016/08/02 19:03:45 D! Starting task: log
[task_master] 2016/08/02 19:03:45 I! Started task: log
[task_master] 2016/08/02 19:03:45 D! digraph log {
stream0 -> from1;
from1 -> log2;
}
[task_store] 2016/08/02 19:03:45 D! started task during startup log
[task_store] 2016/08/02 19:03:45 D! starting enabled task on startup qa-vcor-amq-dlq
[task_master] 2016/08/02 19:03:45 D! Starting task: qa-vcor-amq-dlq
[task_master] 2016/08/02 19:03:45 I! Started task: qa-vcor-amq-dlq
[task_master] 2016/08/02 19:03:45 D! digraph qa-vcor-amq-dlq {
stream0 -> from1;
from1 -> alert2;
}
[task_store] 2016/08/02 19:03:45 D! started task during startup qa-vcor-amq-dlq
[srv] 2016/08/02 19:03:45 D! opened service: *task_store.Service
[srv] 2016/08/02 19:03:45 D! opening service: *replay.Service
[srv] 2016/08/02 19:03:45 D! opened service: *replay.Service
[srv] 2016/08/02 19:03:45 D! opening service: *alerta.Service
[srv] 2016/08/02 19:03:45 D! opened service: *alerta.Service
[srv] 2016/08/02 19:03:45 D! opening service: *stats.Service
[stats] 2016/08/02 19:03:45 I! opened service
[srv] 2016/08/02 19:03:45 D! opened service: *stats.Service
[srv] 2016/08/02 19:03:45 D! opening service: *reporting.Service
[srv] 2016/08/02 19:03:45 D! opened service: *reporting.Service
[srv] 2016/08/02 19:03:45 D! opening service: *httpd.Service
[httpd] 2016/08/02 19:03:45 I! Starting HTTP service
[httpd] 2016/08/02 19:03:45 I! Authentication enabled: false
[httpd] 2016/08/02 19:03:45 I! Listening on HTTP: [::]:9092
[srv] 2016/08/02 19:03:45 D! opened service: *httpd.Service
[run] 2016/08/02 19:03:45 I! Listening for signals
ejether commented 8 years ago

I upgraded to the beta and the issue seems to have gone away. Do you know the timeline for version 1 stable release?

[root@qa-alert-01 ~]# kapacitor version
Kapacitor 1.0.0~beta3 (git: master d0485013e30c91413d2082b2bd5842d2a636efff)
nathanielc commented 8 years ago

@ejether Glad to see the issue is resolved. The 1.0-beta releases switched to using HTTP as the transport instead of UDP, which is probably why it magically started working.

At this point I would consider the 1.0-beta4 release to be more stable than the 0.13.1 release. There have been significant bug fixes since 0.13.1 was released, and not too many large features have been added. We are still in beta since we are not quite ready yet to mark it as a 1.0 stable release. As for when 1.0 will be ready for release: soon ;). At this point there are no more planned changes that need to happen, only testing and fixing any bugs that come up.
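For anyone landing here later: the HTTP-vs-UDP switch mentioned above is controlled per InfluxDB connection in kapacitor.conf. A sketch of the relevant section, with key names from memory of the 1.0 config and worth double-checking against your release's documentation:

```toml
[[influxdb]]
  enabled = true
  name = "localhost"
  urls = ["http://localhost:8086"]
  # The 1.0 betas default subscriptions to HTTP instead of UDP
  subscription-protocol = "http"
```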