influxdata / kapacitor

Open source framework for processing, monitoring, and alerting on time series data
MIT License

Ping Alert does not seem to work #2376

Open tymac753 opened 4 years ago

tymac753 commented 4 years ago

I have the TICK script below registered in Kapacitor, and the IP in question is not up, so I assume the alert should immediately fire, but no alert is generated. Kapacitor is on its own VM, separate from Chronograf and Influx, which each have their own VMs as well. The output of `kapacitor stats ingress` lists stats for _kapacitor but nothing for chronograf or telegraf, which makes me wonder if things are configured correctly or if this is a bug. The DB connection does show up green for the Kapacitor connection in the Chronograf configuration, however.

Chronograf version 1.1.11, Influx version 1.8.0, Kapacitor version 1.5.5

```
var db = 'newyork'

var rp = 'autogen'

var measurement = 'ping'

var groupBy = []

var whereFilter = lambda: ("policy_group" == 'ny_vmware_prod') AND ("policy_name" == 'ny_ping_server') AND ("url" == '10.1.2.3')

var name = 'NY IP 10.1.2.3 Down'

var idVar = name

var message = ' {{.ID}} is {{.Level}} value: {{ index .Fields "value" }} '

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var crit = 90

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "percent_packet_loss")
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/var/log/chronograf/alerts.log')
        .log('/var/log/kapacitor/alerts.log')
        .log('/var/log/influxdb/alerts.log')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')
```

bogski87 commented 4 years ago

Hi @tymac753

Could you try changing `var groupBy = []` to `var groupBy = ['*']`?

If I recall correctly, in order to use/filter tag values in TICK you have to include them in the group-by clause. Then, when those tags come back and the values match your whereFilter, you should get an alert.
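For reference, applied to the poster's own `from()` node, the suggested change would look like this (a sketch, reusing the variables from the script above):

```
var groupBy = ['*']

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        // grouping by all tags keeps tag values like "url" attached to the stream
        .groupBy(groupBy)
        .where(whereFilter)
```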

TyMac commented 4 years ago

Did not seem to change anything. No alerts are displayed.

bogski87 commented 4 years ago

Strange, let me check my templates

Is there any output from the kapacitor CLI if you check the task?

`sudo kapacitor show task_name` on Ubuntu.

We use a lot of ping sensors, so the plugin definitely works.

tymac753 commented 4 years ago

There is. Here is the output:

```
kapacitor show chronograf-v1-e94d8589-19c0-46a9-836e-9d301c16ccf3
ID: chronograf-v1-e94d8589-19c0-46a9-836e-9d301c16ccf3
Error:
Template:
Type: stream
Status: enabled
Executing: true
Created: 29 Jul 20 15:18 UTC
Modified: 31 Jul 20 13:25 UTC
LastEnabled: 31 Jul 20 13:25 UTC
Databases Retention Policies: ["newyork"."autogen"]
TICKscript:
var db = 'newyork'

var rp = 'autogen'

var measurement = 'ping'

var groupBy = ['*']

var whereFilter = lambda: ("policy_group" == 'ny_vmware_prod') AND ("policy_name" == 'ny_ping_server') AND ("url" == '10.1.2.3')

var name = 'NY IP 10.1.2.3 Down'

var idVar = name

var message = ' {{.ID}} is {{.Level}} value: {{ index .Fields "value" }} '

var idTag = 'alertID'

var levelTag = 'level'

var messageField = 'message'

var durationField = 'duration'

var outputDB = 'chronograf'

var outputRP = 'autogen'

var outputMeasurement = 'alerts'

var triggerType = 'threshold'

var crit = 90

var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "percent_packet_loss")
        .as('value')

var trigger = data
    |alert()
        .crit(lambda: "value" > crit)
        .message(message)
        .id(idVar)
        .idTag(idTag)
        .levelTag(levelTag)
        .messageField(messageField)
        .durationField(durationField)
        .log('/var/log/chronograf/alerts.log')
        .log('/var/log/kapacitor/alerts.log')
        .log('/var/log/influxdb/alerts.log')

trigger
    |eval(lambda: float("value"))
        .as('value')
        .keep()
    |influxDBOut()
        .create()
        .database(outputDB)
        .retentionPolicy(outputRP)
        .measurement(outputMeasurement)
        .tag('alertName', name)
        .tag('triggerType', triggerType)

trigger
    |httpOut('output')

DOT:
digraph chronograf-v1-e94d8589-19c0-46a9-836e-9d301c16ccf3 {
graph [throughput="0.00 points/s"];

stream0 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
stream0 -> from1 [processed="0"];

from1 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
from1 -> eval2 [processed="0"];

eval2 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
eval2 -> alert3 [processed="0"];

alert3 [alerts_inhibited="0" alerts_triggered="0" avg_exec_time_ns="0s" crits_triggered="0" errors="0" infos_triggered="0" oks_triggered="0" warns_triggered="0" working_cardinality="0" ];
alert3 -> http_out6 [processed="0"];
alert3 -> eval4 [processed="0"];

http_out6 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];

eval4 [avg_exec_time_ns="0s" errors="0" working_cardinality="0" ];
eval4 -> influxdb_out5 [processed="0"];

influxdb_out5 [avg_exec_time_ns="0s" errors="1" points_written="0" working_cardinality="0" write_errors="0" ];
}
```

docmerlin commented 4 years ago

@tymac753 I could be wrong, but it looks like nothing is getting past the first eval. Maybe the filter is filtering everything out, or data isn't getting to the Kapacitor instance? Try logging before it hits the filter to see whether you are receiving any data at all; if you are, then try logging after your filter to see whether that is where the problem is.

tymac753 commented 4 years ago

I'm not sure how to configure that. Can you provide a sample of what you mean by adding logging first? If it's any help, the default "ping" dashboard does show data in its charts.

bogski87 commented 4 years ago

You can capture logging by adding `|log()` at the end of each node; it should show you what Kapacitor is doing when data arrives. But if data isn't being processed, it might not show much.

If you run `sudo tail -f /var/log/kapacitor/kapacitor.log`, you should be able to see whether Kapacitor is happy or if there are any errors. Also check the InfluxDB logs for any connection errors between Influx and Kapacitor. If the dashboard is showing data, then the problem lies between Influx and Kapacitor.

You could default the where filter to check whether that's the problem:

```
var whereFilter = lambda: TRUE
```

tymac753 commented 4 years ago

Hmm, tailing the log I see a lot of this repeating:

```
ts=2020-08-06T12:22:47.166Z lvl=error msg="failed to connect to InfluxDB, retrying..." service=influxdb cluster=nyinfluxdb err="invalid character 'C' looking for beginning of value"
```

tymac753 commented 4 years ago

Ok, one issue was that I had not reconfigured Kapacitor for HTTPS after adding a cert, but even after reconfiguring that and seeing the session connect, I still have no alerts.

I changed:

```
var whereFilter = lambda: ("policy_group" == 'ny_vmware_prod') AND ("policy_name" == 'ny_ping_server') AND ("url" == '10.1.2.3')
```

to:

```
var whereFilter = lambda: TRUE
```

and I see nothing in the logs... and I'm obviously not getting what you're saying about "`|log()` at the end of each node", because when I add:

```
var whereFilter = lambda: TRUE |.log()
```

I get:

```
invalid TICKscript: parser: unexpected | line 14 char 2 in " |.log()". expected: "number","string","duration","identifier","TRUE","FALSE","==","(","-","!"
```

bogski87 commented 4 years ago

Hi, to add the logging on a node it should just be `|log()`, with no period; the syntax changes depending on where you add it. In your alert node the `.log()` syntax is right, since it's a property chained onto the alert node. As I say though, if no data is processed, it won't show much.

Regarding SSL, is this a self-signed certificate or bought from a CA? If it's self-signed, then you will need to set `insecure-skip-verify = true` in the `[[influxdb]]` section of the Kapacitor config.
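For reference, a minimal sketch of what that section of kapacitor.conf might look like with a self-signed cert (the hostname is a placeholder; adjust for your environment):

```
[[influxdb]]
  enabled = true
  default = true
  urls = ["https://your_influx_db:8086"]
  # skip TLS verification for a self-signed certificate
  insecure-skip-verify = true
```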

Could you paste the `[[influxdb]]` section of your Kapacitor config? Obviously remove any sensitive information.

Just to try and clarify the `|log()` node and where to add it:

```
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |log()
```

That should log everything from the `from` node. You can add it to the end of other nodes in the same way, e.g. at the end of the `|eval()` node if you wanted to see what was happening there. So if you were evaluating two fields and it was failing, you could try it like this and see what the issue is:

```
|eval(lambda: "some_field" / "other_field")
    .as('calculation')
    .keep('some_field', 'other_field')
|log()
```

tymac753 commented 4 years ago

Still not seeing any logging after adding `|log()` like you have above in the `var data = stream` block. I should see something in /var/log/kapacitor/kapacitor.log, correct? The only thing I did briefly see was "unable to collect logs" when I restarted the kapacitor service; that popped up in the Chronograf GUI for a second.

bogski87 commented 4 years ago

Ok, as I expected. I don't think you'll see much from `|log()`, as no data is being processed.

In the log file /var/log/kapacitor/kapacitor.log you should be able to see everything Kapacitor is doing. I'm not sure whether "unable to collect logs" means the Kapacitor logs in general or just the `|log()` output.

Regarding your SSL: is it self-signed?

tymac753 commented 4 years ago

I think that might have just been due to restarting the service, as I have not seen it since. Here's my kapacitor log output:

```
ts=2020-08-10T12:09:01.791Z lvl=info msg="started task" service=kapacitor task_master=main task=chronograf-v1-6ee652a1-1abd-4951-b0df-ac259eb45654
ts=2020-08-10T12:09:01.805Z lvl=info msg="started task" service=kapacitor task_master=main task=chronograf-v1-94f30df0-0d27-46b7-99d5-776ddb42f311
ts=2020-08-10T12:09:01.822Z lvl=info msg="started task" service=kapacitor task_master=main task=chronograf-v1-a8af48b8-6a7d-41ae-8f83-e81eccd6b159
ts=2020-08-10T12:09:01.839Z lvl=info msg="started task" service=kapacitor task_master=main task=chronograf-v1-ac16190f-9261-48ec-998c-4eed5b989cf4
ts=2020-08-10T12:09:01.854Z lvl=info msg="started task" service=kapacitor task_master=main task=chronograf-v1-e94d8589-19c0-46a9-836e-9d301c16ccf3
ts=2020-08-10T12:09:01.854Z lvl=info msg="starting HTTP service" service=http
ts=2020-08-10T12:09:01.854Z lvl=info msg=authentication service=http enabled=false
ts=2020-08-10T12:09:01.855Z lvl=info msg="Starting target manager..." service=scraper
ts=2020-08-10T12:09:01.856Z lvl=info msg="listening on" service=http addr=0.0.0.0:9092 protocol=http
ts=2020-08-10T12:09:01.856Z lvl=info msg="listening for signals" service=run
```

The SSL cert is indeed self signed.

bogski87 commented 4 years ago

Okey doke, I think we might be getting somewhere :)

So, is the SSL cert for Influx? That is, your agent is sending data to https://your_influx_db:8086?

If you run `sudo journalctl -fu influxdb`, do you see any errors like this? [screenshot]

Just wondering whether it's a case of setting this option to true in kapacitor.conf, in the `[[influxdb]]` section of the config. [screenshot]

tymac753 commented 4 years ago

Nothing with connection refused in it:

```
Aug 11 20:44:31 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:31.675655Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:31 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:31.675674Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:31 nyinfluxdb influxd[19743]: [httpd] 10.7.50.48 - telegraf [11/Aug/2020:20:44:31 +0000] "POST /write?db=newyork HTTP/1.1" 204 0 "-" "Telegraf/1.14.4" 752f2ce8-dc13-11ea-b478-00505696b4f4 2686
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.200299Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.200355Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: [httpd] 10.5.50.27 - telegraf [11/Aug/2020:20:44:32 +0000] "POST /write?db=newyork HTTP/1.1" 204 0 "-" "Telegraf/1.14.4" 757f2b0e-dc13-11ea-b479-00505696b4f4 3344
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.500231Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.500233Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: [httpd] 10.5.50.37 - telegraf [11/Aug/2020:20:44:32 +0000] "POST /write?db=newyork HTTP/1.1" 204 0 "-" "Telegraf/1.14.4" 75acf076-dc13-11ea-b47a-00505696b4f4 2896
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.511548Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: ts=2020-08-11T20:44:32.511581Z lvl=info msg="Post http://nykapacitor:9092/write?consistency=&db=newyork&precision=ns&rp=autogen: dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving" log_id=0O8YApV0000 service=subscriber
Aug 11 20:44:32 nyinfluxdb influxd[19743]: [httpd] 10.5.50.49 - telegraf [11/Aug/2020:20:44:32 +0000] "POST /write?db=newyork HTTP/1.1" 204 0 "-" "Telegraf/1.14.4" 75aebeca-dc13-11ea-b47b-00505696b4f4 2728
```

The `[[influxdb]]` section in kapacitor.conf looks like:

```
[[influxdb]]
  enabled = true
  default = true
  name = "nyinfluxdb"
  urls = ["https://10.xxxx.xxx.xxx:8086"]
  username = "telegraf"
  password = "xxxxxxxxxx"
  timeout = 0
  insecure-skip-verify = true
  startup-timeout = "5m"
  subscription-protocol = "http"
  http-port = 0
  udp-read-buffer = 0

  [influxdb.subscriptions]

  [influxdb.excluded-subscriptions]
```

Is there anything /etc/default/kapacitor needs? I have nothing in there.

All the services are on different VMs (influxdb / chronograf /kapacitor)

bogski87 commented 4 years ago

Is the server's hostname in the Kapacitor config? It's right at the top of the config, and it needs to be resolvable by your InfluxDB VM. I don't think you need anything from /etc/default/kapacitor; all I see in ours are references to UDFs we've implemented.

tymac753 commented 4 years ago

Are you talking about `name = "nyinfluxdb"`?

The FQDN is resolvable, but not that short name, if that's what you're after.

bogski87 commented 4 years ago

This section, right at the top: [screenshot]

The `name` setting in Kapacitor's `[[influxdb]]` config is more for when you're working with multiple Influx instances: you can assign a name to one and target that second database in TICK script.

tymac753 commented 4 years ago

Ok, nothing was in that. I added the FQDN in; however, no alerts are generated yet.

bogski87 commented 4 years ago

Hi, I'm not sure you need the full domain name. Just the hostname should suffice, and DNS should do the rest. I run everything on one server.

Are there any other messages in the logs after making the change? From what I've read, the message in your previous logs

```
dial tcp: lookup nykapacitor on 10.1.2.3:53: server misbehaving
```

indicates the influx server is expecting a different host name when it tries to connect to Kapacitor. So if your server is called "kapacitor-server" but the kapacitor config has localhost defined as the host name then Influx won't connect.

The last thing I can think of trying is setting the hostname to just the Kapacitor server name, with no domain name, and then adding entries in each server's hosts file to make sure they can definitely see each other.
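For example, hypothetical /etc/hosts entries on each VM might look like this (the addresses are placeholders; use your real ones):

```
# on the InfluxDB VM
10.0.0.11   nykapacitor

# on the Kapacitor VM
10.0.0.10   nyinfluxdb
```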

I'm running out of ideas though, sorry! You might have more luck on the community forums: https://community.influxdata.com/

There are a few support staff on there and some of the community users are quite helpful.

tymac753 commented 4 years ago

Kapacitor was not using SSL to connect, so I fixed that. Still no alerts, but I no longer see those "misbehaving" logs in the influxdb journal output. I am getting this now, however:

```
http://nykapacitor:9092/write?consistency=&db=nyinfluxdb&precision=ns&rp=autogen: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)" log_id=0O8YApV0000 service=subscriber
```

bogski87 commented 4 years ago

Ok, sorry for the delay I didn't see this.

I think that might be down to the kapacitor subscriptions. Because of the changes in the kapacitor config the subscription endpoint changes.

If you connect to the influx CLI and run `show subscriptions`, you should get an output like this: [screenshot]

If you stop the Kapacitor service, you can remove these subscriptions. It does NOT delete your data.

```
DROP SUBSCRIPTION "sub0" ON "mydb"."autogen"
```

Then restart Kapacitor and it will try and connect and create a new subscription.
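Putting it together, the whole reset might look something like this (a sketch; `sub0` and `mydb` are placeholders, so substitute the subscription and database names from your own `show subscriptions` output):

```
$ sudo systemctl stop kapacitor
$ influx
> SHOW SUBSCRIPTIONS
> DROP SUBSCRIPTION "sub0" ON "mydb"."autogen"
> exit
$ sudo systemctl start kapacitor
```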

manage subscriptions

tymac753 commented 4 years ago

On the Influx CLI I saw this when I ran `show subscriptions`:

```
InfluxDB shell version: 1.8.0
> show subscriptions
name: _internal
retention_policy  name                                            mode  destinations
----------------  ----                                            ----  ------------
monitor           kapacitor-23506a45-fbcb-48f2-b4a4-e2d88291a31b  ANY   [http://nykapacitor:9092]
monitor           kapacitor-2dc9ad3f-41bd-4a53-ab24-8de8032c06e2  ANY   [https://nykapacitor:9092]

name: nyinfluxdb
retention_policy  name                                            mode  destinations
----------------  ----                                            ----  ------------
autogen           kapacitor-23506a45-fbcb-48f2-b4a4-e2d88291a31b  ANY   [http://nykapacitor:9092]
autogen           kapacitor-2dc9ad3f-41bd-4a53-ab24-8de8032c06e2  ANY   [https://nykapacitor:9092]

name: chronograf
retention_policy  name                                            mode  destinations
----------------  ----                                            ----  ------------
autogen           kapacitor-23506a45-fbcb-48f2-b4a4-e2d88291a31b  ANY   [http://nykapacitor:9092]
autogen           kapacitor-2dc9ad3f-41bd-4a53-ab24-8de8032c06e2  ANY   [https://nykapacitor:9092]
```

I removed all subscriptions from chronograf, nyinfluxdb, and _internal. Restarting Kapacitor recreated all three; however, I still do not see any alerts, and I still see the "Client.Timeout exceeded while awaiting headers" output with `journalctl -fu influxdb`.