omers opened this issue 4 years ago
Same issue.
But it seems to occur only with threshold alerts, at least for me.
InfluxDB 2.0.0-beta.12 (git: ff620782eb) build_date: 2020-07-01T10:38:19Z
In some cases it seems random: some checks are sent, others are not.
The check works well; I can see the row in the check's history, but I don't see the corresponding row in the notification history.
Same issue. influxdb version:
influxdb_2.0.0-beta.13_linux_amd64
The very same issue here. :disappointed:
# influxd2 version
InfluxDB 2.0.0-beta.13 (git: 86796ddf2d) build_date: 2020-07-08T11:07:50Z
I don't see anything in the notification tab (but checks seem to work). It also seems that alerts are being scheduled somehow (at least the last running time seems to get updated regularly) and perhaps even run (?), but nothing really happens. To clarify, I am trying the HTTP push method. And of course, nothing special gets logged... :cry:
How can I debug it to find out more, please? :grin:
Any feedback from the Influx team?
@omers No.
I encountered the same problem during the past week. The notification rules seem to work only for checks created before the rule was created. Newly made checks don't fire notifications.
To "fix" it, you have to do a PUT (PATCH doesn't work) request on the rule, with the same data, effectivly overriding it with itself and that seems to refresh the list of checks that the rule is looking for.
+1
I see the same error on my end, as I've reported here: https://community.influxdata.com/t/notification-when-status-changes-from-ok-crit/13847/9 I tried multiple configs:
+1 we're running version=2.0.0-beta.15, statuses are being written by checks, the statuses are changing, but notifications using the 'changes from' type are not firing.
notifications for 'is equal to' work as expected.
i tried 're-PUTting' the rule as per @Pupix's suggestion (via the UI), it makes no difference.
it seems this is the exact same problem as: https://github.com/influxdata/influxdb/issues/17809
seems to me the issue should be reopened...
-ivan
adding to my previous comment, after playing with different intervals for the notification rule and check interval i believe that this is related to:
https://github.com/influxdata/influxdb/issues/18284
indeed setting the notification rule to ~1m9s, with a check interval of 30s, ensures that there is 'always' (almost) a check that has run in the notification window, and i am receiving state change notifs (via http).
this setup seems fragile and i'm not sure we're able to rely on it for production use however... will continue experimenting.
following on from this, we have now modified our config so that the checks run every 10s, and the notification every 30s, but there are still notifications being missed.
executing the following:
import "influxdata/influxdb/monitor"
monitor.from(start: -30s)
|> filter(fn: (r) => r["line_code"] == "LI-XXXX")
|> monitor.stateChanges(toLevel: "crit")
results in the 'correct' list of state changes, but not all of these changes result in a notification.
is there a way to get more debug information for the notification sending, to try to understand where the ones that are not being sent are failing?
@ivanpricewaycom I couldn't find any option to debug alerts and rules. So I guess we need to wait for a response from the Influx team.
I've observed the same issue in my setup with the 'changes from' notification rule. For 'is equal to', notifications are sent to the http endpoint (however, interestingly, I am always receiving the same event 3 times with the same timestamp).
I am still trying to find out if the workaround from @ivanpricewaycom works. Unfortunately, without much success so far :(.
I am using beta.16 version.
# influxd version
InfluxDB 2.0.0-beta.16 (git: 50964d732c) build_date: 2020-08-07T20:18:07Z
@mhall119 Any ETA on fixing this issue? I have confirmed that alerts and notifications are working on the 2.0.0-beta.9 branch. Is there a way that I can build a docker image for that tag?
@abhi1693 right now all efforts are on finishing the storage engine change; after that I think there is a new version of Flux that's ready to be added, which might have a fix for this.
@mhall119 Thanks for replying so soon. Is there a timeline on this?
The storage engine change is currently in the works, and I think that's supposed to land in the next week or so. After that I'm not as sure on the schedule, but you can ask in our Slack in the #influxdb-v2 channel.
I am able to confirm the following in the current beta-16 build:
We will have to investigate what is going on.
we merged a fix for this recently: https://github.com/influxdata/influxdb/pull/19392
Summary: when a check's data is perfectly aligned on a boundary with the notification rule's schedule, we had a bug that trimmed off both the starting and ending points of the alerting time range. The fix makes sure that one side of the range is always accepted, thus ensuring that no rows are missed.
Unfortunately, it requires opening/saving each notification rule in order to regenerate the correct code. We are looking into migration solutions for users with a large number of notification rules.
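To illustrate the boundary behaviour described above, here is a hypothetical sketch (not the actual generated rule code) against the _monitoring system bucket. Flux's range() includes its start and excludes its stop, so a status written exactly on a window boundary belongs to the next window; the bug trimmed both ends, so such a point was picked up by neither run:
// a status written exactly at 10:01:00Z is excluded here (stop is exclusive)
// and should instead be caught by the next run's window starting at 10:01:00Z
from(bucket: "_monitoring")
|> range(start: 2020-08-24T10:00:00Z, stop: 2020-08-24T10:01:00Z)
|> filter(fn: (r) => r._measurement == "statuses")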
ok great news, as soon as beta.17 (docker image) is released we'll be testing this.
I was receiving alerts via slack in beta 16.
@aanthony1243 Any ETA on closing this bug and releasing another beta version?
This bug has been fixed in the latest beta. I have been able to receive alerts via PagerDuty and Slack.
@tiny6996 There hasn't been a release since Aug 8 for v2. Can you please confirm which version you are referring to?
Given that the fix that @aanthony1243 refers to was merged on 24/8, it is clearly not included in beta 16, so @tiny6996 must be referring to a different problem. I'm still waiting for beta 17 to see if the fix addresses the problem we're experiencing.
@ivanpricewaycom we plan to have another release in the next few weeks. We had to make major changes to the storage and query engines so it has taken longer than we'd like.
sorry, my previous comment was based on beta 14. I found a bug in my own ansible deployment playbook where influxdb was not updating.
coming back to this issue, we installed rc0 and now have influx receiving >200 messages per second, storing the state of around 12K objects.
we are still observing notifications not being sent, a concrete example:
from(bucket: "_monitoring")
|> range(start: 2020-10-13T13:40:00Z, stop: 2020-10-13T13:50:00Z)
|> filter(fn: (r) => r["_measurement"] == "statuses")
|> filter(fn: (r) => r["line_code"] == "LI-TUYYP")
|> keep(columns: [
"_time",
"_notification_rule_id",
"_notification_rule_name",
"_notification_endpoint_id",
"_notification_endpoint_name",
"_level",
"_sent"])
|> group()
|> sort(columns: ["_time"], desc: true)
returns the 'correct' timeline:
_time | _level
2020-10-13T13:46:20Z | ok
2020-10-13T13:45:50Z | ok
2020-10-13T13:45:30Z | crit
2020-10-13T13:44:50Z | ok
2020-10-13T13:44:20Z | ok
the following query:
from(bucket: "_monitoring")
|> range(start: 2020-10-13T13:40:00Z, stop: 2020-10-13T13:50:00Z)
|> filter(fn: (r) => r["_measurement"] == "notifications")
|> filter(fn: (r) => r["line_code"] == "LI-TUYYP")
|> group()
returns only 1 notification: level "ok" at 2020-10-13T13:46:00Z (which we did receive).
_time | _level | _measurement | _sent
2020-10-13T13:46:00Z | ok | notifications | false
So my questions are: where is the notification for the CRIT at 13:45:30, and why is _sent = false for the notification that we did receive?
Or, perhaps more importantly, how can I debug this myself? The notifications are a bit of a black box; I'd like to see logs showing whether a check finishes in the appropriate time and whether errors are encountered with notif endpoints. Is this foreseen in the roadmap?
thanks for any pointers
-ivan
@ivanpricewaycom notification rules are stored in the metadata as generated flux code. After upgrading it will be necessary to open + re-save the notification rule before the fix is applied. You will have to do this for all your rules.
Something's still broken even in v2.0.1. I have a check (running every 5 min) with about 30 series, out of which 2 are reporting a critical state.
When I have a notification rule defined with a "When status changes from" (OK > Any or OK > Crit) condition, it's not executed at all. Only the "is equal to" condition works properly and sends notifications. However, a similar check with only two series gets executed properly, even with the status-change condition. Does anyone have the same observations? Can I debug it somehow?
yo, i feel your pain @pavleec, difficult to know what to do next. I was helped greatly by this post:
https://community.influxdata.com/t/notifications-sent-column-is-false-why/16324/2?u=ivanpricewaycom
and the dashboard link that @Anaisdg provided. It helped me build the correct queries (see above ^) to (kinda) understand where the problem was.. e.g. is the event not being registered, or is the notif not being generated.
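for reference, a minimal sketch of the kind of query that helped me check whether a notification row was generated at all and whether it was marked as sent (column names as in the tables above; this assumes _sent is stored as the string "false" for undelivered notifications, which is how it shows up in my results):
from(bucket: "_monitoring")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "notifications")
|> filter(fn: (r) => r._sent == "false")
|> keep(columns: ["_time", "_level", "_notification_rule_name", "_sent"])
|> group()
|> sort(columns: ["_time"], desc: true)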
there is definitely a problem whereby 'too much' data results in events being silently dropped.
we observed skipped events occurring a lot when running every 10s, less at every 30s, and none at every 60s, but these numbers depend entirely on data volume and compute power i suppose. The worrying thing is that (it seems that) those dropped events/notifs really are dropped; there is no log or way to know other than analysing the results of your notif endpoint.
the logs in the GUI are almost unusable as soon as the volume increases; the queries enable a more targeted analysis.
i would like to have time to build a docker project to help reproduce this for the influx devs but i haven't found the time yet.
@ivanpricewaycom thanks for the detailed explanation, however I'm not sure if it's the same bug. I've imported "The task summary dashboard", which shows no errors and completion times for all tasks below 1s. Is there any other place I could check?
I also experience this problem. My observations: manually performing the (task) query that is created for the alert task will trigger a notification. Adding the same query as a task will make sure that the alerts pop up in the Alert History, but will not trigger a notification. Deadman switches created via the Alerts menu do seem to trigger notifications normally, while threshold alerts do not.
I am mostly puzzled by the fact that manually performing the alert query will trigger the notification while performing it as a task will not.
yeah sorry i don't have any other suggestions for you @pavleec , what we need here is better debug visibility on the task / notif system as a whole to understand where the blockages are.
a sandbox environment on a publicly-available influx instance would be useful also to help share the problem with influx devs.
After some more debugging, the issue seemed to be that we had configured a notification rule that triggered upon:
(1) When status changes from OK to ANY
Upon creating notifications for:
(2) When status is equal to INFO/CRIT/WARN
the notifications are now pushed by rule (2), with some repetition if the status stays the same. The problem seems to be that the query that is created for the notification task returns an empty result, and therefore the status never reverts back to OK. I was able to manually trigger notifications by changing the status back to OK first and then to any level that triggered rule (1). In order to make sure that there always is a value, I think we need a combination of:
|> aggregateWindow(.., createEmpty: true)
and
|> fill(value: 0)
where filling empty results only works with interpolate, which is mentioned in this issue.
I believe this https://github.com/influxdata/flux/issues/1877 is related to this.
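For illustration, here is a minimal sketch of a check-style query along these lines (bucket, measurement, field, and window are placeholders, not the exact code the UI generates):
from(bucket: "my-bucket")
|> range(start: -5m)
|> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_percent")
// emit a row for every window, even an empty one, so the status can revert to OK
|> aggregateWindow(every: 1m, fn: mean, createEmpty: true)
// replace the null values produced by empty windows with an explicit 0
|> fill(value: 0.0)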
Hi, this is still a problem. I have a few rules that rely on the status change, as reporting when a status equals crit et al. would be too repetitive. However, the alerts do not trigger and no notifications are sent.
~This big TODO might be the reason why #19392 didn't fix this issue~ Red herring, looks like that function is unused...
Just upgraded from 1.x to 2.0.4 (git: 4e7a59bb9a) build_date: 2021-02-08T17:47:02Z and am experiencing this same issue. I can see that they hit crit correctly in the check status, but Slack notifications for OK (or ANY) -> CRIT do not fire off. I do see CRIT -> OK fire notifications, though. I really want to stick with 2.0, but it's looking like back to 1.x until this is fixed.
same behavior with 2.0.6:
OK -> CRIT - does not send notifications
CRIT -> OK - sends notifications
CRIT - sends notifications
Well this has become a real blocker. We are evaluating their managed cloud service, and the whole alerting system just does not work!!
I think that version 2.0 is not ready at all for production usage.
Can anyone advise?
I've been trying for the past two hours trying different combinations, and alerts are just broken.
I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means that the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.
Totally useless.
Their managed cloud has the same problem and therefore it is useless too.
Rolled back to 1.8
hi all, thank you for your comments. we are very aware that our checks and alerting UI needs some improvement, and we are in the process of making those changes.
if you haven't already, check out this blog post for a detailed description of what's going on behind the scenes: https://www.influxdata.com/blog/influxdbs-checks-and-notifications-system/
long story short, alerts are just customized tasks behind the scenes, and you can customize them however you like. Today, our UI is limited in what you can build, but that should be changing soon.
we have documentation for building custom alerting as well which can also help troubleshoot alerts not firing: https://docs.influxdata.com/influxdb/cloud/monitor-alert/custom-checks/
i understand the frustration with the current setup and we are taking steps to make the process easier. thank you!
Hi contributors, and everyone here who is bothered by this problem,
It's surprising to find that the problem I've encountered is more than a year old. After digging into notification/rule/rule.go, especially the func increaseDur, I've got some thoughts about the cause.
Firstly, my conclusion is: when the interval of a check is >= that of a notification rule, its status transitions might be discarded.
After reading the comment above func increaseDur and having a look at #1877, I realized that we're filtering check results by the interval of the rules. Consider a case where we have 1 check/1h and 1 rule/1s. Every second we'll check statuses within the last 2s according to the code, which will get 0 or 1 records, from which no transition can be constructed. So the rule never fires.
But when it comes to the same interval (the = in >=), things become tricky. Consider a check and a rule, both with a 5s interval. We will have statuses at 0 / 5 / 10 / 15 ...s. In this case, the rule fires almost simultaneously with the check. If at 10s the rule queries the db before the check's status is written, the system loses the point at 10s. But will it get both 0s and 5s?
After looking into notification/rule/http_test.go, I've found there's an experimental["subDuration"](from: now(), d: 1h). The check records are always saved on an exact second with no milliseconds; however, the function now() is not. This leads to a mismatch for the point at 0s, with a very small difference in time, which is the execution time of the rule. This way, the point at the 0s position in this case will always be filtered out. And if the check didn't finish writing its status before the rule was executed, the rule will also fail to fire at that point.
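To illustrate the hypothesis, a sketch only (the exact generated rule code may differ), using the _monitoring bucket from the queries above:
import "experimental"

// statuses are written on exact seconds (e.g. 13:45:00.000Z), but now()
// carries sub-second precision (e.g. 13:45:10.123Z), so a window derived
// from now() starts at 13:45:00.123Z and the point written at exactly
// 13:45:00Z falls just outside it
from(bucket: "_monitoring")
|> range(start: experimental.subDuration(from: now(), d: 10s), stop: now())
|> filter(fn: (r) => r._measurement == "statuses")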
It has taken me several hours of investigating weird behaviors of the monitor system to come up with these ideas. The hypothesis held up in a few tests, but it's late in my local time, and I might just be getting myself into chaos. I'm not familiar with Go; my apologies for the time wasted if I misunderstood the code.
Thank you all. <3
@TechCiel Please take a look at https://github.com/influxdata/flux/issues/3807 where I'm proposing a patch for the issue. If you could test that out on your side, it would be useful feedback.
any updates on this?
@lukasvida https://github.com/influxdata/flux/issues/3807 was fixed in v2.0.9 so I believe that should resolve some of the problems observed here
I have a problem with ANY -> OK notifications.
My checks were not running (idk why) but i re-wrote them using tasks and now they are changing statuses correctly and at correct intervals. They are threshold checks.
Whenever the status changes to CRIT, a notification is fired, but when that same status changes back to OK after the next task run, no notification is fired.
I'm using two notification rules: one is "is equal to CRIT" and the other one is "ANY -> OK". The latter one is running with an offset 5s larger than the first one, and sometimes it does not fire. Any help?
EDIT: I'm using version InfluxDB 2.1.1 (git: 657e1839de) on Docker.
I'm getting pretty much the same issue with 2.5: I can see states changing, but notifications are rarely sent out. On a test alert that is firing every couple of minutes, the history shows the last notification event over an hour ago. This is a new install, with only one event configured.
This is the third major issue that I see still open since 2020. It took us two days to overcome limitations such as the lack of SMTP support and no official Teams connector, but it looks like this issue is going to be a rollback for us. From what I can see, Influx 2.x has been at a "take it or leave it" stage for the past two years, and I would advise anyone even remotely considering going to 2.x to do full-scale testing first and only then migrate. Even ridiculously easy implementations such as SMTP support are getting dismissed as an exotic feature, even though it was available in Kapacitor, which was supposed to be integrated into InfluxDB. Teams is also a third-party module without guaranteed support, although it takes 10 lines of code to implement. The list can go on.
I set up influxdb2 to send alerts to our Slack channel. I only see alerts that were sent 4 days ago, and no alerts are received in the Slack channel.
For debugging, I created a Flask app to listen as an HTTP endpoint and created an HTTP notification rule to route all the alerts to the Flask app. I see no requests.
Steps to reproduce:
Expected behavior: Alerts will be delivered.
Actual behavior: No alerts are delivered to the Slack channel or the HTTP endpoint.
Environment info:
Config: /usr/sbin/influxd run --engine-path /influx/engine --bolt-path /influx/boltdb.db --http-bind-address 127.0.0.1:9999 --log-level info
The last time alerts were sent was 4 days ago: