omers opened this issue 4 years ago
Same issue.
But it seems to occur only with threshold alerts, at least for me.
InfluxDB 2.0.0-beta.12 (git: ff620782eb) build_date: 2020-07-01T10:38:19Z
In some cases it seems random: some checks are sent, others are not.
The check works well; I can see the row in the check's history, but I don't see the corresponding row in the notification history.
Same issue. influxdb version:
influxdb_2.0.0-beta.13_linux_amd64
The very same issue here. :disappointed:
# influxd2 version
InfluxDB 2.0.0-beta.13 (git: 86796ddf2d) build_date: 2020-07-08T11:07:50Z
I don't see anything in the notification tab (but checks seem to work). It also seems that alerts are being scheduled somehow (at least the last running time seems to get updated regularly) and perhaps even run (?), but nothing really happens. To clarify, I am trying the HTTP push method. And of course, nothing special gets logged... :cry:
How can I debug it to find out more, please? :grin:
Any feedback from the Influx team?
@omers No.
I encountered the same problem during the past week. The notification rules seem to work only for checks created before the rule was created. Newly made checks don't fire notifications.
To "fix" it, you have to do a PUT (PATCH doesn't work) request on the rule, with the same data, effectivly overriding it with itself and that seems to refresh the list of checks that the rule is looking for.
+1
I see the same error on my end, as I've reported here: https://community.influxdata.com/t/notification-when-status-changes-from-ok-crit/13847/9 I tried multiple configs:
+1 we're running version=2.0.0-beta.15, statuses are being written by checks, the statuses are changing, but notifications using the 'changes from' type are not firing.
notifications for 'is equal to' work as expected.
i tried 're-PUTting' the rule as per @Pupix's suggestion (via the UI), it makes no difference.
it seems this is the exact same problem as: https://github.com/influxdata/influxdb/issues/17809
seems to me the issue should be reopened...
-ivan
adding to my previous comment, after playing with different intervals for the notification rule and check interval i believe that this is related to:
https://github.com/influxdata/influxdb/issues/18284
indeed setting the notification rule to ~1m9s, with a check interval of 30s, ensures that there is 'always' (almost) a check that has run in the notification window, and i am receiving state change notifs (via http).
this setup seems fragile and i'm not sure we're able to rely on it for production use however... will continue experimenting.
following on from this, we have now modified our config so that the checks run every 10s, and the notification every 30s, but there are still notifications being missed.
executing the following:
import "influxdata/influxdb/monitor"
monitor.from(start: -30s)
|> filter(fn: (r) => r["line_code"] == "LI-XXXX")
|> monitor.stateChanges(toLevel: "crit")
results in the 'correct' list of state changes, but not all of these changes result in a notification.
is there a way to get more debug information for the notification sending, to try to understand where the ones that are not being sent are failing?
@ivanpricewaycom I couldn't find any option to debug alerts and rules. So I guess we need to wait for a response from the Influx team.
I've observed the same issue in my setup with the 'changes from' notification rule. For 'is equal to', notifications are sent to the http endpoint (however, interestingly, I am always receiving the same event 3 times with the same timestamp).
I am still trying to find out if the workaround from @ivanpricewaycom works. Unfortunately, without much success so far :(.
I am using beta.16 version.
# influxd version
InfluxDB 2.0.0-beta.16 (git: 50964d732c) build_date: 2020-08-07T20:18:07Z
@mhall119 Any ETA on fixing this issue? I have confirmed that alerts and notifications are working on the 2.0.0-beta.9 branch. Is there a way that I can build a docker image for that tag?
@abhi1693 right now all efforts are on finishing the storage engine change; after that I think there is a new version of Flux that's ready to be added, which might have a fix for this.
@mhall119 Thanks for replying so soon. Is there a timeline on this?
The storage engine change is currently in the works, and I think that's supposed to land in the next week or so. After that I'm not as sure on the schedule, but you can ask in our Slack in the #influxdb-v2 channel.
I am able to confirm the following in the current beta-16 build:
We will have to investigate what is going on.
we merged a fix for this recently: https://github.com/influxdata/influxdb/pull/19392
Summary: when a check's data is perfectly aligned on a boundary with the notification rule's schedule, we had a bug that trimmed off both the starting and ending points of the alerting time range. The fix makes sure that one side of the range is always accepted, thus ensuring that no rows are missed.
Unfortunately, it requires opening/saving each notification rule in order to regenerate the correct code. We are looking into migration solutions for users with a large number of notification rules.
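To illustrate the boundary behaviour described above, here is a hypothetical sketch (not the actual generated rule code) against the _monitoring system bucket. Flux's range() includes its start and excludes its stop, so a status written exactly on a window boundary belongs to the next window; the bug trimmed both ends, so such a point was picked up by neither run:
// a status written exactly at 10:01:00Z is excluded here (stop is exclusive)
// and should instead be caught by the next run's window starting at 10:01:00Z
from(bucket: "_monitoring")
|> range(start: 2020-08-24T10:00:00Z, stop: 2020-08-24T10:01:00Z)
|> filter(fn: (r) => r._measurement == "statuses")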
ok great news, as soon as beta.17 (docker image) is released we'll be testing this.
I was receiving alerts via slack in beta 16.
@aanthony1243 Any ETA on closing this bug and releasing another beta version?
This bug has been fixed in the latest beta. I have been able to receive alerts via PagerDuty and Slack.
@tiny6996 There hasn't been a release since Aug 8 for v2. Can you please confirm which version you are referring to?
Given that the fix that @aanthony1243 refers to was merged on 24/8, it is clearly not included in beta 16, so @tiny6996 must be referring to a different problem. I'm still waiting for beta 17 to see if the fix addresses the problem we're experiencing.
@ivanpricewaycom we plan to have another release in the next few weeks. We had to make major changes to the storage and query engines so it has taken longer than we'd like.
sorry, my previous comment was based on beta 14. I found a bug in my own ansible deployment playbook where influxdb was not updating.
coming back to this issue, we installed rc0 and now have influx receiving >200 messages per second, storing the state of around 12K objects.
we are still observing notifications not being sent, a concrete example:
from(bucket: "_monitoring")
|> range(start: 2020-10-13T13:40:00Z, stop: 2020-10-13T13:50:00Z)
|> filter(fn: (r) => r["_measurement"] == "statuses")
|> filter(fn: (r) => r["line_code"] == "LI-TUYYP")
|> keep(columns: [
"_time",
"_notification_rule_id",
"_notification_rule_name",
"_notification_endpoint_id",
"_notification_endpoint_name",
"_level",
"_sent"])
|> group()
|> sort(columns: ["_time"], desc: true)
returns the 'correct' timeline:
_time | _level
2020-10-13T13:46:20Z | ok
2020-10-13T13:45:50Z | ok
2020-10-13T13:45:30Z | crit
2020-10-13T13:44:50Z | ok
2020-10-13T13:44:20Z | ok
the following query:
from(bucket: "_monitoring")
|> range(start: 2020-10-13T13:40:00Z, stop: 2020-10-13T13:50:00Z)
|> filter(fn: (r) => r["_measurement"] == "notifications")
|> filter(fn: (r) => r["line_code"] == "LI-TUYYP")
|> group()
returns only 1 notification: level "ok" at 2020-10-13T13:46:00Z (which we did receive).
_time | _level | _measurement | _sent
2020-10-13T13:46:00Z | ok | notifications | false
So my questions are: where is the notification for the CRIT at 13:45:30, and why is _sent = false for the notification that we did receive?
Or, perhaps more importantly, how can I debug this myself? The notifications are a bit of a black box; I'd like to see logs showing whether a check finishes in the appropriate time and whether errors are encountered with notif endpoints. Is this foreseen in the roadmap?
thanks for any pointers
-ivan
@ivanpricewaycom notification rules are stored in the metadata as generated flux code. After upgrading it will be necessary to open + re-save the notification rule before the fix is applied. You will have to do this for all your rules.
Something's still broken even in v2.0.1. I have a check (running every 5 min) with about 30 series, out of which 2 are reporting a critical state.
When I have a notification rule defined with a "When status changes from" (OK > Any or OK > Crit) condition, it's not executed at all. Only the "is equal to" condition works properly and sends notifications. However, a similar check with only two series gets executed properly, even with the status-change condition. Does anyone have the same observations? Can I debug it somehow?
yo, i feel your pain @pavleec, difficult to know what to do next. I was helped greatly by this post:
https://community.influxdata.com/t/notifications-sent-column-is-false-why/16324/2?u=ivanpricewaycom
and the dashboard link that @Anaisdg provided. It helped me build the correct queries (see above ^) to (kinda) understand where the problem was.. e.g. is the event not being registered, or is the notif not being generated.
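for reference, a minimal sketch of the kind of query that helped me check whether a notification row was generated at all and whether it was marked as sent (column names as in the tables above; this assumes _sent is stored as the string "false" for undelivered notifications, which is how it shows up in my results):
from(bucket: "_monitoring")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "notifications")
|> filter(fn: (r) => r._sent == "false")
|> keep(columns: ["_time", "_level", "_notification_rule_name", "_sent"])
|> group()
|> sort(columns: ["_time"], desc: true)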
there is definitely a problem whereby 'too much' data results in events being silently dropped.
we observed skipped events occurring a lot when running every 10s, less at every 30s, and none at every 60s, but these numbers depend entirely on data volume and compute power i suppose. The worrying thing is that (it seems that) those dropped events/notifs really are dropped; there is no log or way to know other than analysing the results of your notif endpoint.
the logs in the GUI are almost unusable as soon as the volume increases; the queries enable a more targeted analysis.
i would like to have time to build a docker project to help reproduce this for the influx devs but i haven't found the time yet.
@ivanpricewaycom thanks for the detailed explanation, however I'm not sure if it's the same bug. I've imported "The task summary dashboard", which shows no errors and completion times for all tasks below 1s. Is there any other place I could check?
I also experience this problem. My observations: manually performing the (task) query that is created for the alert task will trigger a notification. Adding the same query as a task will make sure that the alerts pop up in the Alert History, but will not trigger a notification. Deadman switches created via the Alerts menu do seem to trigger notifications normally, while threshold alerts do not.
I am mostly puzzled by the fact that manually performing the alert query will trigger the notification while performing it as a task will not.
yeah sorry i don't have any other suggestions for you @pavleec , what we need here is better debug visibility on the task / notif system as a whole to understand where the blockages are.
a sandbox environment on a publicly-available influx instance would be useful also to help share the problem with influx devs.
After some more debugging, the issue seemed to be that we had configured a notification rule that triggered upon:
(1) When status changes from OK to ANY
Upon creating notifications for:
(2) When status is equal to INFO/CRIT/WARN
the notifications are now pushed by rule (2), with some repetition if the status stays the same. The problem seems to be that the query that is created for the notification task returns an empty result, and therefore the status never reverts back to OK. I was able to manually trigger notifications by changing the status back to OK first and then to any level that triggered rule (1). In order to make sure that there always is a value, I think we need a combination of:
|> aggregateWindow(.., createEmpty: true)
and
|> fill(value: 0)
where filling empty results only works with interpolate, which is mentioned in this issue.
I believe this https://github.com/influxdata/flux/issues/1877 is related to this.
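For illustration, here is a minimal sketch of a check-style query along these lines (bucket, measurement, field, and window are placeholders, not the exact code the UI generates):
from(bucket: "my-bucket")
|> range(start: -5m)
|> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_percent")
// emit a row for every window, even an empty one, so the status can revert to OK
|> aggregateWindow(every: 1m, fn: mean, createEmpty: true)
// replace the null values produced by empty windows with an explicit 0
|> fill(value: 0.0)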
Hi, this is still a problem. I have a few rules that rely on the status change, as reporting when a status equals crit et al. would be too repetitive. However, the alerts do not trigger and no notifications are sent.
~This big TODO might be the reason why #19392 didn't fix this issue~ Red herring, looks like that function is unused...
Just upgraded from 1.x to 2.0.4 (git: 4e7a59bb9a) build_date: 2021-02-08T17:47:02Z and am experiencing this same issue. I can see that they hit crit correctly in the check status, but Slack notifications for OK (or ANY) -> CRIT do not fire off. I do see CRIT -> OK fire notifications, though. I really want to stick with 2.0, but it's looking like back to 1.x until this is fixed.
same behavior with 2.0.6:
OK -> CRIT - does not send notifications
CRIT -> OK - sends notifications
CRIT - sends notifications
Well this has become a real blocker. We are evaluating their managed cloud service, and the whole alerting system just does not work!!
I think that version 2.0 is not ready at all for production usage.
Can anyone advise?
I've been trying for the past two hours trying different combinations, and alerts are just broken.
I can get equals conditions to fire (e.g. state = CRIT), but any "change" condition (e.g. ANY to CRIT) just does not want to send a notification. This means that the alerting is basically useless, as you need another interim system in between to filter notifications that have already been sent.
Totally useless.
Their managed cloud has the same problem and therefore it is useless too.
Rolled back to 1.8
hi all, thank you for your comments. we are very aware that our checks and alerting UI needs some improvement, and we are in the process of making those changes.
if you haven't already, check out this blog post for a detailed description of what's going on behind the scenes: https://www.influxdata.com/blog/influxdbs-checks-and-notifications-system/
long story short, alerts are just customized tasks behind the scenes, and you can customize them however you like. Today, our UI is limited in what you can build, but that should be changing soon.
we have documentation for building custom alerting as well which can also help troubleshoot alerts not firing: https://docs.influxdata.com/influxdb/cloud/monitor-alert/custom-checks/
i understand the frustration with the current setup and we are taking steps to make the process easier. thank you!
Hi contributors, and everyone here who is bothered by this problem,
It's surprising to find that the problem I've encountered is more than a year old. After digging into notification/rule/rule.go, especially the func increaseDur, I've got some thoughts about the cause.
Firstly, my conclusion is: when the interval of a check is >= that of a notification rule, its status transitions might be discarded.
After reading the comment above func increaseDur and having a look at #1877, I realized that we're filtering check results by the interval of the rules. Consider a case where we have 1 check/1h and 1 rule/1s. Every second we'll check statuses within the last 2s according to the code, which will get 0 or 1 records, from which no transition can be constructed. So the rule never fires.
But when it comes to the same interval (the = in >=), things become tricky. Consider a check and a rule, both with a 5s interval. We will have statuses at 0 / 5 / 10 / 15 ...s. In this case, the rule fires almost simultaneously with the check. If at 10s the rule queries the db before the check's status is written, the system loses the point at 10s. But will it get both 0s and 5s?
After looking into notification/rule/http_test.go, I've found there's an experimental["subDuration"](from: now(), d: 1h). The check records are always saved on an exact second with no milliseconds; however, the function now() is not. This leads to a mismatch for the point at 0s, with a very small difference in time, which is the execution time of the rule. This way, the point at the 0s position in this case will always be filtered out. And if the check didn't finish writing its status before the rule was executed, the rule will also fail to fire at that point.
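To illustrate the hypothesis, a sketch only (the exact generated rule code may differ), using the _monitoring bucket from the queries above:
import "experimental"

// statuses are written on exact seconds (e.g. 13:45:00.000Z), but now()
// carries sub-second precision (e.g. 13:45:10.123Z), so a window derived
// from now() starts at 13:45:00.123Z and the point written at exactly
// 13:45:00Z falls just outside it
from(bucket: "_monitoring")
|> range(start: experimental.subDuration(from: now(), d: 10s), stop: now())
|> filter(fn: (r) => r._measurement == "statuses")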
It has taken me several hours of investigating weird behaviors of the monitor system to come up with these ideas. The hypothesis held up in a few tests, but it's late in my local time, and I might just be getting myself into chaos. I'm not familiar with Go; my apologies for the time wasted if I misunderstood the code.
Thank you all. <3
@TechCiel Please take a look at https://github.com/influxdata/flux/issues/3807 where I'm proposing a patch for the issue. If you could test that out on your side, it would be useful feedback.
any updates on this?
@lukasvida https://github.com/influxdata/flux/issues/3807 was fixed in v2.0.9 so I believe that should resolve some of the problems observed here
I have a problem with ANY -> OK notifications.
My checks were not running (idk why) but i re-wrote them using tasks and now they are changing statuses correctly and at correct intervals. They are threshold checks.
Whenever the status changes to CRIT, a notification is fired, but when that same status changes back to OK after the next task run, no notification is fired.
I'm using two notification rules: one is "is equal to CRIT" and the other one is "ANY -> OK". The latter one is running with an offset 5s larger than the first one, and sometimes it does not fire. Any help?
EDIT: I'm using version InfluxDB 2.1.1 (git: 657e1839de) on Docker.
I'm getting pretty much the same issue with 2.5: I can see states changing, but notifications are rarely sent out. On a test alert that is firing every couple of minutes, the history shows the last notification event over an hour ago. This is a new install, with only one event configured.
This is the third major issue that I see still open since 2020. It took us two days to overcome limitations such as the lack of SMTP support and no official Teams connector, but it looks like this issue is going to be a rollback for us. From what I can see, Influx 2.x has been at a "take it or leave it" stage for the past two years, and I would advise anyone even remotely considering going to 2.x to do full-scale testing first and only then migrate. Even ridiculously easy implementations such as SMTP support are getting dismissed as an exotic feature, even though it was available in Kapacitor, which was supposed to be integrated into InfluxDB. Teams is also a third-party module without guaranteed support, although it takes 10 lines of code to implement. The list can go on.
I set up influxdb2 to send alerts to our Slack channel. I only see alerts that were sent 4 days ago, and no alerts are received in the Slack channel.
For debugging, I created a Flask app to listen as an HTTP endpoint and created an HTTP notification rule to route all the alerts to the Flask app. I see no requests.
Steps to reproduce:
Expected behavior: Alerts will be delivered.
Actual behavior: No alerts are delivered to the Slack channel or the HTTP endpoint.
Environment info:
Config: /usr/sbin/influxd run --engine-path /influx/engine --bolt-path /influx/boltdb.db --http-bind-address 127.0.0.1:9999 --log-level info
The last time alerts were sent was 4 days ago: