grafana / timestream-datasource

Amazon Timestream in Grafana
https://grafana.com/grafana/plugins/grafana-timestream-datasource
Apache License 2.0

Alert queries should automatically wait for all queries #275

Closed by sysadmin1139 5 months ago

sysadmin1139 commented 7 months ago

What happened:

We updated to 2.9.1 last week, which was fine. Earlier this week we made some adjustments to our metrics ingestion, which resulted in a larger volume entering the Timestream tables used in alerting. Alerts querying the tables with the largest increases started seeing sporadic NoData errors. Turning debug logging on didn't produce a smoking gun.

logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:16:00Z t=2024-02-08T18:16:07.472538875Z level=debug msg="Alert rule evaluated" results="[{Instance: State:Normal Error:<nil> Results:map[] Values:map[] EvaluatedAt:2024-02-08 18:16:00 +0000 UTC EvaluationDuration:7.472534108s EvaluationString:}]" duration=1.023625401s
logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:15:00Z t=2024-02-08T18:15:12.870256927Z level=debug msg="Alert rule evaluated" results="[{Instance: State:NoData Error:<nil> Results:map[] Values:map[A0:{Var:A Labels: Value:<nil>}] EvaluatedAt:2024-02-08 18:15:00 +0000 UTC EvaluationDuration:12.870241593s EvaluationString:[ var='A0' metric='NoData' labels={} value=null ]}]" duration=5.247031669s

The 18:15 evaluation triggered a NoData, while the 18:16 evaluation cleared it. This flips minute by minute, sometimes going as long as 6 minutes of success before another NoData.

The change in Grafana 9.5 to only fire NoData after the For period has elapsed would have hidden this from view.

This feels like something internal to the Grafana process is getting overwhelmed.

What you expected to happen:

It should have behaved as it did before, running alert queries without issue.

How to reproduce it (as minimally and precisely as possible):

Screenshots: a CloudWatch chart of table ingestion. The title and series names are deliberately obscured, but it charts SampleCount for a variety of Timestream tables; pixel resolution at this scale is 5 minutes. The alert flapping issue started shortly after the increase in ingestion rate.

Anything else we need to know?:

Environment:

sysadmin1139 commented 7 months ago

Full event-log for the failing 18:15 event.

logger=datasources rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:07.62330725Z level=debug msg="Querying for data source via SQL store" uid=S_Kqe5e7k orgId=1
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.870653042Z level=debug msg="State manager processing evaluation results" resultCount=1
logger=ngalert.scheduler rule_uid=_Nu3p9V4z org_id=1 version=10 attempt=0 now=2024-02-08T18:15:00Z t=2024-02-08T18:15:12.870256927Z level=debug msg="Alert rule evaluated" results="[{Instance: State:NoData Error:<nil> Results:map[] Values:map[A0:{Var:A Labels: Value:<nil>}] EvaluatedAt:2024-02-08 18:15:00 +0000 UTC EvaluationDuration:12.870241593s EvaluationString:[ var='A0' metric='NoData' labels={} value=null ]}]" duration=5.247031669s
logger=expr rule_uid=_Nu3p9V4z org_id=1 datasourceType=grafana-timestream-datasource t=2024-02-08T18:15:12.870148906Z level=debug msg="expression datasource query (seriesSet)" query=B
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.871205061Z level=debug msg="Saving alert states" count=1
logger=ngalert.state.manager rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.871138996Z level=debug msg="Setting alert state"
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874227359Z level=debug msg="Alert state changed creating annotation" newState="Normal (NoData)" oldState=Normal
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874466849Z level=error msg="Error saving alert annotation batch" error="json: error calling MarshalJSON for type *simplejson.Json: json: unsupported value: NaN"
logger=ngalert.state.historian rule_uid=_Nu3p9V4z org_id=1 t=2024-02-08T18:15:12.874597227Z level=debug msg="Done saving alert annotation batch"
iwysiu commented 6 months ago

Hi @sysadmin1139 , I've picked this up to take a look at it. What I suspect, based on what you've said, is that we're hitting a timeout in either Grafana or Timestream. I've looked at the changes between the 2.9.0 and 2.9.1 versions, and I'm not seeing anything that would affect the query runtime. I'm going to look at it a bit more this week and see if there's anything we can do to improve this, but can you give me the alert rule you're running (with any sensitive information removed)? Also, relating to my timeout theory, does it error less if you make the period between evaluations longer?

sysadmin1139 commented 6 months ago

It turns out the 2.9.0 vs. 2.9.1 correlation was mistaken. I've since downgraded to 2.9.0 and the effect is still present, though somewhat reduced. If there is a timeout in Timestream, it's not showing up in the CloudWatch error metrics. I also don't know how to tune the debug output to get more targeted data, if that would help.

sysadmin1139 commented 6 months ago

My current working theory is that this is some kind of thundering-herd problem, where smearing the queries would be more effective than spacing out the avalanche of queries.

Today I tried creating a second Timestream datasource, on the theory that it might change the contention problem. It did not.

iwysiu commented 6 months ago

Yeah, I agree that if it ends up being a timeout error we should give a more helpful message.

Can you give me the query (with any confidential information removed) and expressions you have set for the alert? It'll help me understand what's happening with it.

Also, can you tell me what happens if you set the evaluation interval to be longer? It's possible that Grafana is cancelling the query once it has waited out the one-minute evaluation interval, which could result in a NoData without causing an error as far as Timestream is concerned.

sysadmin1139 commented 6 months ago

The logging I quoted in the original report showed query times of a few seconds. I can sometimes reproduce this behavior in the 'Edit alert' screen with the 'Preview' button: it gets no data; refresh, data; refresh, data; refresh, no data.

iwysiu commented 6 months ago

Do you have waitForAllQueries toggled to true for your alerting query? By default we return partial results from Timestream, but alerting requires that we wait for the complete results before returning. If that's the issue, we should update the plugin to wait automatically for alert queries, but the toggle should fix it for now. Screenshot: the waitForAllQueries toggle in the query editor (2024-02-16).
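For context, here is a minimal sketch of why partial results can surface at all: Timestream's Query API returns results in pages, and early pages can be empty while the query is still executing, so "waiting for all queries" amounts to following NextToken to the end before handing anything back. The function name and the use of the AWS SDK for Go v2 below are illustrative assumptions, not the plugin's actual code.

```go
package timestreamsketch

import (
	"context"

	"github.com/aws/aws-sdk-go-v2/service/timestreamquery"
	"github.com/aws/aws-sdk-go-v2/service/timestreamquery/types"
)

// runToCompletion keeps paging until Timestream stops returning a NextToken,
// i.e. until the full result set is available. Early pages can be empty while
// the query is still executing, which an alert evaluation can misread as
// "no data" if those partial pages are returned as-is.
func runToCompletion(ctx context.Context, client *timestreamquery.Client, sql string) ([]types.Row, error) {
	var rows []types.Row
	input := &timestreamquery.QueryInput{QueryString: &sql}
	for {
		out, err := client.Query(ctx, input)
		if err != nil {
			return nil, err
		}
		rows = append(rows, out.Rows...)
		if out.NextToken == nil {
			return rows, nil // final page: the query has finished
		}
		input.NextToken = out.NextToken
	}
}
```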

sysadmin1139 commented 6 months ago

I did not have that set on my exemplar bad alert. I've set it and will let it bake for a few hours and see what happens.

sysadmin1139 commented 6 months ago

It's been the long weekend and the alarm hasn't flapped once. I think we found the root cause.

iwysiu commented 6 months ago

Great! It sounds like we have a workaround for now, but I'll move this to our backlog for making alert queries automatically wait.

Bug notes: we should check the request headers for the alert header and, if it's present, set WaitForResult on the query to true (see the example of checking the request headers in the CloudWatch datasource).
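A minimal sketch of that approach, assuming a hypothetical queryModel struct and the "FromAlert" header that Grafana alerting attaches to datasource requests; the header-lookup helper and the field names may differ from the plugin's actual code:

```go
package timestreamsketch

import (
	"context"
	"encoding/json"

	"github.com/grafana/grafana-plugin-sdk-go/backend"
)

// queryModel is a hypothetical stand-in for the plugin's query struct; the
// real field backing the UI's waitForAllQueries toggle may be named differently.
type queryModel struct {
	RawQuery      string `json:"rawQuery"`
	WaitForResult bool   `json:"waitForResult"`
}

type Datasource struct{}

// QueryData sketches the proposed behavior: when the request carries the
// "FromAlert" header that Grafana alerting attaches, force WaitForResult so
// alert queries always wait for the complete result set.
func (ds *Datasource) QueryData(ctx context.Context, req *backend.QueryDataRequest) (*backend.QueryDataResponse, error) {
	// Header lookup helper name may vary by SDK version.
	fromAlert := req.GetHTTPHeader("FromAlert") != ""

	resp := backend.NewQueryDataResponse()
	for _, dq := range req.Queries {
		var q queryModel
		if err := json.Unmarshal(dq.JSON, &q); err != nil {
			resp.Responses[dq.RefID] = backend.ErrDataResponse(backend.StatusBadRequest, err.Error())
			continue
		}
		if fromAlert {
			q.WaitForResult = true // alert evaluation needs the full result, not a partial page
		}
		// ... execute the Timestream query honoring q.WaitForResult ...
	}
	return resp, nil
}
```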

sysadmin1139 commented 6 months ago

Another thing we've noticed: the toggle doesn't visually stay selected after you save the alarm and navigate away, but it definitely takes effect.

iwysiu commented 6 months ago

Thanks! I made a separate issue for that https://github.com/grafana/timestream-datasource/issues/276

sysadmin1139 commented 6 months ago

I saw this particular stack trace a few times before, but wasn't able to reproduce it. It happens when previewing Timestream alert queries, but only sometimes.

https://url/alerting/MHKT--ZIz/edit?returnTo=%2Falerting%2Fgrafana%2FMHKT--ZIz%2Fview%3ForgId%3D1

TypeError: e.fields[0] is undefined

di@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:145:386
div
de@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:92534
vi@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:252:158
div
de@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:92534
fieldset
d@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1406:1760
div
div
o@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:6346:172
No@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:292:2635
div
div
div
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:711589
p@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:424:3558
div
form
R@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:927698
$o@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:292:5853
Uo@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:313:67
div
37417/b.Contents@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:5810:2272
div
div
div
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:711589
p@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:424:3558
div
b@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:5810:1289
t@https://fqdn/public/build/9076.637f4fd32666ef6824c7.js:1:211
7328/Vo<@https://fqdn/public/build/AlertingRuleForm.d245068ee1365350e124.js:313:914
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
u@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:855
WithErrorBoundary
Suspense
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
_c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8842:350
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1060861
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1062882
main
Cd@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8803:6825
t@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1057479
div
o@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:2006:4371
19175/t.KBarProvider@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:271053
l@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:6154:19566
c@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:301
u@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:1207:855
l@https://fqdn/public/build/3217.ea2a014a59d99960d7a7.js:2:1040511
Su@https://fqdn/public/build/2362.d06d6cb2bdc4890cef9d.js:8855:115
katebrenner commented 5 months ago

Closing this since we have a workaround, but we still plan to address https://github.com/grafana/timestream-datasource/issues/276