rchain-community / mainnet-outage


Bad CPU usage alert #13

Closed Bill-Kunj closed 1 year ago

Bill-Kunj commented 1 year ago

A Discord alert shows >100% CPU usage on devnet, so there may be something wrong with the metric calculation.

Firing

Value: B=106.55205059275494, C=1
Labels:
 - alertname = CPU percentage Test
 - grafana_folder = sample up alert
Annotations:
Source: http://localhost:3000/alerting/grafana/QwDnEXL4k/view?orgId=1
Silence: http://localhost:3000/alerting/silence/new?alertmanager=grafana&matcher=alertname%3DCPU+percentage+Test&matcher=grafana_folder%3Dsample+up+alert
Dashboard: http://localhost:3000/d/9rN2N2LVz?orgId=1
Panel: http://localhost:3000/d/9rN2N2LVz?orgId=1&viewPanel=123127
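
One possible explanation for a reading above 100%, offered only as an assumption since the panel query is not shown in this issue: if the percentage is derived from CPU-seconds consumed per wall-clock second, a workload that uses more than one core reports more than 100% unless the result is divided by the core count. A minimal Python sketch, with hypothetical names (`cpu_percent`, `num_cores`) and made-up sample numbers, illustrates the effect:

```python
def cpu_percent(cpu_seconds_delta: float,
                wall_seconds_delta: float,
                num_cores: int = 1) -> float:
    """Return CPU usage as a percentage of the capacity of num_cores cores.

    Hypothetical helper for illustration; not taken from the dashboard.
    """
    if wall_seconds_delta <= 0:
        raise ValueError("wall_seconds_delta must be positive")
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    return 100.0 * cpu_seconds_delta / (wall_seconds_delta * num_cores)


# Made-up sample: 10.655 CPU-seconds consumed during a 10-second window.
raw = cpu_percent(10.655, 10.0)                      # treated as a single core
normalized = cpu_percent(10.655, 10.0, num_cores=2)  # divided across two cores

print(f"raw={raw:.2f}% normalized={normalized:.2f}%")
```

If the devnet host has more than one core, checking whether the panel query divides by the core count would be a reasonable first step.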

Strecoza33 commented 1 year ago

A) There were two CPU percentage tests. I deleted the first one, which we had not labeled and which was called "Panel Title." B) For the CPU percentage test, I changed the evaluation interval from 10 seconds to 1 minute. The 10-second interval was too aggressive: it was good for testing the alert, but not suitable as a final standard setting.

Bill-Kunj commented 1 year ago

I think there's a floating-point precision issue somewhere. We may need to round our percentages to two or three decimal places to avoid false alarms. @Strecoza33 @azazime
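
A rough sketch of what that rounding could look like, in Python. The function name `normalize_percent` and the clamp to the 0 to 100 range are assumptions added for illustration, not part of the existing Grafana setup. Note that rounding on its own only removes tiny overshoots (for example 100.00000000000003); a value such as 106.55205059275494 stays above 100 after rounding, so the sketch also clamps.

```python
def normalize_percent(value: float, digits: int = 2) -> float:
    """Round a percentage to `digits` decimal places and clamp it to [0, 100].

    Hypothetical helper; the clamp is an assumption, not the dashboard's logic.
    """
    rounded = round(value, digits)
    return min(100.0, max(0.0, rounded))


# Tiny floating-point overshoot: rounding is enough to bring it back to 100.
tiny_overshoot = normalize_percent(100.00000000000003)

# The value from the alert: rounding alone leaves it above 100,
# so only the clamp brings it back into range.
rounded_only = round(106.55205059275494, 2)
clamped = normalize_percent(106.55205059275494)

print(tiny_overshoot, rounded_only, clamped)
```

Clamping hides the symptom rather than explaining it, so it is probably best treated as a guard against false alarms and not as a fix for the underlying query.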

DPMBarnes commented 1 year ago

@DPMBarnes and @azazime believe this is fixed.

DPMBarnes commented 1 year ago

This might need reopening, but we believe it is fixed.