Reduce the number of retries from 5 to 3 so that recurring issues don't slip past our monitoring.
I'd generally suggest rewriting record-tester so that it publishes Prometheus metrics instead of raising PagerDuty alerts itself.
Alerts would then be based on those metrics in Grafana.
Retries could produce a metric that would be observed… or record-tester could fail immediately and retry the entire test.
Alerts could then be configured to tolerate short, isolated failures of record-tester, while repeated failures would raise an actual PagerDuty alert.
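
To make the idea a bit more concrete, here is a minimal sketch in Python. It is not record-tester's actual code: the metric names (`record_tester_runs_total`, `record_tester_consecutive_failures`), the `RunMetrics` class, the `should_page` helper and the threshold of 3 are all hypothetical. It only illustrates counting outcomes as metrics and paging on repeated failures rather than on a single one.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunMetrics:
    """Tracks test outcomes; a stand-in for real Prometheus counters/gauges."""

    counts: Counter = field(default_factory=Counter)
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        # Count every run, and reset the failure streak on any success.
        self.counts["success" if success else "failure"] += 1
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1

    def render(self) -> str:
        # Render the values in the Prometheus text exposition format.
        lines = [
            f'record_tester_runs_total{{result="{result}"}} {self.counts[result]}'
            for result in ("success", "failure")
        ]
        lines.append(
            f"record_tester_consecutive_failures {self.consecutive_failures}"
        )
        return "\n".join(lines) + "\n"


def should_page(metrics: RunMetrics, threshold: int = 3) -> bool:
    # Hypothetical alert condition: tolerate isolated failures,
    # page only once failures repeat `threshold` times in a row.
    return metrics.consecutive_failures >= threshold


if __name__ == "__main__":
    metrics = RunMetrics()
    for outcome in (False, True, False, False, False):
        metrics.record(outcome)
    print(metrics.render(), end="")
    print("page:", should_page(metrics))
```

In a real setup the condition in `should_page` would more likely live in an alert rule evaluated on the scraped metrics, so the tolerance could be tuned without redeploying record-tester.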
This is just a thought; it's not in the scope of this PR/task.