m-lab / mlab-ns

M-Lab name server (load balancer for M-Lab servers)
Apache License 2.0

Stop updating status when number of online SliverTools goes below a certain threshold #243

Closed: nkinkade closed this issue 3 years ago

nkinkade commented 4 years ago

Presently, if one of the monitoring services that mlab-ns relies on fails in a way that makes it return a down status (0), mlab-ns could mark the entire fleet as down. Once the number of online SliverTools drops below a certain threshold, mlab-ns should stop marking SliverTools as down. This threshold should be set below the threshold for the alert on too many NDT servers being down, so that we still receive an alert that a problem exists.
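A minimal sketch of the proposed guard, assuming statuses arrive as a mapping from a SliverTool identifier to 1 (online) or 0 (down). The names `MIN_ONLINE_FRACTION`, `should_apply_status_update`, and `update_statuses` are hypothetical and are not part of mlab-ns; the threshold value is also an assumption and would need to be tuned relative to the existing NDT alert threshold.

```python
from typing import Dict

# Hypothetical threshold: if fewer than this fraction of SliverTools would be
# online after an update, treat the monitoring data as suspect and skip it.
MIN_ONLINE_FRACTION = 0.5


def should_apply_status_update(
    new_statuses: Dict[str, int],
    min_online_fraction: float = MIN_ONLINE_FRACTION,
) -> bool:
    """Return True if the reported statuses look trustworthy enough to apply."""
    if not new_statuses:
        # No data at all is treated as a monitoring failure.
        return False
    online = sum(1 for status in new_statuses.values() if status == 1)
    return online / len(new_statuses) >= min_online_fraction


def update_statuses(
    current: Dict[str, int],
    new_statuses: Dict[str, int],
    min_online_fraction: float = MIN_ONLINE_FRACTION,
) -> Dict[str, int]:
    """Merge new statuses into current ones unless too few tools would be online."""
    if not should_apply_status_update(new_statuses, min_online_fraction):
        # Keep the last known statuses; a separate alert is expected to flag
        # the underlying monitoring problem.
        return dict(current)
    updated = dict(current)
    updated.update(new_statuses)
    return updated
```

Freezing the whole update, rather than applying only some of the down statuses, matches the issue title ("stop updating status") and keeps the decision simple; the trade-off is that genuinely down tools are not removed while the guard is active.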

nkinkade commented 3 years ago

I've added an agenda item to the next Engineering meeting (2021-04-07) to discuss this.