Closed by critzo 4 years ago
We believe that chs0c was briefly online yesterday for testing new certificates. Unfortunately, mlab-ns preserves a machine's last-seen health status once Prometheus stops returning that machine in health check queries. It is therefore likely that the node was shut down quickly, so the "offline" status did not have time to propagate before the metrics disappeared from Prometheus, and mlab-ns kept the last-seen status of "online".
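To illustrate the behavior described above, here is a minimal Python sketch. It is not the mlab-ns code: the function names, the `ONLINE`/`OFFLINE` constants, the dictionary shapes, and the `node-a`/`node-b` hostnames are all hypothetical. It contrasts preserving the last-seen status for machines missing from a health check result with treating those machines as offline.

```python
from typing import Dict

# Hypothetical status values; the real mlab-ns representation may differ.
ONLINE = "online"
OFFLINE = "offline"


def update_preserving(last_seen: Dict[str, str], results: Dict[str, str]) -> Dict[str, str]:
    """Machines absent from `results` keep their last-seen status."""
    updated = dict(last_seen)
    updated.update(results)
    return updated


def update_expiring(last_seen: Dict[str, str], results: Dict[str, str]) -> Dict[str, str]:
    """Machines absent from `results` are marked offline."""
    updated = {host: OFFLINE for host in last_seen}
    updated.update(results)
    return updated


# "node-a" was last seen online, then vanished from the health check results.
last_seen = {"node-a": ONLINE, "node-b": ONLINE}
results = {"node-b": OFFLINE}

preserved = update_preserving(last_seen, results)
expired = update_expiring(last_seen, results)
```

With these inputs, `preserved` still reports `node-a` as online even though no fresh result exists for it, which matches the failure mode described in this comment. `expired` reports it as offline. Whether expiring missing machines is the right policy for mlab-ns is a design question this sketch does not settle.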
@robertodauria has removed the site & sliver for this cloud site this morning. It should no longer be returned.
@nkinkade FYI.
@critzo we believe this is resolved. Please close once you're satisfied it's confirmed.
Some history for anyone looking at this issue:
https://docs.google.com/document/d/1g-Jr6OqbeERWb0xHcJ4-kndFEk4vlzFn3GgZVH-lVrg/
During a piecewise deployment, users reported "stalling" tests. On further investigation, the clients that appear stalled had been given a cloud node by mlab-ns. mlab-ns serves the hostname, but when the client attempts to run a test it gets an error because the cloud node is not online. See the error below, taken from the Chrome developer console when loading the page https://speed.digitalinclusion.org
mlab-ns / locate should not serve hostnames to any clients for servers that are not online.