m-lab / mlab-ns

M-Lab name server (load balancer for M-Lab servers)
Apache License 2.0

mlab-ns directs clients to cloud nodes that are not online #223

Closed critzo closed 4 years ago

critzo commented 4 years ago

As part of a piecewise deployment, users reported "stalling" tests. On further investigation, the clients that appear stalled had been given a cloud node by mlab-ns. mlab-ns serves the hostname, but when the client attempts to run a test, the connection fails because the cloud node is not online. See the error below, taken from the Chrome developer console when loading the page https://speed.digitalinclusion.org:

Using M-Lab Server ndt-iupui-mlab1-chs0c.measurement-lab.org
main.js:60 Location received
main.js:25 Test started.  Waiting for connection to server...
main.js:25 WebSocket connection to 'wss://ndt-iupui-mlab1-chs0c.measurement-lab.org:3010/ndt_protocol' failed: Error in connection establishment: net::ERR_NAME_NOT_RESOLVED

mlab-ns / locate should not serve hostnames to any clients for servers that are not online.

stephen-soltesz commented 4 years ago

We believe that chs0c was briefly online yesterday for testing new certificates. Unfortunately, mlab-ns preserves the last-seen health status of a machine once Prometheus stops returning it in health check queries. It is therefore likely that the node was shut down quickly, and the "offline" status did not have time to propagate before the metrics disappeared from Prometheus, so mlab-ns preserved the last-seen status as "online".
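
To illustrate the failure mode described above, here is a minimal sketch of a last-seen status cache. It is not mlab-ns's actual code: the `StatusCache` class, its methods, and the `max_age_seconds` parameter are all hypothetical names chosen for this example. With no age limit, a host that disappears from health check results keeps its last-seen "online" status indefinitely; with an age limit (one possible mitigation, not necessarily the one the project adopted), the entry stops being served once it goes stale.

```python
class StatusCache:
    """Hypothetical last-seen health cache (illustration only, not mlab-ns code)."""

    def __init__(self, max_age_seconds=None):
        # None means "preserve the last-seen status forever".
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # hostname -> (online, last_seen_timestamp)

    def update(self, results, now):
        """Record the hosts present in one health check result.

        Hosts absent from `results` are left untouched, which is what lets
        a stale "online" status survive after the metrics disappear.
        """
        for hostname, online in results.items():
            self._entries[hostname] = (online, now)

    def online_hosts(self, now):
        """Return hosts considered online, skipping entries older than the limit."""
        hosts = []
        for hostname, (online, seen) in self._entries.items():
            stale = (
                self.max_age_seconds is not None
                and now - seen > self.max_age_seconds
            )
            if online and not stale:
                hosts.append(hostname)
        return sorted(hosts)


host = "ndt-iupui-mlab1-chs0c.measurement-lab.org"
preserving = StatusCache()                   # keeps last-seen status forever
expiring = StatusCache(max_age_seconds=300)  # assumed 5-minute staleness limit

for cache in (preserving, expiring):
    cache.update({host: True}, now=0)  # node is briefly online
    cache.update({}, now=60)           # metrics vanish; no "offline" sample arrives

print(preserving.online_hosts(now=3600))  # still lists the host
print(expiring.online_hosts(now=3600))    # empty: the stale entry is dropped
```

The timestamps are passed in explicitly so the example is deterministic; a real service would read the clock instead.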

@robertodauria has removed the site & sliver for this cloud site this morning. It should no longer be returned.

stephen-soltesz commented 4 years ago

@nkinkade FYI.

stephen-soltesz commented 4 years ago

@critzo we believe this is resolved. Please close once you're satisfied it's confirmed.

nkinkade commented 4 years ago

Some history for anyone looking at this issue:

https://docs.google.com/document/d/1g-Jr6OqbeERWb0xHcJ4-kndFEk4vlzFn3GgZVH-lVrg/