rosszurowski closed this issue 2 years ago
Question: wouldn't relay server outages or issues cause client-facing problems that users might struggle to debug?
For the relay servers: if a relay server is down, clients fail over to the next closest relay relatively quickly. As an example, the Dallas/Fort Worth relays were down for several hours a few weeks ago while their hosting provider upgraded its power distribution. Clients shifted to Chicago and other nearby relays.
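To make the failover behavior concrete, here is a rough sketch of "pick the closest relay that is still reachable". This is not Tailscale's actual implementation; the `Relay` type, the `pick_relay` helper, the region labels, and the latency numbers are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Relay:
    name: str          # hypothetical region label
    latency_ms: float  # hypothetical measured round-trip time
    healthy: bool      # whether the most recent probe succeeded


def pick_relay(relays: List[Relay]) -> Optional[Relay]:
    """Return the lowest-latency relay that is currently reachable, or None."""
    candidates = [r for r in relays if r.healthy]
    return min(candidates, key=lambda r: r.latency_ms, default=None)


# Illustrative data only: three relays with assumed latencies.
relays = [
    Relay("dfw", 12.0, healthy=True),
    Relay("ord", 31.0, healthy=True),
    Relay("den", 38.0, healthy=True),
]

before = pick_relay(relays)   # closest relay while everything is up
relays[0].healthy = False     # simulate the closest relay going down
after = pick_relay(relays)    # next closest healthy relay
```

In this toy model the client only needs a periodic health probe and a latency measurement per relay, so an outage in one region shows up as a somewhat higher round-trip time rather than a lost connection.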
+1 for this status page / communication stream (and +1 from my team). A sanity check would be much appreciated when we see authentication issues like the ones caused by https://github.com/tailscale/tailscale/issues/4168
We force our users to re-authenticate daily, which, from what we've heard, makes us an outlier. Unfortunately, that also means we're heavily affected by these outages.
Also, because of the nature of our business, we frequently add new devices to the network, which is also affected by the outages described in this issue.
We now have a status page at https://status.tailscale.com, which tracks the current status of the main Tailscale components and has a section for ongoing and recent incidents.
A few users have written in asking for a status page indicating whether the coordination server is online or having issues. We've put this off, as Tailscale is designed to function even without the control server, so rare outages already have minimal impact. The main restriction is that you can't add new devices to your network while the control server is down.
There's still value in having a status page: it lets us communicate our resiliency to users, reassures them that the control server isn't down while they're debugging networking problems, and gives them a place to look when there is an outage.
Our status page will need to distinguish between: