hashicorp / waypoint

A tool to build, deploy, and release any application on any platform.
https://waypointproject.io

Deploy Waypoint in a highly available fashion #2785

Open josegonzalez opened 2 years ago

josegonzalez commented 2 years ago

It would be great to be able to run Waypoint in a highly available fashion. Currently, Waypoint's state lives in a database file stored on disk, and that file needs to be persisted on some external storage system to ensure that the loss of a node does not result in the loss of deployment state. In many cases, this also means surviving the loss of an entire data center in a multi-data-center installation, such as in AWS where a region is composed of many availability zones.

It would be great to be able to run several copies of Waypoint, with a primary replicating its state to the other instances. Some data does not have to be persisted, such as log information (if that is stored in the data.db dump), if skipping it makes the implementation simpler or faster. Ideally we could deploy this in a similar fashion to Consul/Nomad/Vault, where state is stored in Raft and a clustered install can withstand the loss of one or more instances.

This is a significant blocker to even considering Waypoint in our environment: being able to guarantee that we know the past state of deploys during a general service outage is a key requirement for adopting any deployment system.

briancain commented 2 years ago

Hi @josegonzalez,

Thanks for asking a great question. A highly available Waypoint server is something we have been thinking about and planning for as well. I'm going to separately address scalability from availability in this post.

On scalability: today we plan to continue focusing on making Waypoint “single-server scalable” (vertically scalable) and improving any areas where it fails to live up to that expectation. This is similar to our initial approach with Vault, which was only vertically scalable for the first many years of its life. Unlike Vault, however, Waypoint is not in the critical runtime path for running applications, only for deploying and releasing them. In that sense, we expect the load Waypoint sees to be much lower than something like Vault, and given the scale we achieved with single-server Vault for so long, we're confident we can do the same with Waypoint. Scaling the logs and exec features is a separate concern that we hope to address in the future. (Note: we're not saying we'll never have horizontal scalability, just that it's not a short-term focus, and this is why.)

On availability: we are focused on making sure Waypoint can be recovered from any downtime as quickly as possible. We’ve architected the Waypoint server and all of its components to gracefully sustain downtime as long as the mean time to repair is low: while Waypoint is down, existing applications with entrypoints fail static, and new instances simply can’t launch. Given the background in the scalability section, we believe we can tolerate somewhat longer Waypoint downtime in worst-case scenarios (i.e., low minutes rather than seconds) since Waypoint isn't directly in the runtime path.

Initially, we feel that using a persistence mechanism outside of Waypoint that can replicate the data.db file and restart Waypoint quickly is a practical tradeoff versus implementing our own replication, given the state the project is in today. We do not have anything documented about this currently, but we should, and we can use this issue to track the feature request. In that documentation we can show how to rely on existing replication solutions for data resilience without requiring Waypoint to handle it on its own; a sketch of one approach follows below. For example, when Waypoint is installed on Kubernetes, we use a StatefulSet, which does a good job of recovering itself relatively quickly.
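To make the "replicate the file externally" idea concrete, here is a minimal, hedged sketch of taking a consistent copy of a Bolt database file using bbolt's `Tx.WriteTo`, which an external tool could then ship to object storage or a replicated volume. This is not an official backup mechanism, and the file paths are illustrative assumptions only.

```go
package main

import (
	"log"
	"os"

	bolt "go.etcd.io/bbolt"
)

// snapshot copies a consistent point-in-time snapshot of a Bolt database
// file to dstPath. Tx.WriteTo streams the whole database within a single
// read transaction, so the copy is internally consistent.
func snapshot(srcPath, dstPath string) error {
	// Open read-only. Note that a running server typically holds an
	// exclusive lock on its data.db, so in practice this would run
	// against a volume-level copy or while the server is stopped.
	db, err := bolt.Open(srcPath, 0o600, &bolt.Options{ReadOnly: true})
	if err != nil {
		return err
	}
	defer db.Close()

	dst, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer dst.Close()

	return db.View(func(tx *bolt.Tx) error {
		_, err := tx.WriteTo(dst)
		return err
	})
}

func main() {
	// Paths are assumptions for illustration, not Waypoint defaults.
	if err := snapshot("/data/data.db", "/backup/data.db.snap"); err != nil {
		log.Fatal(err)
	}
}
```

A sidecar or cron-style job could run something like this on a schedule and push the snapshot off-host, which is the kind of "existing replication solution" approach we'd document here.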

As you pointed out in this issue, in the future we can look into "hot standby" approaches instead. We purposely chose our internal technologies to match our other products (Nomad, Consul, Vault) because we are able to layer Raft on top if we need to for replication. This just isn't something we are doing right now, but we aren't precluding it in the future.
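For readers unfamiliar with what that layering would involve, here is a hedged sketch of the typical hashicorp/raft wiring: a server implements `raft.FSM` over its state store and replicates log entries to its peers. This is not Waypoint's implementation, just the general shape of the library; the node ID, address, and paths are assumptions, and the FSM methods are stubbed for brevity.

```go
package main

import (
	"io"
	"log"
	"os"
	"time"

	"github.com/hashicorp/raft"
	raftboltdb "github.com/hashicorp/raft-boltdb"
)

// fsm is where replicated log entries would be applied to the server's
// own state store. All three methods are stubs in this sketch.
type fsm struct{}

func (f *fsm) Apply(l *raft.Log) interface{}       { return nil }
func (f *fsm) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }
func (f *fsm) Restore(rc io.ReadCloser) error      { return rc.Close() }

func main() {
	config := raft.DefaultConfig()
	config.LocalID = raft.ServerID("node-1") // hypothetical node ID

	// Durable log/stable storage backed by BoltDB, plus file snapshots.
	store, err := raftboltdb.NewBoltStore("/data/raft.db")
	if err != nil {
		log.Fatal(err)
	}
	snaps, err := raft.NewFileSnapshotStore("/data/snapshots", 2, os.Stderr)
	if err != nil {
		log.Fatal(err)
	}
	transport, err := raft.NewTCPTransport("127.0.0.1:7000", nil, 3, 10*time.Second, os.Stderr)
	if err != nil {
		log.Fatal(err)
	}

	r, err := raft.NewRaft(config, &fsm{}, store, store, snaps, transport)
	if err != nil {
		log.Fatal(err)
	}

	// Bootstrap a single-node cluster; additional servers would later be
	// added as voters once they are reachable.
	r.BootstrapCluster(raft.Configuration{
		Servers: []raft.Server{{ID: config.LocalID, Address: transport.LocalAddr()}},
	})
}
```

Consul, Nomad, and Vault's integrated storage all build on this same library, which is why keeping the option open for Waypoint is straightforward even though it isn't on the short-term roadmap.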

I hope that makes some sense and answers your question. Thanks again for the feature request!

koolay commented 2 years ago

Would you consider trying Litestream? @briancain

acaloiaro commented 2 years ago

@koolay Litestream is built for SQLite, and Waypoint's database is BoltDB.

koolay commented 2 years ago

@acaloiaro I mean that SQLite would be better: it is simple, and it can work both as a single node and distributed, e.g. via LiteFS.