Fault-tolerant cluster management/coordination

With Vertigo's core API complete, I'm beginning to take a look at the deficiencies of Vertigo's current design. There are many ambitious plans for the future - strongly ordered messaging, exactly-once processing, stateful components, dynamic configuration changes, and high level operations - but I feel there are some structural changes that need to be made before any of these excellent features can be realized.

In particular, a lot of these features require a fault-tolerant foundation. In order to properly guarantee things like exactly-once processing, Vertigo must be able to guarantee that information about messages won't be lost in a failure. Currently, Vertigo cannot provide fault-tolerance, as its only clustering mechanism - Via - itself lacks a fault-tolerant guarantee. Additionally, most of Vertigo's cluster coordination takes place in a single coordinator verticle which represents a single point-of-failure for all networks.

I am in the process of creating a fault-tolerant cluster manager/coordinator for Vertigo.

The goal of the cluster manager will be to support deploying network components across a cluster of Vert.x instances and manage redeployment of those components when a node fails. In order to ensure failover is properly handled in a distributed system - only one node attempts to redeploy a failed node's deployments - the cluster will be coordinated by leader election. In fact, all requests to the cluster will be directed to a cluster leader, ensuring single-copy consistency, because in the context of a precise framework like Vertigo, it could be detrimental to a network to have any type of inconsistent information as it related to component deployments.

The cluster will also be used for network coordination. As I mentioned, currently cluster coordination takes place in a single fallible verticle - the coordinator. But the new cluster will act as a fault-tolerant coordinator by exposing a simple key-value store with features for exposing cluster events to Vertigo components. For instance, one purpose of the data store could be as a vehicle for components to heartbeat one another. Periodically, each component would set a key in the data store. On the other end, each component would watch the keys of the other components in its network. Once all the keys in the network have been set, the components know that it's safe to begin emitting messages. If a key times out, the components know that another component in the network may have failed. Providing this type of coordination via the cluster helps improve the availability of networks and opens the door for much more finely grained control.

One of the earliest features the cluster will likely be tasked for is storing and updating network configurations. The publish-subscribe based system currently used in Vertigo can potentially be unreliable during failures. Rather than require components to actually subscribe to the output of other components, Vertigo will instead coordinate connections between components via the cluster. This will ensure that even if a component fails, the component's connections will not be closed, and messages will be properly timed out during the failure. Additionally, components can watch component connection information, updating their internal connection contexts when changes occur. This means networks can be dynamically reconfigured simply by updating a network configuration in the cluster.

This should be going into the next big release. Currently, the fault-tolerant cluster manager and basic data store with events have all been implemented in the cluster-management branch. Only the component coordination needs to be moved to the cluster.

kuujo / vertigo

Fault-tolerant cluster management/coordination #28