Fault-tolerance - Githubissues

The primary goal of federation is a robust delivery network. I would rather federation were a strong platform upon which one could build a fault-tolerant setup than have fault tolerance built in natively. Think of it as the IP layer, upon which a good TCP layer could be built.

The goals are in line with one another, but I think layering is a better approach than mixing.

Currently there are scenarios where messages can be dropped without good error handling. These scenarios are not inherent to federation; they arise from misconfigured routes. I would want to strengthen the error bubbling before designing a fault-tolerant system around federation. That is definitely planned, however.
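As a rough illustration of what stronger error bubbling could mean for a misconfigured route, the sketch below fails loudly instead of dropping the message. The `Router` class, `UnroutableError`, and the route names are hypothetical and are not federation's API.

```python
class UnroutableError(Exception):
    """Raised when no route matches a destination (hypothetical name)."""


class Router:
    """Toy in-memory router; an illustration, not federation's API."""

    def __init__(self):
        self.routes = {}  # destination name -> handler callable

    def add_route(self, destination, handler):
        self.routes[destination] = handler

    def send(self, destination, message):
        handler = self.routes.get(destination)
        if handler is None:
            # Fail loudly at the sender instead of silently dropping.
            raise UnroutableError(f"no route to {destination!r}")
        return handler(message)


router = Router()
router.add_route("/workers/echo", lambda msg: msg)

delivered = router.send("/workers/echo", "hello")

try:
    router.send("/workers/missing", "hello")
    error = None
except UnroutableError as exc:
    error = str(exc)
```

With this shape, a typo in a route name surfaces as an exception at the sender, which is the point where it can be logged or reported back toward the request origin.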

I think a high degree of fault tolerance can be achieved using a few design principles that federation would work well with:

Make worker processes stateless. Workers should process a request, and either reply or commit persisted data to a database. No internal state exists between requests.
Cluster workers. Use a cluster of workers, so that if one dies, new requests are not affected.
Make errors visible and explicit. This is the "shit happens" clause. Sometimes requests fail; make sure that failure is seen at the request origin, even if that means returning a 500 page to the client.
Use orderless startup. It shouldn't matter what order processes are started.
Use independent process control. Keep processes alive with Upstart or the native init system. When a process dies, its in-flight requests will error out, but new requests should not be affected.
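The first three principles above can be sketched in a few lines. This is a minimal in-process model, assuming a dictionary as a stand-in for the database and a hypothetical `Cluster` class doing round-robin dispatch; real workers would be separate processes kept alive by the init system.

```python
import itertools


class Worker:
    """Stateless worker: nothing is kept on the instance between requests."""

    def __init__(self, name, store):
        self.name = name
        self.store = store  # stand-in for an external database
        self.alive = True   # flipped to False to simulate a crash

    def handle(self, key, value):
        if value is None:
            raise ValueError("missing value")
        self.store[key] = value  # commit persisted data
        return 200, f"stored {key}"


class Cluster:
    """Round-robin dispatch that skips dead workers for new requests."""

    def __init__(self, workers):
        self.workers = workers
        self._cycle = itertools.cycle(workers)

    def request(self, key, value):
        for _ in range(len(self.workers)):
            worker = next(self._cycle)
            if not worker.alive:
                continue
            try:
                return worker.handle(key, value)
            except Exception as exc:
                # Make the failure visible at the request origin.
                return 500, f"error: {exc}"
        return 500, "error: no workers available"


database = {}
a = Worker("a", database)
b = Worker("b", database)
cluster = Cluster([a, b])

first = cluster.request("x", 1)
a.alive = False                      # one worker dies
second = cluster.request("y", 2)     # new requests still succeed
third = cluster.request("z", 3)
failed = cluster.request("w", None)  # a bad request becomes an explicit 500
b.alive = False
unavailable = cluster.request("v", 5)
```

Because the workers hold no state of their own, losing `a` costs nothing beyond the requests it was handling at the time; everything committed so far is still in `database`.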

Federation, with better error propagation, would work well in the above design. As long as errors bubble upward toward the request origin, each link in the chain can decide how to handle them, which lets you introduce robustness where it makes the most sense.
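To make the "each link decides" idea concrete, here is a small sketch in which links are plain functions wrapping the next link. All names (`UpstreamError`, `passthrough`, `retrying`, `flaky_backend`, `origin`) are made up for illustration and say nothing about how federation itself is implemented.

```python
class UpstreamError(Exception):
    """An error reported by a link further down the chain (hypothetical)."""


def flaky_backend(failures):
    """Test stub: a handler that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def handler(request):
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise UpstreamError("backend unavailable")
        return f"ok: {request}"

    return handler


def passthrough(next_link):
    """A link that adds no handling: errors bubble up unchanged."""
    def handler(request):
        return next_link(request)
    return handler


def retrying(next_link, attempts):
    """A link that chooses to absorb failures by retrying (attempts >= 1)."""
    def handler(request):
        last_error = None
        for _ in range(attempts):
            try:
                return next_link(request)
            except UpstreamError as exc:
                last_error = exc
        # Out of attempts: let the error continue toward the origin.
        raise last_error
    return handler


def origin(chain, request):
    """The request origin turns any error that reaches it into a 500."""
    try:
        return 200, chain(request)
    except UpstreamError as exc:
        return 500, str(exc)


fragile = origin(passthrough(flaky_backend(1)), "a")
robust = origin(passthrough(retrying(flaky_backend(1), attempts=2)), "a")
exhausted = origin(retrying(flaky_backend(3), attempts=2), "a")
```

The same backend failure produces a 500 in the `fragile` chain and a success in the `robust` one; the only difference is that one link opted in to retrying. Nothing below it needed to change.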

groundwater / federation

Fault-tolerance #6