groundwater / federation

A Federated Message Network in Node.js
http://underflow.ca/federation

Fault-tolerance #6

Open mcandre opened 11 years ago

mcandre commented 11 years ago

I'm working on a Node web app that continually makes calls to a dozen APIs, and the calls often crash. I'd love to use something like actors to automatically restart the crashed calls, but I haven't found a Node package that offers fault-tolerance. Federation looks promising, but I don't see fault-tolerance anywhere in the documentation. Are there plans to add this feature?

groundwater commented 11 years ago

The primary goal of federation is a robust delivery network. I would rather federation be a strong platform on which one could build a fault-tolerant setup than have fault tolerance built in natively. Think of it as the IP layer, upon which a good TCP layer could be built.

The goals are in line with one another, but I think layering is a better approach than mixing.

Currently there are scenarios where messages can be dropped without good error handling. These scenarios are not inherent to federation; they come from misconfigured routes. I would want to strengthen the error bubbling before designing a fault-tolerant system around federation. That is definitely planned, however.

I think a high degree of fault tolerance can be achieved using a few design principles that federation would work well with:

  1. Make worker processes stateless. Workers should process a request, and either reply or commit persisted data to a database. No internal state exists between requests.
  2. Cluster workers. Use a cluster of workers, so that if one dies, new requests are not affected.
  3. Make errors visible and explicit. This is the "shit happens" clause. Sometimes requests fail; make sure that failure is seen at the request origin, even if that means returning a 500 page to the client.
  4. Use orderless startup. It shouldn't matter what order processes are started.
  5. Use independent process control. Keep processes alive with Upstart or the native init system. When a process dies, current requests will error out, but new requests should not be affected.
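
To make principles 1 and 3 concrete, here is a minimal sketch of a stateless handler wrapped so that every failure turns into an explicit reply. It is only an illustration: `handleRequest`, `makeExplicit`, and the `store` object with its `save` method are hypothetical names, not part of federation's API, and the `{ status, body }` reply shape is an assumption.

```javascript
// Hypothetical sketch: none of these names come from federation's API.

// A stateless handler: everything it needs arrives with the request, and
// anything worth keeping is written to the injected store before replying.
async function handleRequest(request, store) {
  const total = request.items.reduce((sum, n) => sum + n, 0);
  await store.save(request.id, total);
  return total;
}

// Wraps a handler so that any failure becomes an explicit reply for the
// request origin instead of a silently dropped message.
function makeExplicit(handler) {
  return async function (request, store) {
    try {
      const body = await handler(request, store);
      return { status: 200, body };
    } catch (err) {
      return { status: 500, error: err.message };
    }
  };
}
```

Because the handler keeps nothing between calls, any worker in a cluster could serve any request, and a worker that dies mid-request costs only that one request.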

Federation, with better error propagation, would work well in the above design. As long as errors bubble upward towards the request origin, each link in the chain can decide how to handle errors, allowing you to introduce robustness where it makes the most sense.
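
As a rough sketch of what "each link decides" could look like, one link might retry a flaky call a few times and then let the error continue upward. `withRetry` is a hypothetical helper written for this example; it is not something federation provides, and it assumes `attempts` is at least 1.

```javascript
// Hypothetical helper, not part of federation: retries a failing async call
// a limited number of times, then lets the last error bubble up.
// Assumes attempts >= 1.
async function withRetry(call, attempts) {
  let lastError;
  for (let i = 0; i < attempts; i++) {
    try {
      return await call();
    } catch (err) {
      lastError = err; // this link chooses to absorb the failure and try again
    }
  }
  throw lastError; // out of attempts: the error continues toward the origin
}
```

A link that talks to an unreliable API might wrap its call as `withRetry(() => callApi(), 3)`, where `callApi` stands in for whatever the link actually does, while a link that cannot safely repeat its work would skip the wrapper and pass the error straight up.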