pongasoft / glu

Deployment Automation Platform
Apache License 2.0

Feature request: System recovery on node failure #303

Open a8t3r opened 9 years ago

a8t3r commented 9 years ago

Thanks a lot for your product!

I'm currently looking into troubleshooting and system recovery techniques for a glu infrastructure. Take a look at Marathon:

Imagine that one of the datacenter workers trips over a power cord and a server gets unplugged. No problem for Marathon, it moves the affected search service and Rails tasks to a node that has spare capacity. The engineer may be temporarily embarrassed, but Marathon saves him from having to explain a difficult situation!

So, imagine that we have only three working nodes (N1, N2, N3) with three services (S1, S2, S3) running on them, one per node (N1 -> S1, N2 -> S2, N3 -> S3). The resource utilization on the nodes is U1, U2, U3. At some moment node N2 goes down and service S2 shows the state 'UNDEPLOYED' in the glu console. OK, we've got a problem! Based on the spare capacity on the other nodes (for example, U1 << U3), the glu orchestration engine would automatically decide to redeploy service S2 from node N2 to node N1.
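The placement decision in this scenario can be sketched roughly as follows. This is only an illustration, not anything glu provides: the function name `pick_target_node`, the utilization figures, and the `max_utilization` cutoff are all assumptions made up for the example.

```python
from typing import Dict, Optional, Set


def pick_target_node(
    utilization: Dict[str, float],
    failed: Set[str],
    max_utilization: float = 1.0,
) -> Optional[str]:
    """Return the surviving node with the lowest utilization.

    Returns None when every surviving node is at or above the cutoff.
    All names and thresholds here are hypothetical.
    """
    # Keep only nodes that are still up and have headroom left.
    candidates = {
        node: used
        for node, used in utilization.items()
        if node not in failed and used < max_utilization
    }
    if not candidates:
        return None
    # The least-utilized node has the most spare capacity.
    return min(candidates, key=candidates.get)


# Example values for U1, U2, U3 (invented, with U1 << U3).
utilization = {"N1": 0.10, "N2": 0.50, "N3": 0.90}
target = pick_target_node(utilization, failed={"N2"})
print(target)  # N1
```

With U1 << U3 and N2 down, the sketch selects N1 as the new home for S2.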

I can certainly write a custom ZooKeeper listener for that purpose, but I think this feature would be more useful in the glu core.

ypujante commented 9 years ago

I think I understand what you are asking. glu has been designed from the very beginning to not do anything automatically. It was a conscious decision. glu has been designed as a platform on top of which you can build this kind of behavior.

In your case, two things would need to happen: the static model needs to be modified, and glu needs to "deploy" it. It is not obvious how to automate the modification of the model, since the model is essentially a black box to glu and would differ greatly from customer to customer.
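To make the first of those steps concrete, here is a minimal sketch of an automated model modification. It assumes the model is a JSON-like dictionary with an `entries` list whose items carry an `agent` field; that shape, the `reassign_entries` helper, and the sample values are assumptions for illustration, and a real model may need very different handling. Asking glu to deploy the result is deliberately left out.

```python
import copy
from typing import Any, Dict


def reassign_entries(
    model: Dict[str, Any], failed_agent: str, target_agent: str
) -> Dict[str, Any]:
    """Return a copy of the model with entries moved off the failed agent.

    The "entries"/"agent" layout is an assumed model shape.
    """
    # Work on a deep copy so the original model is left untouched.
    new_model = copy.deepcopy(model)
    for entry in new_model.get("entries", []):
        if entry.get("agent") == failed_agent:
            entry["agent"] = target_agent
    return new_model


# Hypothetical model for the N1/N2/N3 scenario in this thread.
model = {
    "fabric": "example-fabric",
    "entries": [
        {"agent": "N1", "mountPoint": "/s1"},
        {"agent": "N2", "mountPoint": "/s2"},
        {"agent": "N3", "mountPoint": "/s3"},
    ],
}
updated = reassign_entries(model, failed_agent="N2", target_agent="N1")
print([entry["agent"] for entry in updated["entries"]])  # ['N1', 'N1', 'N3']
```

Returning a new model instead of editing in place keeps the previous model available for comparison or rollback.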

Where I think glu could help would be in providing a hook to plug in some behavior that is triggered when errors are detected. As you mention you can listen to ZooKeeper yourself outside of glu but it seems that since glu is notified of errors, it might be easier to provide a hook directly in glu.
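One possible shape for such a hook, purely as a sketch: a registry of callbacks that are invoked when an error is detected. Nothing here is an existing glu API; `ErrorHookRegistry`, `register`, `notify`, and the `(agent, mount_point)` arguments are hypothetical names chosen for the example.

```python
from typing import Callable, List, Tuple

# A hook receives the agent name and the mount point of the failed entry.
ErrorHook = Callable[[str, str], None]


class ErrorHookRegistry:
    """Hypothetical registry of callbacks triggered on detected errors."""

    def __init__(self) -> None:
        self._hooks: List[ErrorHook] = []

    def register(self, hook: ErrorHook) -> None:
        """Add a callback to run on every detected error."""
        self._hooks.append(hook)

    def notify(self, agent: str, mount_point: str) -> None:
        """Invoke every registered callback, in registration order."""
        for hook in self._hooks:
            hook(agent, mount_point)


# Record each reported error; a real hook might rewrite the model instead.
events: List[Tuple[str, str]] = []
registry = ErrorHookRegistry()
registry.register(lambda agent, mount_point: events.append((agent, mount_point)))
registry.notify("N2", "/s2")
print(events)  # [('N2', '/s2')]
```

The recovery policy stays in user code, which matches the idea of glu as a platform that does nothing automatically on its own.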

I will think about it.

Yan

adamtulinius commented 9 years ago

You can't just start software up on another node blindly. Yes, Marathon will do that, but it obviously requires the software to be stateless (or at least use non-local storage for whatever it does).

With regard to Mesos, it might be interesting to implement a Mesos framework (or just an executor) for glu. :-)