moby / swarmkit

A toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more.
Apache License 2.0

Shutdown containers on isolated nodes #1743

Open datamattsson opened 7 years ago

datamattsson commented 7 years ago

I apologize in advance if this issue is misplaced or a duplicate, but I honestly could not find anything that relates to my description.

In the scenario where a manager or worker gets isolated from the swarm due to a network partition or similar, there is no way for the operator to configure the desired outcome of such a failure.

The observed behavior right now is that services continue to run on the isolated nodes. We need an option to shut down the containers running on the isolated node once a grace period has expired.

In my experiments it's quite trivial to corrupt Docker volumes that are served by a shared filesystem, as whatever workload was running on the isolated node gets rescheduled elsewhere in the Swarm.

There are of course ways to invent application-level locking and fencing to prevent corruption, but I would find it more convenient if isolated nodes that can't witness quorum in the Swarm simply relinquished their services.

In most scenarios it would be desirable to keep running the service on the isolated node if it's just an ephemeral service. Hence it would be desirable to have an option at service creation that specifies what to expect during isolation, keeping the default as-is.

Proposal: `--during-isolation` — in the event of a node losing visibility of the quorum, the service should (stop|continue) (default: continue).
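For illustration, a hypothetical invocation of the proposed flag could look like the following; `--during-isolation` does not exist today, and the service names and image are placeholders:

```bash
# Hypothetical syntax: --during-isolation is the proposed flag, not an existing one.
# A stateful service backed by a shared volume opts into being stopped when its
# node loses sight of the quorum.
docker service create \
  --name mydb \
  --mount type=volume,source=dbdata,target=/var/lib/db \
  --during-isolation stop \
  myorg/mydb:latest

# An ephemeral service keeps the proposed default and continues running while isolated.
docker service create --name web --during-isolation continue nginx:alpine
```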

aluzzardi commented 7 years ago

/cc @dongluochen @aaronlehmann

Hey @drajen, sorry for the slow response but we mostly check issues on docker/docker rather than here.

We'll look into it!

aluzzardi commented 7 years ago

The first thing that pops to mind is that it's tough to know whether we're in a partition or the manager is simply down.

If the latter, with this proposal we'd kill all tasks in the cluster and affect perfectly good services.

Regarding volumes, I would have hoped that this would be the storage driver's responsibility. For instance, when using EBS, if the volume is attached to a new container it is forcefully detached from the other one if that one is still running.

Also, even with this solution there will be race conditions. It might take more time for the worker than for the manager to realize that we're in a partition. Until the worker converges and shuts down its tasks, there will be two of them running at the same time anyway, leading to the corruption you mentioned.

We'll tinker a bit more with the idea

dongluochen commented 7 years ago

I agree with @aluzzardi that we need to consider different scenarios. As the cluster manager, Swarm can provide feedback on node reachability, but the logic to deal with the failure may need to be decided by system administrators, not Swarm.

A network partition can affect a large part of your infrastructure, and managers can be unresponsive due to an upgrade, a bug, or an attack. If automatic evacuation is implemented, it may cause large-scale service failure.

It's better to have centralized control over infrastructure failure. When a node is unreachable, the Swarm manager detects it and sends an event to an external controller. The external controller may issue a repair on the failing machine, such as a system reboot. If the external controller sees a large number of machine failures, it should trigger alerts to the infrastructure team.
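As a rough sketch of that pattern (outside of Swarm itself), an external controller could poll node state from a manager and hand "down" nodes to whatever repair or alerting machinery the site uses; the polling interval and repair hook below are placeholders, and `docker events --filter type=node` could be used instead of polling:

```bash
# Sketch of an external controller loop, run against a Swarm manager.
while true; do
  docker node ls --format '{{.Hostname}} {{.Status}}' | while read -r host status; do
    if [ "$status" = "Down" ]; then
      echo "node $host reported Down; alerting / scheduling repair"
      # site-specific repair hook goes here, e.g. a reboot via IPMI (assumption)
    fi
  done
  sleep 10
done
```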

datamattsson commented 7 years ago

Thank you for looking into this. With the current behavior and the volume driver I'm working with, we fence the isolated node off from the volume when a manager reschedules the container and mounts the volume elsewhere. It works. My desire is that we could be more graceful about it: the isolated node should understand that it is isolated from the quorum and shut down its containers cleanly. That way the surviving managers could reschedule the container and mount the volume with a clean filesystem.

I've also done some more testing in my environment, and Swarm is very quick to judge: it declares nodes dead in around 5 seconds and immediately tries to fulfill the service's replica count. It would be ideal if this heartbeat could have an extended grace period set per service.

If possible at all, I want to propose these changes as options so as not to break the current behavior. I genuinely think that services with volumes mounted should be treated differently, and it should be up to the third-party storage vendor to recommend what these grace periods and isolation behaviors should look like.

dongluochen commented 7 years ago

> It would be ideal if this heartbeat could have an extended grace period set per service.

The heartbeat is between the worker node and a manager and is not related to a service. Do you want to add a per-service configurable delay to the rescheduling of the task?

datamattsson commented 7 years ago

> The heartbeat is between the worker node and a manager and is not related to a service. Do you want to add a per-service configurable delay to the rescheduling of the task?

I think there are separate needs for stateless containers and containers that carry persistent data in volumes; therefore, having a grace period per service could be one solution. Another way to do it is to delay the mount: if the volume driver is made aware that an unclean mount is in progress, it could simply sit for a grace period before attempting the mount. Then it would be up to the volume driver to determine its grace period, if one is needed, provided it's intelligent enough to understand it's taking over a dirty mount. I can solve that on my end. I can't solve shutting down services cleanly on an isolated node, though.

dongluochen commented 7 years ago

> Another way to do it is to delay the mount: if the volume driver is made aware that an unclean mount is in progress, it could simply sit for a grace period before attempting the mount.

I think this is best practice: _"If the same volumename is requested more than once, the plugin may need to keep track of each new mount request and provision at the first mount request and deprovision at the last corresponding unmount request."_

> I can't solve shutting down services cleanly on an isolated node though.

This PR, https://github.com/docker/swarmkit/pull/1758, tries to remove the tasks on the node at swarm leave so that it can properly release networks and volumes.

stevvooe commented 7 years ago

@drajen Could you clarify whether this is a test of partitioning or a planned decommissioning? If this is a planned decommissioning, you could drain the node before switching over or scale the service to 0. If this is a failure test, the behavior is more complicated.
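For reference, the planned-maintenance path can be expressed with existing commands; the node and service names below are placeholders:

```bash
# Planned decommissioning: drain the node so its tasks are rescheduled cleanly.
docker node update --availability drain node-3

# Once the node is down or has left the swarm, remove it.
docker node rm node-3

# Alternatively, stop a single service entirely before maintenance.
docker service scale mydb=0
```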

It sounds like many of the problems here are arising from broken volume drivers. When the second mount is issued, the volume driver should either reject the mount or steal the mount away from the other party, if having multiple parties write to the same disk would result in corruption.

If the volume driver were working correctly, the service would run on the isolated node until a forceful unmount, which would likely cause the process to fail. When the volume is mounted on the new task, this may cost a small recovery period, but the new process should proceed just fine.

When it comes down to it, we can't actually guarantee a shutdown of services when there is a partition without sacrificing availability of a service. Even if consistency is favored here, there is no guarantee that the node will detect the partition and shut down the service properly. There may be ways to make this less impactful, but this is a property of the problem. Tying service availability to quorum availability is a dangerous game and may result in awful cascade failures.

That said, a field on the task, configured per service, to control whether the task should be killed on extended session loss isn't unrealistic. The only problem is that people may use it without understanding that it ties service failure to quorum failure.

datamattsson commented 7 years ago

Thank you @stevvooe for your input and suggestions. To answer your first question, this ask is targeted at unplanned failures, and we're aware of the drain feature for planned node maintenance.

The volume driver I'm working with is working correctly. If a node becomes isolated, the service will be restarted by the swarm, and the last node to mount the volume will fence the volume from the origin node. There are means to panic the isolated node on hung I/O to get it out of the forever-hung state and have it become healthy again once the network split has been remedied.

What I'm looking for is a means to be more graceful about the whole ordeal, as right now Swarm is very aggressive about restarting services (in less than 10 seconds), with an abrupt loss of I/O to the isolated node. I might be chasing a unicorn here, as many storage HA solutions are designed just like how Swarm operates, and maybe that's what we should accept.

stevvooe commented 7 years ago

@drajen Thanks for bringing me up to speed! Good to hear the volume driver is working correctly.

Indeed, this is a tough problem. For this kind of I/O switchover, the volume subsystem is likely in the best position to observe the switchover and act on failing the rogue process (by ruthlessly unmounting the volume). The model here almost treats the nodes as adversarial, which requires the ability to revoke resources without the node's compliance.

I wonder if this would be better addressed by tuning the node heartbeat timeout for a network partition. That would make it less sensitive to momentary blips that aren't really a full quorum loss.
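One existing knob along those lines is the dispatcher heartbeat period, which controls how often agents report in to the managers; raising it trades slower failure detection for tolerance of short blips. It is cluster-wide rather than per-service, so it only approximates the grace period being asked for:

```bash
# Raise the worker-to-manager heartbeat period from the default of 5s.
# Cluster-wide setting; there is no per-service equivalent today.
docker swarm update --dispatcher-heartbeat 30s
```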

Do you have a full description of the sequence of events? I feel like we are speaking very abstractly.

datamattsson commented 7 years ago

@stevvooe We can be very specific :-)

In my setup, currently, the swarm communicates on one of the interfaces I run iSCSI traffic on. I have two of those iSCSI interfaces, and redundancy for the storage is provided by multipathd. If I ifdown the swarm interface on one of my nodes, the workload it was serving gets restarted in less than ten seconds on another node in the swarm. The origin node is fenced and the service will hang forever waiting for I/O that will never happen; even if the network comes back, the initiator will be denied access to the iSCSI target.
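Concretely, the experiment looks roughly like this from the command line; the interface and service names are assumptions for illustration:

```bash
# On the node under test: take down the interface Swarm communicates on.
sudo ifdown eth2            # eth2 assumed to be the swarm-facing interface

# On a surviving manager: watch how quickly the node flips to Down and the
# task is rescheduled onto another node (observed here in under 10 seconds).
watch -n 1 'docker node ls; echo; docker service ps mydb'
```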

As background, the environments I'm designing for are enterprise-type on-prem solutions. That said, the recommendation I'm leaning towards providing is that the network Swarm communicates on needs to be redundant to help mitigate false positives; I would categorize a momentary loss of a network link due to a switch reboot or similar as a false positive. In these setups, the storage network will in 99% of cases be redundant, not necessarily at L2/L3 with MLAG/vPC or similar, but provided via multipathd across block devices seen on two separate subnets.

Now, what I want to get to, though it is unrelated to this issue, is how to use the Linux bonding driver to borrow virtual interfaces (veth) on the iSCSI interfaces to create the same level of redundancy that iSCSI has. Once I get there, I'm hesitant to believe that a failure of a physical network interface will fail over fast and gracefully enough for the swarm not to fail the node having the intermittent link issue. Hence, having a parameter to tune the grace period would be helpful in these experiments to dial in a sane value; I would feel much safer off the bat if that were in the 30-60 second range rather than the current less than 10 seconds.

Feature creep: it would be great if you could specify multiple IP interfaces for the swarm to communicate on. I wouldn't need to twiddle with bonded virtual interfaces, and bi-directional communication on multiple subnets would be a much more visible reliability factor.

stevvooe commented 7 years ago

Thanks for the details!

Out of curiosity, what is the RTT from the iSCSI initiator to the target?

> Hence, having a parameter to tune the grace period would be helpful in these experiments to dial in a sane value; I would feel much safer off the bat if that were in the 30-60 second range rather than the current less than 10 seconds.

This sounds like the happiest path. Without investigating further, it really sounds like a tuning problem.

We could probably do something more extensive, but it would require more thought.

AkihiroSuda commented 7 years ago

Linking https://github.com/docker/swarmkit/pull/1741 to this issue. There was some discussion there about general network partitions, but not about volumes.

> I wonder if this would be better addressed by tuning the node heartbeat timeout for a network partition.

:+1: for the volume-related issue discussed in #1743.

However, for issues that are not about volumes, @drajen's proposal for --during-isolation still seems attractive and needs more thought. (I wonder whether it should be implemented as a new --restart-condition value, e.g. --restart-condition on-node-heal-blah-blah-blah?)
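For context, `--restart-condition` already exists on `docker service create`/`docker service update` but only accepts `none`, `on-failure`, and `any`; the second command below shows the hypothetical extension being floated here, not something that is implemented:

```bash
# Existing flag and values.
docker service create --name web --restart-condition on-failure nginx:alpine

# Hypothetical new condition tied to node isolation (not implemented):
# docker service create --name mydb --restart-condition on-node-heal-blah-blah-blah myorg/mydb
```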

datamattsson commented 7 years ago

> Out of curiosity, what is the RTT from the iSCSI initiator to the target?

This is normally in the low hundreds of microseconds on 10GbE networks during normal operation.

I see three possible enhancements moving forward, listed in order of most desired and, from my perspective, easiest to implement:

  1. A tunable timeout for a service before declaring it dead and fulfilling the desired replica count. This would help with rolling upgrades of redundant switch infrastructure and wouldn't disrupt services with external volumes. It could be a global setting, but I'd prefer it to be a non-default behavior that you could tune with a service label or such (see the sketch after this list).
  2. The --restart-condition flag (which could be a service label as well) that determines what happens to a service when a node detects loss of quorum to the Swarm.
  3. A list of listen addresses that Swarm nodes may use to communicate with each other, to provide redundancy for the Raft protocol. This would avoid unnecessarily complex network schemes and allow lowering (or keeping the default) timeout described in item 1.
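As a sketch of the label-based variant from item 1: a free-form service label is valid syntax today, but nothing in Swarm currently interprets it, so the key name and its semantics below are purely hypothetical and would need a scheduler change or an external controller to act on them:

```bash
# Valid command today; the label only has meaning if some (hypothetical)
# component reads it and delays rescheduling accordingly.
docker service create \
  --name mydb \
  --label com.example.reschedule-grace-period=60s \
  --mount type=volume,source=dbdata,target=/var/lib/db \
  myorg/mydb:latest
```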