thecodeteam / mesos-module-dvdi

Mesos Docker Volume Driver Isolator module
Apache License 2.0

Isolator recovery is problematic with legacy mounts. #95

Open jieyu opened 8 years ago

jieyu commented 8 years ago

The isolator determines the 'legacy mounts' (mounts that no active container is using) during slave recovery by looking only at the checkpointed state, and it unmounts those legacy mounts immediately. However, known orphans might not have been killed yet (known orphans are containers that are still known by the launcher). The Mesos agent will do an async cleanup on those known orphan containers.

Unmounting the volumes while those orphan containers are still using them might be problematic. The correct way is to wait for 'cleanup' to be called for those known orphan containers (i.e., still create an Info struct for each orphan container, and do the proper cleanup in the 'cleanup' function).
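A minimal sketch of that idea, assuming the isolator recover() interface of that Mesos era; Info/infos and the helpers recoverInfoFromCheckpoint(), checkpointedMounts(), ownedByKnownContainer(), and unmount() are illustrative names, not the module's actual code:

```cpp
#include <list>

#include <mesos/slave/isolator.hpp>

#include <process/future.hpp>

#include <stout/foreach.hpp>
#include <stout/hashset.hpp>
#include <stout/nothing.hpp>

// Sketch only: defer unmounting for known orphans to cleanup().
Future<Nothing> DockerVolumeDriverIsolator::recover(
    const std::list<ContainerState>& states,
    const hashset<ContainerID>& orphans)
{
  // Containers the agent still knows about from checkpointed state.
  foreach (const ContainerState& state, states) {
    infos.put(state.container_id(),
              recoverInfoFromCheckpoint(state.container_id()));
  }

  // Known orphans: the launcher still sees them, and the agent will destroy
  // them asynchronously. Keep an Info for each one so their mounts are
  // released later in cleanup() instead of being unmounted here.
  foreach (const ContainerID& containerId, orphans) {
    if (!infos.contains(containerId)) {
      infos.put(containerId, recoverInfoFromCheckpoint(containerId));
    }
  }

  // Only mounts owned by neither recovered states nor known orphans are
  // true legacy mounts; those can be unmounted immediately.
  foreach (const ExternalMount& mount, checkpointedMounts()) {
    if (!ownedByKnownContainer(mount)) {
      unmount(mount);
    }
  }

  return Nothing();
}
```

The key difference from the current behavior is that a checkpointed mount belonging to a known orphan no longer counts as 'legacy', so nothing is unmounted out from under a container the launcher has not destroyed yet.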

cantbewong commented 8 years ago

When this was developed, I did a test:

  1. start a task with an external mount
  2. Kill the Mesos slave process to simulate a crash.
  3. The task terminates
  4. Mesos slave service is restarted

I did not observe a cleanup call related to the task, resulting in a mount leak.

What are the guarantees in both a slave service crash and a full cluster node restart? Should cleanup always be called for every container that was running before the crash or restart?

The current implementation is based on the assumption that if the container is still running, it will be in the orphans list on the recover call. If it is on the orphans list, unmount is NOT called. Are you saying that there is a scenario where a container can remain running and not be in the orphans list?

jieyu commented 8 years ago

A container is called an orphan if it's not in 'ContainerState' but the launcher detects it (e.g., through cgroups). This is common when the operator wants to do an upgrade that requires killing all existing containers. The operator usually does this by stopping the agent, wiping the metadata directory, and restarting the slave. In that case, 'inUseMounts' does not contain the mounts for that orphan container. Since you checkpointed it, it'll be considered part of 'legacyMounts', and umount will be called on it even though the launcher hasn't cleaned it up yet.

cantbewong commented 8 years ago

OK, so I understand you are saying the mount will be present, and ContainerState will NOT contain a record for the container.

The current recover implementation is:

  - the mount will be in legacyMounts
  - the mount will not be in inUseMounts

Unmount is called for any mount that is in legacyMounts but not in inUseMounts. There is no other Unmount processing.
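In other words (an illustrative fragment; the container names come from the discussion above, the types and unmount() helper are assumed, and the headers match the earlier sketch):

```cpp
// Current behavior as described above: any checkpointed mount that is not
// referenced by a recovered ContainerState is unmounted right away.
foreach (const ExternalMount& mount, legacyMounts) {
  if (!inUseMounts.contains(mount)) {
    // This also hits mounts still used by known orphan containers,
    // because their ContainerState was wiped along with the metadata.
    unmount(mount);
  }
}
```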

So the change you propose would result in no unmount calls in recover() under any circumstances.

This might be OK for the scenario you outlined of stopping the agent, wiping the metadata directory and restarting, assuming cleanup will always be called for every container not running at restart.

But what about other scenarios:

  1. Kill -9 (simulate a crash) the agent and restart. No metadata directory wipe.
     1a. Container still running when the agent restarts - should be OK; I expect the container will be on the ContainerState list and no unmount will be called, which is what you want.
     1b. Container not still running when the agent restarts - seems like you need an unmount, unless the agent will do a cleanup() call for the container. Will this happen?

jieyu commented 8 years ago

For the 1b case, the container will still be in the ContainerState list because the agent believes that it's still running. The containerizer will reap the checkpointed pid and find out that the container has terminated. It'll subsequently call 'destroy', resulting in 'cleanup' being called. https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L573
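Since cleanup() can be relied on for that container, the deferred unmount under the change proposed at the top of the issue would land there, roughly like this (a sketch with assumed names and types, mirroring the recover() sketch above):

```cpp
// Sketch: release the container's mounts once the containerizer destroys it.
Future<Nothing> DockerVolumeDriverIsolator::cleanup(
    const ContainerID& containerId)
{
  if (!infos.contains(containerId)) {
    // Container unknown to this isolator: nothing to unmount.
    return Nothing();
  }

  const Owned<Info>& info = infos.at(containerId);

  // Unmount every external volume that was recovered for this container,
  // whether it came from ContainerState or from the known-orphans list.
  foreach (const ExternalMount& mount, info->mounts) {
    unmount(mount);
  }

  infos.erase(containerId);
  return Nothing();
}
```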