jieyu opened this issue 8 years ago
When this was developed, I ran a test:
I did not observe a cleanup call related to the task, resulting in a mount leak.
What are the guarantees in both a slave service crash and a full node restart? Should cleanup always be called for every container that was running before the crash or restart?
The current implementation is based on the assumption that if the container is still running, it will be in the orphans list on the recover call. If it is on the orphans list, unmount is NOT called. Are you saying that there is a scenario where a container can remain running and not be in the orphans list?
A container is called an orphan if it's not in 'ContainerState' but the launcher detects it (e.g., through cgroups). This is common when the operator wants to do an upgrade that requires killing all existing containers. The operator usually does this by stopping the agent, wiping the metadata directory, and restarting the slave. In that case, 'inUseMounts' does not contain the mounts for that orphan container. Since you checkpointed it, it'll be considered a legacy mount, and unmount will be called on it even though the launcher hasn't cleaned it up yet.
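To make the orphan definition concrete, here is a minimal sketch (the names `computeOrphans`, `checkpointed`, and `detected` are hypothetical, not the actual launcher code):

```cpp
// Minimal sketch: an orphan is a container the launcher detects
// (e.g., via cgroups) that has no entry in the recovered ContainerState.
#include <set>
#include <string>

using ContainerID = std::string;  // stand-in for the real type

std::set<ContainerID> computeOrphans(
    const std::set<ContainerID>& checkpointed,  // from ContainerState
    const std::set<ContainerID>& detected)      // e.g., from cgroups
{
  std::set<ContainerID> orphans;
  for (const ContainerID& id : detected) {
    // Known to the launcher but not to the agent (e.g., the metadata
    // directory was wiped before restart) => orphan.
    if (checkpointed.count(id) == 0) {
      orphans.insert(id);
    }
  }
  return orphans;
}
```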
OK, so I understand you are saying the mount will be present, and ContainerState will NOT contain a record for the container.
The current recover implementation is:
- the mount will be in legacyMounts
- the mount will not be in inUseMounts
Unmount is called for any mount that is in legacyMounts but not in inUseMounts. There is no other unmount processing.
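For reference, a minimal sketch of that behavior as described (all names here are illustrative, not the isolator's actual API):

```cpp
// Sketch of the current recover() behavior: any checkpointed mount not
// claimed by a container in ContainerState lands in legacyMounts, and
// unmount is attempted on everything in legacyMounts but not inUseMounts.
#include <map>
#include <set>
#include <string>
#include <vector>

using ContainerID = std::string;
using MountPoint = std::string;

void recoverMounts(
    const std::map<ContainerID, MountPoint>& checkpointedMounts,
    const std::set<ContainerID>& containerStates)
{
  std::set<MountPoint> inUseMounts;
  std::vector<MountPoint> legacyMounts;

  for (const auto& [id, mount] : checkpointedMounts) {
    if (containerStates.count(id) > 0) {
      inUseMounts.insert(mount);      // container known to the agent
    } else {
      legacyMounts.push_back(mount);  // note: includes known orphans' mounts
    }
  }

  for (const MountPoint& mount : legacyMounts) {
    if (inUseMounts.count(mount) == 0) {
      // unmount(mount);  // problematic if an orphan still uses the mount
    }
  }
}
```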
So the change you propose would result in no unmount calls in recover() under any circumstances.
This might be OK for the scenario you outlined of stopping the agent, wiping the metadata directory, and restarting, assuming cleanup will always be called for every container that is not running at restart.
But what about other scenarios:
For the 1b case, the container will still be in the ContainerState list because the agent believes that it's still running. The containerizer will reap the checkpointed pid and find out that the container is terminated. It'll subsequently call 'destroy', resulting in 'cleanup' being called. https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L573
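Roughly, that flow looks like this (a simplified sketch only; after a restart the container is not a child of the agent, so this polls with kill(pid, 0), whereas the real code uses libprocess's reaper and futures, and `destroyContainer` is a hypothetical stand-in for destroy()):

```cpp
// Simplified sketch of reaping the checkpointed pid after agent restart.
#include <sys/types.h>
#include <signal.h>
#include <cerrno>
#include <chrono>
#include <iostream>
#include <thread>

void destroyContainer(pid_t pid) {
  // In the real code path, destroy() results in each isolator's
  // cleanup() being invoked; that is where unmounting belongs.
  std::cout << "destroy -> cleanup for pid " << pid << std::endl;
}

void reapCheckpointedPid(pid_t pid) {
  while (true) {
    if (kill(pid, 0) == -1 && errno == ESRCH) {
      // The checkpointed pid is gone: the container terminated while
      // the agent was down, so trigger destroy/cleanup.
      destroyContainer(pid);
      return;
    }
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
}
```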
The isolator determines the 'legacy mounts' (mounts that no active container is using) during slave recovery by looking only at the checkpointed state, and it will unmount those legacy mounts immediately. However, known orphans might not have been killed yet (known orphans are those containers that are known to the launcher). The Mesos agent will do an async cleanup on those known orphan containers.
Unmounting the volumes while those orphan containers are still using them might be problematic. The correct way is to wait for 'cleanup' to be called for those known orphan containers (i.e., still create an Info struct for those orphan containers, and do the proper unmount in the 'cleanup' function).
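Something like the following sketch (types and names are illustrative, assuming a map of checkpointed mounts; not the actual isolator code):

```cpp
// Sketch of the proposed fix: recover() tracks known orphans in an Info
// struct instead of unmounting immediately; unmount happens in cleanup(),
// which the agent triggers when it asynchronously destroys the orphan.
#include <map>
#include <memory>
#include <set>
#include <string>

using ContainerID = std::string;

struct Info {
  std::string mountPoint;
};

class VolumeIsolator {
public:
  void recover(
      const std::map<ContainerID, std::string>& checkpointedMounts,
      const std::set<ContainerID>& containerStates,
      const std::set<ContainerID>& knownOrphans)
  {
    for (const auto& [id, mount] : checkpointedMounts) {
      if (containerStates.count(id) > 0 || knownOrphans.count(id) > 0) {
        // Track the mount; the agent destroys known orphans
        // asynchronously, which will call cleanup() below.
        infos[id] = std::make_unique<Info>(Info{mount});
      }
      // A mount belonging to neither set is a true legacy mount and
      // could still be unmounted here.
    }
  }

  void cleanup(const ContainerID& id) {
    auto it = infos.find(id);
    if (it != infos.end()) {
      // unmount(it->second->mountPoint);  // safe: the container is gone
      infos.erase(it);
    }
  }

private:
  std::map<ContainerID, std::unique_ptr<Info>> infos;
};
```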