Your statement that checkpointing occurs after the mount is correct.
The order is: mount first, then checkpoint.
Note that the checkpoint record includes the mount location, which is NOT deterministic and can only be recorded after the mount.
The mount point is there to handle cases where multiple containers on the agent request the identical volume. When this happens, the isolator needs to look up the OS mount point for the bind mount.
The design accepted the risk of a mount leak on the premise that a pre-emptive mount would remedy it under the rare circumstance of a slave service crash.
If we go with a checkpoint-first model, the checkpoint would need to be a two-phase approach: first record only provider + volume name as a "tentative mount", then record the full details. The "tentative mount" info would be enough to do an unmount during recover().
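Something like the following minimal sketch could express the two phases; the file format, field names, and function names here are illustrative assumptions, not the module's actual checkpoint format.

```cpp
// Hypothetical two-phase ("tentative mount") checkpoint sketch; the file
// layout and names are assumptions for illustration only.
#include <fstream>
#include <string>

// Phase 1: before calling the provider, record just provider + volume name.
// This is enough for recover() to issue an unmount if the slave crashes
// after the mount succeeds but before the full checkpoint is written.
void checkpointTentative(const std::string& path,
                         const std::string& provider,
                         const std::string& volume) {
  std::ofstream out(path, std::ios::trunc);
  out << "state: TENTATIVE\n"
      << "provider: " << provider << "\n"
      << "volume: " << volume << "\n";
}

// Phase 2: after the mount succeeds, overwrite the record with full details,
// including the non-deterministic mount point reported by the provider.
void checkpointMounted(const std::string& path,
                       const std::string& provider,
                       const std::string& volume,
                       const std::string& mountpoint,
                       const std::string& options) {
  std::ofstream out(path, std::ios::trunc);
  out << "state: MOUNTED\n"
      << "provider: " << provider << "\n"
      << "volume: " << volume << "\n"
      << "mountpoint: " << mountpoint << "\n"
      << "options: " << options << "\n";
}
```

During recover(), a record still in the TENTATIVE state has no mount point, but provider + volume is enough to ask the provider to unmount, which closes the leak window.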
Do you need to checkpoint the mount point location? Or do you just need provider + volume?
In other words, if your recovery just needs provider + volume, why are you checkpointing the mount point (and other information)?
The bind mount needs the mount point location. Provider + volume is usually a unique identifier, but it is not sufficient by itself to make a bind call. It might be possible to query the provider, but that takes time and may yield no response, adding the complexity that you would want the query to be asynchronous. And some providers, such as Amazon, allow duplicate volume names, meaning that the query is only usually reliable as opposed to 100% reliable.
Other information (mount options) was checkpointed for troubleshooting and for the possibility of generating a warning should two containers request a mount of the same volume with different options. This warning generation is not implemented at this time.
To make it clear: when a second container arrives requesting a mount of a volume that is already mounted and in use by another container, we need a way to determine the mount point location returned by the mount performed for the first container, roughly as sketched below.
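As a rough illustration of that lookup (the names and structures are assumptions, not the isolator's actual code), the second container's path can be satisfied with a bind mount of the mount point recorded for the first container:

```cpp
// Minimal sketch of reusing the OS mount point recorded for the first
// container when a second container asks for the same volume: look it up by
// provider + volume, then bind mount it into the new container's path
// instead of calling the provider again.
#include <sys/mount.h>   // mount(2), MS_BIND (Linux)
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// provider + volume -> OS mount point returned by the first successful mount.
std::map<std::pair<std::string, std::string>, std::string> activeMounts;

void attachVolume(const std::string& provider,
                  const std::string& volume,
                  const std::string& containerPath) {
  auto it = activeMounts.find({provider, volume});
  if (it == activeMounts.end()) {
    throw std::runtime_error("volume not mounted yet; call the provider first");
  }

  // Bind mount the existing mount point into the second container's path.
  if (::mount(it->second.c_str(), containerPath.c_str(),
              nullptr, MS_BIND, nullptr) != 0) {
    throw std::runtime_error("bind mount failed");
  }
}
```

This is why the checkpointed record has to carry the mount point: without it, the isolator would have to go back to the provider (or scan the host mount table) to find where the first mount landed.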
Just to clarify, duplicate volume names sound weird to me. What are the semantics? Say you mount the same volume name (same driver) twice: will there be two mounts in the host mount table? What happens if I do a subsequent unmount? Will that unmount both, or just one of them? If just one of them, which one will be unmounted?
Now I see that you need to checkpoint the mount points. Thanks for the explanation.
I looked at the docker volume specification: https://docs.docker.com/engine/extend/plugins_volume/
/VolumeDriver.Mount "Docker requires the plugin to provide a volume, given a user specified volume name. This is called once per container start. If the same volume_name is requested more than once, the plugin may need to keep track of each new mount request and provision at the first mount request and deprovision at the last corresponding unmount request."
It looks to me that the ref counting should be handled by the plugin (e.g., rexray), rather than the container runtime. The container runtime will simply call 'Mount' during container startup and 'Unmount' during container teardown. Given that, I am wondering why we are doing mount tracking in the isolator?
The Docker engine does not keep track of state or mounted volume/container counts today. But this isn't to say this is ideal.
For example, what if the Docker engine crashes, and the containers along with it? The plugin would have zero knowledge of this occurring other than in the form of more requests and a higher tracking count. This will lead to the volume never getting unmounted.
The container runtime, in our view, should be the sole authority on when to mount and unmount volumes from the hosts. It has all of the necessary knowledge to determine when that operation is proper, because it knows the state of all containers.
So we didn't necessarily do anything different with the spec. We implemented the volume package but changed how the Mesos runtime does the accounting, so that it is more sophisticated and tracked directly against containers (see the sketch below).
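A rough sketch of what accounting "tracked directly with containers" can look like (the names and structures here are illustrative assumptions, not the module's actual implementation):

```cpp
// Illustrative sketch of accounting keyed to containers rather than a bare
// reference count inside the plugin: the runtime knows which containers
// exist, so it can decide when the last user of a volume is gone.
#include <map>
#include <set>
#include <string>

// volume name -> IDs of containers currently using it.
std::map<std::string, std::set<std::string>> usersOfVolume;

// Called on container start; returns true if a provider Mount is needed.
bool trackMount(const std::string& volume, const std::string& containerId) {
  auto& users = usersOfVolume[volume];
  bool first = users.empty();
  users.insert(containerId);
  return first;  // only the first user triggers an actual provider Mount
}

// Called on container teardown; returns true if a provider Unmount is needed.
bool trackUnmount(const std::string& volume, const std::string& containerId) {
  auto& users = usersOfVolume[volume];
  users.erase(containerId);
  return users.empty();  // last user gone -> safe to Unmount at the provider
}
```

Because the state is keyed by container IDs rather than an opaque counter inside the plugin, the runtime can rebuild it from checkpointed container state after a crash, whereas a plugin-side counter has no way to notice that its callers disappeared.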
@clintonskitson that makes sense. Thanks for the explanation.
Currently, the isolator checkpoints the mount AFTER doing the actual mount. The consequence is that if the slave crashes after doing the mount but before checkpointing, it loses track of that mount during recovery. One obvious issue is mount leaking; there might be other issues. The ordering is sketched below.
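In pseudocode (stubbed, assumed function names; not the actual module), the current ordering and the crash window look like this:

```cpp
// Sketch of the current mount-then-checkpoint ordering, showing the window
// in which a crash leaks the mount because nothing was checkpointed yet.
#include <iostream>
#include <string>

// Stubs standing in for the real provider call and checkpoint write;
// the path below is hypothetical.
std::string providerMount(const std::string& volume) {
  return "/var/lib/dvdi/" + volume;
}
void checkpoint(const std::string& volume, const std::string& mountpoint) {
  std::cout << "checkpoint " << volume << " -> " << mountpoint << "\n";
}

void prepare(const std::string& volume) {
  std::string mountpoint = providerMount(volume);  // 1. mount happens first

  // <-- if the slave crashes here, the mount exists on the host but no
  //     checkpoint records it, so recover() cannot find or clean it up.

  checkpoint(volume, mountpoint);                  // 2. checkpoint second
}
```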