vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0

Enabling volume recovery for in use volumes. #2044

Open govint opened 6 years ago

govint commented 6 years ago

Once the plugin has attached a volume (at the request of Docker/K8s), it plays no further role until that volume is unmounted (again at the request of Docker/K8s). In the meantime, it's entirely possible that, while the container (read: user app) is using the volume, the volume becomes unavailable for some reason (for example, the file server disconnects or exits).

There is no defined approach today for how an application that has lost access to its data can handle such an event. The application in the container has no recourse except to exit. What's worse, the application may be logging to a persistent volume that itself becomes inaccessible, preventing the app from logging even the occurrence of the event. The user/developer then has to figure out what happened later, once the volume is accessible again.

From a storage standpoint, the volume plugin can do a bit to restore access to the volume, to the extent possible. I propose the changes below:

  1. At a fixed interval, monitor all volumes that have a non-zero refcount (meaning they are mounted) and check that each volume is accessible at the location where it's mounted on the container host.
  2. For example, test the accessibility of a volume by reading a file in the root folder of the volume that is created when the volume is created (say, _/.vdvs_isavailable), or by attempting to read the root dir of the volume and verifying it's accessible.
  3. For each volume that's inaccessible, issue a probe (new) request to the volume server (on ESX) to handle the issue with the volume.
  4. The volume server could, for example, run its diagnostics; either it resolves the issue or the volume remains inaccessible.
  5. In the case of a volume hosted on a file server, the plugin may itself internally unmount and remount the volume, for example. Again, the volume may be restored or may remain inaccessible.
  6. Update the Get() call to perform the accessibility check (on attached volumes, inline in Get()) and report whether the volume is online. (Note: an orchestrator should ideally invoke Get(), or a new API on the plugin, to find out whether the volume is accessible. If it isn't, the orchestrator can take action and re-deploy the container instance to another node after verifying the volume is accessible there.)

The above covers what the plugin can do to at least try to restore access to the volume so that the application can continue running.

govint commented 6 years ago

Will post the change to provide volume status in the Get() response. It's impossible for the plugin to trigger migrating a container off a host if the volume isn't accessible. CSI doesn't support it either (it could after 1.0). The container orchestrator (Docker) should include volume online/offline status in the default volume Get() response and be able to handle access issues with the volume.