openvstorage / framework

The Framework is a set of components and tools that gives the user an interface (GUI / API) to set up, extend and manage an Open vStorage platform.

Update node distance map with a volume potential factor #2094

Open JeffreyDevloo opened 6 years ago

JeffreyDevloo commented 6 years ago

Enhancement description

HA is currently done by the edge client, which checks the node distance map. When a volumedriver instance can no longer hold any more volumes, it should not be an eligible HA destination.
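As an illustration of the requested behaviour, here is a minimal Python sketch that filters HA candidates by volume potential before ordering them by distance. The function name `eligible_ha_targets` and the shape of both dictionaries (node ID to distance, node ID to remaining volume potential) are assumptions made for this example; they are not taken from the framework or the edge client.

```python
def eligible_ha_targets(distance_map, volume_potential):
    """Return node IDs ordered by distance, skipping nodes that cannot host more volumes.

    Both arguments are hypothetical structures for this sketch:
      distance_map:     {node_id: distance}, lower means closer
      volume_potential: {node_id: number of extra volumes the node can still host}
    """
    # A node with unknown potential is treated as full (assumption: fail safe).
    eligible = {
        node: distance
        for node, distance in distance_map.items()
        if volume_potential.get(node, 0) > 0
    }
    # Sort by distance, then by node ID so the order is deterministic.
    return sorted(eligible, key=lambda node: (eligible[node], node))


distance_map = {"node_a": 0, "node_b": 10, "node_c": 20}
volume_potential = {"node_a": 5, "node_b": 0, "node_c": 3}
targets = eligible_ha_targets(distance_map, volume_potential)
```

With these inputs `node_b` is dropped because its potential is zero, so the edge would only consider `node_a` and `node_c`.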

wimpers commented 6 years ago

@redlicha does the voldrv still accept the new volume or does it fail straight away?

redlicha commented 6 years ago

A "filled up" voldrv will become the new owner (if instructed by HA / the API), but actually starting the volume will fail due to resource shortage (most likely candidate: SCO cache cannot guarantee progress b/c there's not enough room to cover a TLog worth of SCOs).

wimpers commented 6 years ago

@redlicha any reason why we do not block the transfer of the ownership so the edge knows it selected the wrong one? I assume this has to do with the fact that you can move a volume but not start it, and only started volumes are taken into account for volume potential? Maybe we can block the ownership move (leaving a force method)?

redlicha commented 6 years ago

No, it's not about non-running volumes but the underlying assumption so far that HA will not be attempted to nodes that can't host more volumes based on the node distance map.

Ad hoc, I think it's possible to add an extra check of the local volume potential to voldrv before transferring (grabbing) the ownership. NB: the check is expensive in that it has to fetch the volume config from the backend. This should not impact the "happy" failover path where there are sufficient resources, as that can be reworked accordingly; it would only impact the case where the voldrv cannot host more volumes.
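The check discussed here, combined with the force override suggested earlier in the thread, could look roughly like the Python sketch below. It is only an illustration of the control flow: `may_take_ownership`, `fetch_required_bytes` and `free_cache_bytes` are hypothetical names, and the real volumedriver would base the decision on its own resource accounting rather than a single byte count.

```python
def may_take_ownership(volume_id, fetch_required_bytes, free_cache_bytes, force=False):
    """Decide whether this node should grab ownership of a volume.

    Hypothetical parameters for this sketch:
      fetch_required_bytes: callable that fetches the volume config from the
                            backend and returns the space the volume needs
                            (this is the expensive step)
      free_cache_bytes:     space currently available on this node
      force:                skip the check, e.g. for an explicit API request
    """
    if force:
        # Forced moves bypass the check and avoid the backend fetch entirely.
        return True
    required = fetch_required_bytes(volume_id)
    return required <= free_cache_bytes


calls = []

def fake_fetch(volume_id):
    # Stand-in for the backend lookup; records each call for inspection.
    calls.append(volume_id)
    return 100

refused = may_take_ownership("vol1", fake_fetch, free_cache_bytes=50)
forced = may_take_ownership("vol1", fake_fetch, free_cache_bytes=50, force=True)
accepted = may_take_ownership("vol1", fake_fetch, free_cache_bytes=200)
```

Refusing the transfer up front would let the edge learn that it selected the wrong node, while the `force` flag keeps a way to move a volume regardless.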

wimpers commented 6 years ago

... underlying assumption so far that HA will not be attempted to nodes that can't host more volumes based on the node distance map...

In my opinion this is abusing the node distance map.