openebs / lvm-localpv

Dynamically provision Stateful Persistent Node-Local Volumes & Filesystems for Kubernetes, integrated with a backend LVM2 data storage stack.
Apache License 2.0

LV node migration #314

Open pschichtel opened 3 weeks ago

pschichtel commented 3 weeks ago

Describe the problem/challenge you have

I'm hosting various clustered and stateful applications in Kubernetes. Some of these applications, like databases and message queues, require low-latency IO to perform well, which is why I use local PVs for them, and that works great. This way I can put very fast SSDs into these servers and use them without network overhead.

My only pain point with this setup is (unsurprisingly): the pods, once scheduled, are pinned to their node forever. The only way to move a pod is to delete both the PVC and the pod and hope that the scheduler doesn't decide to put it back onto the same node (sure, this can be helped with node selectors, affinities, anti-affinities and taints, but that's even more complexity). An additional issue, possibly a more serious one depending on the application, is the fact that node failures can't be recovered from automatically. Even if the application is able to restore its state from the remaining peers in its cluster, Kubernetes won't run the pod because it's pinned to a node that's unavailable.

Describe the solution you'd like

Currently, at least as far as I understand it, when Kubernetes schedules the pod it works like this (simplified):

  1. With volumeBindingMode: WaitForFirstConsumer, the pod is scheduled first and the volume is then provisioned on the chosen node.
  2. With volumeBindingMode: Immediate, the volume is provisioned up front and the pod is scheduled onto the node that holds it.

The former means that lvm-localpv will create an LV on the node that's selected for the pod; the latter means Kubernetes places the pod on the single node that carries the LV that has been eagerly created. Either way, it ends with a pod pinned to a node.
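
For reference, a minimal lvm-localpv StorageClass sketch (provisioner name and parameters as I understand them from the lvm-localpv README; lvmvg is a placeholder volume group), where the binding mode decides which of the two paths above is taken:

```yaml
# Sketch of an lvm-localpv StorageClass; "lvmvg" is a placeholder volume group name.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "lvmvg"
# WaitForFirstConsumer: the pod is scheduled first, then the LV is created on that node.
# Immediate: the LV is created first, and the pod later lands on the node that holds it.
volumeBindingMode: WaitForFirstConsumer
```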

What I would love to see is a way to make an LV available to all nodes in the cluster, independent of where it is physically placed. If the LV is already allocated on a node and Kubernetes happens to pick a different node, then just create a new LV on the new node, transfer the LV content over the network, and delete the old LV. If the LV does not exist yet, it can simply be created on the node that was picked.

That would obviously delay pod startup significantly depending on the size of the volume, and it might require a dedicated high-bandwidth network for the transfer so as not to disrupt other communication in the Kubernetes cluster. But for application clusters that are highly redundant and can cover a failed replica for a prolonged period, this could be perfectly fine.

And actually this could go one step further: Assuming that the application can restore its state from peers in its cluster, a feasible LV migration strategy would be to create a new empty LV without transferring data and let the application do the "transfer".

I could imagine this as a StorageClass option, something like dataMigrationMode, with one value per migration strategy described above.

Anything else you would like to add:

While the VolumeTransfer option would be awesome, it would also probably be quite involved, so being able to just get a fresh LV on a new node would probably be easier. I guess this also requires applications to be well behaved and deployments to be well configured, so that a rolling upgrade doesn't accidentally delete all the data.
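
Purely to illustrate the idea (nothing here exists in lvm-localpv today; dataMigrationMode and its value names are hypothetical):

```yaml
# Hypothetical sketch only: dataMigrationMode is NOT an existing lvm-localpv parameter.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: openebs-lvmpv-migratable
provisioner: local.csi.openebs.io
parameters:
  storage: "lvm"
  volgroup: "lvmvg"                    # placeholder volume group name
  # "VolumeTransfer" would copy the old LV's content over the network to the new node;
  # "Recreate" would hand the application a fresh, empty LV and let it restore from peers.
  dataMigrationMode: "VolumeTransfer"
volumeBindingMode: WaitForFirstConsumer
```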

avishnu commented 3 weeks ago

Thanks for detailing this very clearly. Have you considered using replicated storage such as OpenEBS Mayastor for the use-case you describe? You could set the replica count to 2, which means the volume target (NVMe-based) will write synchronously to 2 replica endpoints. If one of the replica nodes goes down or becomes unavailable/unreachable, Mayastor will reconcile automatically by spinning up a new replica and starting a rebuild. This is transparent to the stateful application pod.
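
For illustration, a minimal Mayastor StorageClass along those lines; the provisioner and parameter names are as I recall them from the Mayastor docs, so please double-check them against your version:

```yaml
# Sketch of a Mayastor StorageClass with two synchronous replicas per volume.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: mayastor-2-replicas
provisioner: io.openebs.csi-mayastor
parameters:
  repl: "2"          # number of synchronous replicas
  protocol: "nvmf"   # the volume target is exposed over NVMe-oF
volumeBindingMode: WaitForFirstConsumer
```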

pschichtel commented 3 weeks ago

I did consider it, yes. What wasn't clear to me there: do I get the guarantee that pods will only be placed on nodes with existing replicas, or will replicas automatically be moved to the pod's node? In other words, can I be sure that IO is always local? In some situations it would be worse to have an application member with significantly worse IO latency in the cluster than to simply not have that member available.

pschichtel commented 3 weeks ago

Also: I assume mayastor guarantees consistency between replicas, which forces some write overhead, because the write must be replicated to at least one other replica. Not sure if async/eventually consistent replication is supported.

pschichtel commented 3 weeks ago

So my priority for these deployments is good write latency, which makes synchronously replicated storage basically a no-go. Async replication would be viable to speed up application recovery, as the application may not need to start its recovery from scratch.

I see these types of applications:

  1. Applications that cannot restore member state at all (I don't have an example here)
  2. Applications that can restore some state (qdrant can restore its data, but it must retain its cluster membership state)
  3. Applications that can restore their entire state (postgres replicas, hashicorp vault, MinIO, ....)

Applications of type 1 would need a full copy of the old volume to restore a member; in the case of a node failure that would not be possible, so these would need synchronous replication anyway. But I don't know of an application like this, and I'm not convinced one exists.

Applications of type 2 need some part of the state in order to restore the rest; I guess this will usually be some form of cluster membership/peer information, similar to what qdrant does. These applications need some form of replication to recover from a node failure, but since these parts of the state don't see as much IO, async replication might be fine. I could also imagine splitting the volume into a part that uses synchronous replication with Mayastor and a part that uses localpv for the best latency (see the sketch below).
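
Roughly what that split could look like, assuming two StorageClasses named mayastor-2-replicas and openebs-lvmpv exist (the names, image and mount paths are placeholders):

```yaml
# Sketch: a small replicated volume for membership state plus a large node-local data volume.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-cluster
spec:
  serviceName: example-cluster
  replicas: 3
  selector:
    matchLabels:
      app: example-cluster
  template:
    metadata:
      labels:
        app: example-cluster
    spec:
      containers:
        - name: app
          image: example/app:latest          # placeholder image
          volumeMounts:
            - name: membership               # small, synchronously replicated
              mountPath: /var/lib/app/membership
            - name: data                     # large, node-local for low latency
              mountPath: /var/lib/app/data
  volumeClaimTemplates:
    - metadata:
        name: membership
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: mayastor-2-replicas   # placeholder StorageClass name
        resources:
          requests:
            storage: 1Gi
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: openebs-lvmpv         # placeholder StorageClass name
        resources:
          requests:
            storage: 100Gi
```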

Applications of type 3 can restore cluster membership without any state, so these would be fine with simply deleting the state entirely. They might benefit from async replication to be able to start migrated pods quicker, but they don't need it. Imagine a MinIO node with terabytes of data.

So considering node failures, only type 3 applications would be able to work without some form of replication, and these are also the applications that are fine with simply deleting the PV.

With this in mind, I think this feature request could be reduced to a simple option to disable the LV <--> node pinning that currently happens (see the sketch below for where that pinning comes from). Mayastor or some other replicated storage system would be required for the other application types anyway.
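
If I understand correctly, the pinning comes from the node affinity that lvm-localpv puts on each provisioned PV, roughly like the fragment below; the topology key openebs.io/nodename and the node name worker-1 are my assumptions/placeholders:

```yaml
# Fragment of a provisioned PersistentVolume (other fields omitted); this stanza is what
# pins the volume's consumers to the one node that holds the LV. Key and node name are illustrative.
apiVersion: v1
kind: PersistentVolume
spec:
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: openebs.io/nodename
              operator: In
              values:
                - worker-1
```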

avishnu commented 2 weeks ago

Also: I assume mayastor guarantees consistency between replicas, which forces some write overhead, because the write must be replicated to at least one other replica. Not sure if async/eventually consistent replication is supported.

Yes, Mayastor, being a block storage solution, needs to maintain strict consistency between all replicas of a volume, which calls for synchronous replication. If a replica lags behind due to a temporary or permanent fault, a rebuild process is triggered alongside the ongoing write ops. Once the rebuild completes and a consistent state is reached, the rebuilt replica returns to synchronous replication. Another thing to note is that the replication is parallel, not sequential: write ops are sent to all replicas simultaneously, so the overhead is limited by network bandwidth.

niladrih commented 2 weeks ago

Related issue: https://github.com/openebs/dynamic-localpv-provisioner/issues/87