k8up-io / k8up

Kubernetes and OpenShift Backup Operator
https://k8up.io/
Apache License 2.0

Backup of RWO volumes used by Pods fails #978

Closed johbo closed 3 months ago

johbo commented 4 months ago

Description

I have a stuck backup job, and I think this is because it tries to back up an RWO volume from the wrong Node.

On the backup Pod I see the following Event:

Multi-Attach error for volume "pvc-0df4196b-7261-425d-bee0-0e74d466a216" Volume is already used by pod(s) gitlab-postgresql-0   

The backup Pod is scheduled on my node test-v2-01, while the Pod which has the volume mounted is running on Node test-v2-02.

Additional Context

The volume is provided by Rook Ceph and uses RWO access mode.

I started to see the issue after switching over from plain local volumes. So far I have only seen it during a bootstrap from scratch; I will have to observe how it behaves over time in the running cluster.

Logs

No response

Expected Behavior

The backup Pod should be scheduled on the correct Node so that it can mount the Volume and perform the backup.

I am not sure whether there could be a race condition, e.g. k8up decides where to put the backup Pod, the other Pod which uses the Volume starts right after that, and k8up then works with outdated data.

Steps To Reproduce

I have not yet found a way to reproduce this reliably. So far I see this problem after a bootstrap of my setup from scratch, meaning fresh machines, a fresh Kubernetes installation, and then a fresh deployment of everything based on Flux.

Version of K8up

v2.10.0

Version of Kubernetes

v1.30.1+k0s

Distribution of Kubernetes

k0s

Kidswiss commented 4 months ago

Hi @johbo

Thanks for opening the issue.

Is it possible that the pod was restarted right after the backup started? K8up determines which pod is running on which node before creating the jobs. If the pod gets restarted and scheduled to a new node, that information may no longer match.
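
For illustration, the node lookup described above could conceptually look like the sketch below. This is not K8up's actual implementation; the helper `nodeForPVC` and its signature are made up for this example.

```go
package sketch

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeForPVC is a hypothetical helper illustrating the idea: find a Pod that
// mounts the given PVC and use the node it runs on as the target for the
// backup Pod. It is a sketch of the concept, not K8up's real code.
func nodeForPVC(ctx context.Context, c client.Client, namespace, pvcName string) (string, error) {
	pods := &corev1.PodList{}
	if err := c.List(ctx, pods, client.InNamespace(namespace)); err != nil {
		return "", err
	}
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil && vol.PersistentVolumeClaim.ClaimName == pvcName {
				// The node this Pod is scheduled on becomes the target node.
				return pod.Spec.NodeName, nil
			}
		}
	}
	return "", fmt.Errorf("no pod found mounting PVC %s", pvcName)
}
```

If the Pod that owns the volume is rescheduled to another node after this lookup, the recorded node name is stale, which would produce exactly the Multi-Attach error above.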

johbo commented 4 months ago

Yes, I think this could be the cause of the issue. I've not yet had a chance to look into the k8up code.

Based on your hint, though, I think this could explain it. During bootstrap I have a few things happening in sequence which could lead to this problem, if I understand you correctly:

  1. Create the volume (Deploy PVC)
  2. Restore data into the Volume via a Job
  3. Wait for the restore Job to complete
  4. Deploy the application

Now, the Schedule object is already deployed into the namespace early on. K8up could then trigger a backup based on the restore Job's Pod, and once the application is deployed, it could end up in a situation where the application's Pod is on a different Node.

I think for bootstrapping my setup I can work around this by delaying the deployment of the backup Schedule.

There might still be a minor problem due to a possible race condition if a Pod is restarted and k8up then has the "wrong" Node.

johbo commented 4 months ago

I observed the situation a little longer and saw more Jobs piling up, so my assumption that this is caused by my bootstrapping procedure is probably wrong.

I think I found the relevant code here: https://github.com/k8up-io/k8up/blob/1e7871c51b1a4e1647033fd54847fbecd495f525/operator/backupcontroller/executor.go#L62

I think I have to switch on more verbose logging to see what happens.

From reading the code, my assumption is that Pods in status "Completed" are not filtered out of the list. In my setup I do have a few of those that reference the PVC.
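
If that assumption is right, the list would have to skip Pods that have already finished before a node is picked, roughly along these lines (a sketch of the idea only, not the actual operator code or the code from any PR):

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// isCandidatePod sketches the filter discussed above: a Pod that has already
// finished (phase Succeeded, which kubectl shows as "Completed", or Failed)
// no longer attaches the RWO volume and should not influence where the backup
// Pod is scheduled. Hypothetical helper, not code from the linked PR.
func isCandidatePod(pod corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
		return false
	}
	// Only Pods that have actually been scheduled carry a usable node name.
	return pod.Spec.NodeName != ""
}
```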

Kidswiss commented 4 months ago

Interesting. But AFAIK a pod in state "completed" should release its lock on the PVC. So K8up should be able to mount it.

Did you see multi-attach errors referencing already completed pods?

johbo commented 4 months ago

I think I found the cause; I am currently testing the change in this PR: https://github.com/k8up-io/k8up/pull/979

It seems there is a catch with RWO Volumes: they are not always bound to a specific Node. I started seeing this issue when I switched my setup to Rook Ceph for providing the RWO Volumes, which means a Volume is no longer pinned to a specific Node. If a Job using the Volume is scheduled on Node 1, and after it finishes another Pod using the same Volume is scheduled on Node 2, k8up can get confused if the Pods of the first Job are still around in status "Completed".

I'll try to add a test case to the PR this evening; that should describe the problem more precisely.
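
To make the scenario concrete, here is a rough, self-contained reconstruction of the situation described above (not the test case for the PR; the Pod and node names are taken from this issue, and where the restore Job ran is assumed for illustration):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Finished restore Job Pod; node assignment assumed for illustration.
	restorePod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "restore-job-abc"},
		Spec:       corev1.PodSpec{NodeName: "test-v2-01"},
		Status:     corev1.PodStatus{Phase: corev1.PodSucceeded}, // kubectl shows "Completed"
	}
	// Application Pod that currently holds the RWO volume.
	appPod := corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "gitlab-postgresql-0"},
		Spec:       corev1.PodSpec{NodeName: "test-v2-02"},
		Status:     corev1.PodStatus{Phase: corev1.PodRunning},
	}

	// Both Pods reference the same RWO PVC, but only the running one should
	// decide where the backup Pod is scheduled.
	for _, pod := range []corev1.Pod{restorePod, appPod} {
		if pod.Status.Phase == corev1.PodSucceeded || pod.Status.Phase == corev1.PodFailed {
			continue // finished Pods no longer attach the volume
		}
		fmt.Printf("backup Pod should run on %s (volume used by %s)\n", pod.Spec.NodeName, pod.Name)
	}
}
```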

johbo commented 4 months ago

In my test setup I no longer observe the issue, so I think the change above fixes the problem I was seeing.

johbo commented 3 months ago

I am closing the issue now. I have had the change from PR https://github.com/k8up-io/k8up/pull/979 running for a few days and the problem is gone.