neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Bug: NeonVM controller creates more than one pod for a VM #794

Closed by shayanh 9 months ago

shayanh commented 9 months ago

Environment

Local

Steps to reproduce

$ kubectl apply -f vm-deploy.yaml
virtualmachine.vm.neon.tech/postgres14-disk-test created

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
postgres14-disk-test-g5qvs           2/2     Running   0          2m26s

$ kubectl delete pod postgres14-disk-test-g5qvs
pod "postgres14-disk-test-g5qvs" deleted

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
postgres14-disk-test-rgv6t           2/2     Running   0          28s
postgres14-disk-test-jcxsd           2/2     Running   0          28s

Expected result

Only one pod should be running after the original pod is deleted.

Actual result

Two pods are running for the same VM.

Why does this happen?

Running two reconcile jobs for the same VM might lead to this situation. This is a classic race condition. The problem has always been there, but it only became visible now that we have increased the number of reconcile workers.

shayanh commented 9 months ago

I investigated this issue and discussed it with the autoscaling team earlier today. I was previously wrong about there being two concurrent reconcile jobs: controller-runtime guarantees that at most one reconcile job runs per object within a single controller (more info).
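
For context, the worker count is just a knob on controller-runtime's controller.Options. Here is a minimal sketch of how a reconciler might be wired up; the VMReconciler name and the API import path are illustrative, not the actual NeonVM code.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	// Assumed import path for the NeonVM API types; the real path may differ.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// VMReconciler is a stand-in for the real NeonVM reconciler.
type VMReconciler struct {
	client.Client
}

func (r *VMReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... reconcile the single VirtualMachine identified by req.NamespacedName ...
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler. Raising MaxConcurrentReconciles
// lets different VMs be reconciled in parallel, but controller-runtime still
// runs at most one Reconcile call at a time per object key.
func (r *VMReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&vmv1.VirtualMachine{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 8}).
		Complete(r)
}
```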

Here is what causes the problem:

  1. Reconcile job 1: Observes that the runner pod doesn't exist (since we killed it earlier). The reconciler logs Restarting VM runner pod and cleans up the VM object.
  2. Reconcile job 2: Creates a new pod and updates VM's status with the new PodName.
  3. Reconcile job 3: Reads stale data from the k8s API server. Note that this stale object doesn't have a .Status.PodName, so the reconciler creates yet another pod and tries to update the VM status with the new PodName. This is the point where we realized that our retry logic for updating the VM status is flawed. A sketch of what a conflict-safe status update could look like follows this list.
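
As a rough illustration (a sketch under assumed type and import names, not the actual NeonVM code and not necessarily what the fix below does): the key point is to re-read the VM on every retry attempt instead of retrying the update with the stale in-memory copy.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the NeonVM API types; the real path may differ.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// setVMPodName updates .Status.PodName, re-reading the VM before every
// attempt so that a conflict never makes us write back stale status.
func setVMPodName(ctx context.Context, c client.Client, key types.NamespacedName, podName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var vm vmv1.VirtualMachine
		// Re-fetch before updating; retrying with the object we already hold
		// is exactly the stale-write problem described above.
		if err := c.Get(ctx, key, &vm); err != nil {
			return err
		}
		vm.Status.PodName = podName
		return c.Status().Update(ctx, &vm)
	})
}
```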

#796 fixes the retry logic.

There are a few notes:

  1. Why are we reading stale data from the k8s API server? The answer is the multiple layers of caching in between: controller-runtime's default client reads from an informer cache rather than hitting the API server directly. Check out https://github.com/kubernetes-sigs/controller-runtime/issues/1464. A sketch of reading through an uncached reader follows this list.
  2. Our reconcile job is tricky: it has side effects, and a failure might leave us with undesired partial results.
  3. I'm going to add some new e2e tests where we kill a runner pod. In the future, we can consider using https://github.com/sieve-project/sieve to better test our operators.
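
On note 1, one way to avoid a stale read in the few places where it really matters is to go through controller-runtime's uncached reader (the one returned by the manager's GetAPIReader()). A minimal sketch; the helper name is made up, and the trade-off is extra load on the API server, so this is only for targeted use.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getRunnerPodFresh is a hypothetical helper that reads the runner pod
// straight from the API server through an uncached reader, instead of the
// informer cache that the default client uses. Worth the extra round trip
// only where a stale read can trigger a duplicate side effect, such as
// creating a second runner pod.
func getRunnerPodFresh(ctx context.Context, apiReader client.Reader, key types.NamespacedName) (*corev1.Pod, error) {
	var pod corev1.Pod
	if err := apiReader.Get(ctx, key, &pod); err != nil {
		return nil, err
	}
	return &pod, nil
}
```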