neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0

Bug: NeonVM controller creates more than one pod for a VM #794

Closed by shayanh 9 months ago

shayanh commented 9 months ago

Environment

Local

Steps to reproduce

$ kubectl apply -f vm-deploy.yaml
virtualmachine.vm.neon.tech/postgres14-disk-test created

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
postgres14-disk-test-g5qvs           2/2     Running   0          2m26s

$ kubectl delete pod postgres14-disk-test-g5qvs
pod "postgres14-disk-test-g5qvs" deleted

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
postgres14-disk-test-rgv6t           2/2     Running   0          28s
postgres14-disk-test-jcxsd           2/2     Running   0          28s

Expected result

Only one pod should be running after the original pod is deleted.

Actual result

Two pods are running for the same VM.

Why does this happen?

Running two reconcile jobs for the same VM might lead to this situation. This is a classic race condition. The problem has always been there, but it only became visible now that we have increased the number of reconcile workers.

shayanh commented 9 months ago

I investigated this issue and discussed it with the autoscaling team earlier today. I was previously wrong about there being two concurrent reconcile jobs: controller-runtime guarantees that at most one reconcile job runs per object within a single controller (more info).
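
For context, the worker count is just a knob on controller-runtime's controller.Options. Here is a minimal sketch of how a reconciler might be wired up; the VMReconciler name and the API import path are illustrative, not the actual NeonVM code.

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"

	// Assumed import path for the NeonVM API types; the real path may differ.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// VMReconciler is a stand-in for the real NeonVM reconciler.
type VMReconciler struct {
	client.Client
}

func (r *VMReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// ... reconcile the single VirtualMachine identified by req.NamespacedName ...
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler. Raising MaxConcurrentReconciles
// lets different VMs be reconciled in parallel, but controller-runtime still
// runs at most one Reconcile call at a time per object key.
func (r *VMReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&vmv1.VirtualMachine{}).
		WithOptions(controller.Options{MaxConcurrentReconciles: 8}).
		Complete(r)
}
```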

Here is what causes the problem:

  1. Reconcile job 1: Observes that the runner pod doesn't exist (since we killed it earlier). The reconciler logs Restarting VM runner pod and cleans up the VM object.
  2. Reconcile job 2: Creates a new pod and updates VM's status with the new PodName.
  3. Reconcile job 3: Reads stale data from the k8s API server. Note that this stale object doesn't have a .Status.PodName, so the reconciler creates yet another pod and tries to update the VM status with the new PodName. This is the point where we realized that our retry logic for updating the VM status is flawed. A sketch of what a conflict-safe status update could look like follows this list.
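
As a rough illustration (a sketch under assumed type and import names, not the actual NeonVM code and not necessarily what the fix below does): the key point is to re-read the VM on every retry attempt instead of retrying the update with the stale in-memory copy.

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the NeonVM API types; the real path may differ.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// setVMPodName updates .Status.PodName, re-reading the VM before every
// attempt so that a conflict never makes us write back stale status.
func setVMPodName(ctx context.Context, c client.Client, key types.NamespacedName, podName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		var vm vmv1.VirtualMachine
		// Re-fetch before updating; retrying with the object we already hold
		// is exactly the stale-write problem described above.
		if err := c.Get(ctx, key, &vm); err != nil {
			return err
		}
		vm.Status.PodName = podName
		return c.Status().Update(ctx, &vm)
	})
}
```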

#796 fixes the retry logic.

There are a few notes:

  1. Why are we reading stale data from the k8s API server? The answer is the multiple layers of caching in between: controller-runtime's default client reads from an informer cache rather than hitting the API server directly. Check out https://github.com/kubernetes-sigs/controller-runtime/issues/1464. A sketch of reading through an uncached reader follows this list.
  2. Our reconcile job is tricky: it has side effects, and a failure might leave us with undesired partial results.
  3. I'm going to add some new e2e tests where we kill a runner pod. In the future, we can consider using https://github.com/sieve-project/sieve to better test our operators.
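
On note 1, one way to avoid a stale read in the few places where it really matters is to go through controller-runtime's uncached reader (the one returned by the manager's GetAPIReader()). A minimal sketch; the helper name is made up, and the trade-off is extra load on the API server, so this is only for targeted use.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// getRunnerPodFresh is a hypothetical helper that reads the runner pod
// straight from the API server through an uncached reader, instead of the
// informer cache that the default client uses. Worth the extra round trip
// only where a stale read can trigger a duplicate side effect, such as
// creating a second runner pod.
func getRunnerPodFresh(ctx context.Context, apiReader client.Reader, key types.NamespacedName) (*corev1.Pod, error) {
	var pod corev1.Pod
	if err := apiReader.Get(ctx, key, &pod); err != nil {
		return nil, err
	}
	return &pod, nil
}
```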