shayanh closed this issue 9 months ago.
I investigated this issue and discussed it earlier today with the autoscaling team. Previously I was wrong about having two concurrent reconcile jobs. Controller-runtime guarantees that there will be at most one reconcile job running for each object in a single controller (more info).
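For context, here is a minimal sketch of how that guarantee interacts with the worker-count setting in controller-runtime. The reconciler and object type below are illustrative stand-ins, not the NeonVM controller: `MaxConcurrentReconciles` only lets reconciles for *different* objects run in parallel, while the workqueue still hands out at most one reconcile at a time per object key.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller"
)

// Toy reconciler used only to illustrate the concurrency guarantee.
type demoReconciler struct {
	client.Client
}

func (r *demoReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Even with MaxConcurrentReconciles > 1, controller-runtime's workqueue
	// guarantees at most one in-flight Reconcile per req.NamespacedName.
	return ctrl.Result{}, nil
}

func setup(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&corev1.Pod{}).
		// Parallelism applies across *different* objects only.
		WithOptions(controller.Options{MaxConcurrentReconciles: 8}).
		Complete(&demoReconciler{Client: mgr.GetClient()})
}
```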
Here is what causes the problem:
1. Restarting the VM runner pod (the old runner pod gets deleted).
2. The reconciler sees that the pod is gone and cleans up the VM object's `.Status.PodName`.
3. On the next reconcile the VM has an empty `.Status.PodName`, so the reconciler creates a pod and tries to update the VM status object with the new `PodName`.
4. Here we get to the point where we realized our retry logic for updating the VM status is flawed. Here is what happens in the retry logic (see the sketch after this list):
   1. We overwrite our in-memory `virtualmachine` object with the value we received from the API server.
   2. We retry the status update with that `virtualmachine` object. This is almost equivalent to assigning a variable to itself and ignoring the changes we actually wanted to make.
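Here is a hedged sketch of the retry shape described above, next to a conflict-safe variant. It is not a copy of the linked code; the `vmv1` import path, function names, and the three-attempt loop are assumptions for illustration. The key point: after the re-`Get`, `vm.Status` holds exactly what the API server already has, so the follow-up status update "succeeds" while silently dropping the new `PodName`. The safe variant re-applies the intended change after every refetch, for example via client-go's `retry.RetryOnConflict`.

```go
package controllers

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the NeonVM VirtualMachine type.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// updateStatusFlawed mirrors the flawed shape: on conflict it re-fetches the
// object and retries the update with that fresh, *unmodified* copy, so the
// retry writes back what the server already has and the new PodName is lost.
func updateStatusFlawed(ctx context.Context, c client.Client, vm *vmv1.VirtualMachine) error {
	err := c.Status().Update(ctx, vm)
	if err == nil || !apierrors.IsConflict(err) {
		return err
	}
	for attempt := 0; attempt < 3; attempt++ {
		if err := c.Get(ctx, client.ObjectKeyFromObject(vm), vm); err != nil {
			return err
		}
		// vm.Status is now the server's version; our intended change is gone.
		if err := c.Status().Update(ctx, vm); err == nil {
			return nil // "succeeds", but as a no-op
		}
	}
	return fmt.Errorf("giving up on status update for %s", vm.Name)
}

// updateStatusSafe re-applies the intended mutation after every refetch, so a
// successful update always carries the change we meant to make.
func updateStatusSafe(ctx context.Context, c client.Client, vm *vmv1.VirtualMachine, podName string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		if err := c.Get(ctx, client.ObjectKeyFromObject(vm), vm); err != nil {
			return err
		}
		vm.Status.PodName = podName // re-apply the change on the fresh object
		return c.Status().Update(ctx, vm)
	})
}
```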
The actual retry logic is here:
https://github.com/neondatabase/autoscaling/blob/52a883bfa5cb20e95c2c8f0bc2e913326c6f36a6/neonvm/controllers/virtualmachine_controller.go#L175-L194

There are a few notes:
Environment
Local
Steps to reproduce
Expected result
Seeing one pod after deleting the first one.
Actual result
Two pods are running for the same VM.
Why does this happen?
Running two reconcile jobs for the same VM might lead to this situation. This is a classic race condition. The problem has always been there, but it only became visible once we increased the number of reconcile workers.
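To make the suspected race concrete, here is a conceptual sketch only (the closing comment above explains that controller-runtime actually serializes reconciles per object within a single controller, so this exact scenario was ruled out): if two reconciles for the same VM could run concurrently, both could observe an empty `PodName` and each create a runner pod before either records it. The `newRunnerPodFor` helper and the trimmed-down pod spec are hypothetical.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the NeonVM VirtualMachine type.
	vmv1 "github.com/neondatabase/autoscaling/neonvm/apis/neonvm/v1"
)

// newRunnerPodFor is a hypothetical, heavily simplified stand-in; the real
// runner pod carries much more configuration.
func newRunnerPodFor(vm *vmv1.VirtualMachine) *corev1.Pod {
	return &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		GenerateName: vm.Name + "-",
		Namespace:    vm.Namespace,
	}}
}

// reconcileOnce shows the check-then-act window: if two workers ran this for
// the same VM concurrently, both could pass the PodName == "" check and each
// create a pod; one status update wins, but two pods exist.
func reconcileOnce(ctx context.Context, c client.Client, vm *vmv1.VirtualMachine) error {
	if vm.Status.PodName == "" {
		pod := newRunnerPodFor(vm)
		if err := c.Create(ctx, pod); err != nil {
			return err
		}
		vm.Status.PodName = pod.Name
		return c.Status().Update(ctx, vm)
	}
	return nil
}
```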