sharnoff closed this PR 7 months ago.
Thanks for opening this PR! I looked over the changes and I need to spend more time to understand them better.
Our watch and relist logic is already complex, and these changes make it even more complicated. We should be careful about complexity here.
I think this actually may not be necessary; there are other, simpler ways around the issue. Will open an issue when I get a chance.
edit: relevant issue: #863
Converted to draft because we'll probably bypass via #863.
Resolved via #863
Fixes #851; refer there for more context.

This PR adds a new feature to `util/watch`: the ability to fetch the (approximate) upper bound on the number of `Relist`s that have taken place so far, and to trigger relisting only if no more have happened since. More symbolically, it adds the ability to fetch a "timestamp"-like token from `watch.Store` and say "please relist if you haven't yet, since this timestamp".

This allows us to fix the duplicate `Relist()`s issue by attaching this token to each pod start event as it comes in, so that only pod events received after the most recent `Relist()` started will trigger another one.
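To make the dedup mechanism concrete, here is a minimal, self-contained Go sketch of the idea. This is not the actual `util/watch` API: only `watch.RelistCount` is a real name from this PR, while `store`, `token`, and `relistIfNotSince` below are illustrative stand-ins.

```go
package main

import (
	"fmt"
	"sync"
)

// relistToken is a "timestamp"-like token: the number of relists observed when
// the token was taken. (In the PR, this role is played by watch.RelistCount.)
type relistToken uint64

// store is a toy stand-in for watch.Store, showing only the dedup mechanism:
// a relist bumps the counter, and a relist request is skipped if a newer
// relist already happened after the caller's token was taken.
type store struct {
	mu      sync.Mutex
	relists uint64
}

// token returns the current relist count, to be attached to an incoming event.
func (s *store) token() relistToken {
	s.mu.Lock()
	defer s.mu.Unlock()
	return relistToken(s.relists)
}

// relistIfNotSince performs a relist only if none has happened since tok was
// taken. It reports whether a relist was actually performed.
func (s *store) relistIfNotSince(tok relistToken) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if uint64(tok) < s.relists {
		return false // a relist already happened after the token was taken
	}
	s.relists++ // pretend we re-fetched everything from the API server
	return true
}

func main() {
	s := &store{}

	// Two pod start events arrive back-to-back; each grabs a token.
	tokA, tokB := s.token(), s.token()

	// Both handlers later discover their VM is missing and ask to relist.
	fmt.Println(s.relistIfNotSince(tokA)) // true: performs the relist
	fmt.Println(s.relistIfNotSince(tokB)) // false: a relist already happened
}
```

In the PR itself, the scheduler's pod-watch callback receives this token (the `vmRelistCount watch.RelistCount` parameter visible in the testing patch below) and passes it along with the pod start event.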
Quoting one of the commits:

I tested this locally and have confirmed that it fixes #851. Details collapsed below:
Testing details
In general, the testing plan was:

1. Add artificial delay to processing pod start events
2. Create two VMs at the same time, that _don't_ go through the scheduler (so they won't exist in local state)
3. After the VM pod start events are queued, delete the VMs
4. Profit? (jk)

I used this patch to modify the scheduler:

```diff
diff --git a/pkg/plugin/plugin.go b/pkg/plugin/plugin.go
index 7c76c50..d07d833 100644
--- a/pkg/plugin/plugin.go
+++ b/pkg/plugin/plugin.go
@@ -133,7 +133,12 @@ func makeAutoscaleEnforcerPlugin(
 	}
 	pwc := podWatchCallbacks{
 		submitStarted: func(logger *zap.Logger, pod *corev1.Pod, vmRelistCount watch.RelistCount) {
-			pushToQueue(logger, func() { p.handleStarted(hlogger, pod, vmRelistCount) })
+			pushToQueue(logger, func() {
+				hlogger.Info("DEBUG: SLEEPING FOR 10 SECONDS", util.PodNameFields(pod))
+				time.Sleep(time.Second * 10)
+				hlogger.Info("DEBUG: DONE SLEEPING", util.PodNameFields(pod))
+				p.handleStarted(hlogger, pod, vmRelistCount)
+			})
 		},
 		submitDeletion: func(logger *zap.Logger, name util.NamespacedName) {
 			pushToQueue(logger, func() { p.handleDeletion(hlogger, name) })
```

... and this one, to modify `vm-deploy.yaml`:

```diff
diff --git a/vm-deploy.yaml b/vm-deploy.yaml
index e549f6e..bdae573 100644
--- a/vm-deploy.yaml
+++ b/vm-deploy.yaml
@@ -11,7 +11,6 @@ metadata:
     # Set to "true" to continuously migrate the VM (TESTING ONLY)
     autoscaling.neon.tech/testing-only-always-migrate: "false"
 spec:
-  schedulerName: autoscale-scheduler
   enableSSH: true
   guest:
     cpus: { min: 0.25, use: 0.25, max: 1.25 }
```

I also created a copy of `vm-deploy.yaml` with a different name, to act as the second VM. Commands for testing were:

```sh
kubectl apply -f vm-deploy.yaml && kubectl apply -f vm-deploy2.yaml
kubectl delete neonvm postgres16-disk-test postgres16-disk-test-2
```

... and this was the scheduler's log output, showing that it relisted only once, even though both VMs were missing:

```
{"level":"info","ts":1709965015.6497762,"logger":"autoscale-scheduler.vm-watch","caller":"plugin/watch.go:291","msg":"Skipping update for VM because .status.podName is empty","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"}}
{"level":"info","ts":1709965015.7465107,"logger":"autoscale-scheduler.vm-watch","caller":"plugin/watch.go:291","msg":"Skipping update for VM because .status.podName is empty","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"}}
{"level":"info","ts":1709965018.1640103,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:140","msg":"Received update event for Pod now running","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965018.1640441,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:137","msg":"DEBUG: SLEEPING FOR 10 SECONDS","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965018.1794043,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:140","msg":"Received update event for Pod now running","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965024.2002063,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:146","msg":"Received update event for completion of Pod","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965024.2221467,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:173","msg":"Received delete event for completed Pod","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965024.2392888,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:146","msg":"Received update event for completion of Pod","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965024.2548335,"logger":"autoscale-scheduler.pod-watch","caller":"plugin/watch.go:173","msg":"Received delete event for completed Pod","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965028.16438,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:139","msg":"DEBUG: DONE SLEEPING","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965028.1644406,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:587","msg":"Handling Pod start event","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"warn","ts":1709965028.1644697,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:284","msg":"VM is missing from local store. Relisting if it hasn't already been done","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"},"virtualmachine":{"namespace":"default","name":"postgres16-disk-test"}}
{"level":"info","ts":1709965028.1646848,"logger":"autoscale-scheduler.vm-watch.watch","caller":"watch/watch.go:334","msg":"Relisting"}
{"level":"info","ts":1709965028.166872,"logger":"autoscale-scheduler.vm-watch.watch","caller":"watch/watch.go:427","msg":"Relist complete, restarting watcher"}
{"level":"error","ts":1709965028.1670778,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:617","msg":"Error getting VM info for Pod","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test"},"pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"},"error":"Could not find VM for pod, even after relist","stacktrace":"github.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).reserveResources\n\t/workspace/pkg/plugin/state.go:617\ngithub.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).handleStarted\n\t/workspace/pkg/plugin/state.go:589\ngithub.com/neondatabase/autoscaling/pkg/plugin.makeAutoscaleEnforcerPlugin.func3.1\n\t/workspace/pkg/plugin/plugin.go:140\ngithub.com/neondatabase/autoscaling/pkg/plugin.makeAutoscaleEnforcerPlugin.func11\n\t/workspace/pkg/plugin/plugin.go:225"}
{"level":"info","ts":1709965028.167148,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:137","msg":"DEBUG: SLEEPING FOR 10 SECONDS","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965038.1678114,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:139","msg":"DEBUG: DONE SLEEPING","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965038.1678467,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:587","msg":"Handling Pod start event","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"warn","ts":1709965038.1678603,"logger":"autoscale-scheduler.handlers","caller":"plugin/plugin.go:284","msg":"VM is missing from local store. Relisting if it hasn't already been done","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"},"virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"}}
{"level":"error","ts":1709965038.167872,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:617","msg":"Error getting VM info for Pod","action":"Pod started","node":"neonvm-root-worker","virtualmachine":{"namespace":"default","name":"postgres16-disk-test-2"},"pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"},"error":"Could not find VM for pod, even after relist","stacktrace":"github.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).reserveResources\n\t/workspace/pkg/plugin/state.go:617\ngithub.com/neondatabase/autoscaling/pkg/plugin.(*AutoscaleEnforcer).handleStarted\n\t/workspace/pkg/plugin/state.go:589\ngithub.com/neondatabase/autoscaling/pkg/plugin.makeAutoscaleEnforcerPlugin.func3.1\n\t/workspace/pkg/plugin/plugin.go:140\ngithub.com/neondatabase/autoscaling/pkg/plugin.makeAutoscaleEnforcerPlugin.func11\n\t/workspace/pkg/plugin/plugin.go:225"}
{"level":"info","ts":1709965038.1678982,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:767","msg":"Handling deletion of VM pod","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"warn","ts":1709965038.1679075,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:796","msg":"Cannot find Pod in global pods map","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"}}
{"level":"info","ts":1709965038.167914,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:771","msg":"Deleted Pod","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-n6wkf"},"migrating":false,"verdict":{"cpu":"","mem":""}}
{"level":"info","ts":1709965038.1679213,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:767","msg":"Handling deletion of VM pod","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"warn","ts":1709965038.1679294,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:796","msg":"Cannot find Pod in global pods map","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"}}
{"level":"info","ts":1709965038.1679342,"logger":"autoscale-scheduler.handlers","caller":"plugin/state.go:771","msg":"Deleted Pod","action":"VM deletion","pod":{"namespace":"default","name":"postgres16-disk-test-2-2z8j7"},"migrating":false,"verdict":{"cpu":"","mem":""}}
```
NB: I intend to merge this PR via "rebase and merge" to keep the separate commits.
Notes for reviewers:

- It's probably easiest to review each commit one by one.
- Happy to receive suggestions for alternate naming — I'm not a huge fan of `RelistCount`; ideally the name would make the implications a little clearer.
- This PR also ended up much larger than I thought it would, but I think it's about as complex as @shayanh expected :sweat_smile:

Also, I'll probably build locally & deploy to staging for testing, so we can do a hotfix early next week.