neondatabase / autoscaling

Postgres vertical autoscaling in k8s
Apache License 2.0
163 stars 21 forks source link

Bug: autoscaling.virtio-mem is flaky #1006

Closed Omrigan closed 4 weeks ago

Omrigan commented 3 months ago

Fails roughly 50% of the time.

sharnoff commented 3 months ago

@Omrigan how long has it been flaky for? Can you link some example runs?

Omrigan commented 3 months ago

@Omrigan how long has it been flaky for?

I'd assume from the introduction.

Can you link some example runs?

Your PR: https://github.com/neondatabase/autoscaling/actions/runs/9878750381/job/27283615758#step:14:1034 My PR: https://github.com/neondatabase/autoscaling/actions/runs/9889976645/job/27317397731#step:14:948

I had 3 or 4 such cases, I think.

However, maybe it is not that particular test, here is the other test failing: https://github.com/neondatabase/autoscaling/actions/runs/9890395470/job/27319200362#step:14:987

stradig commented 3 months ago

We should check if this is only timeouts or if we suspect an actual problem. Until then marking as P1.

sharnoff commented 3 months ago

Haven't yet had a chance to look at it. However - @Omrigan if you're truly observing it fail 50% of the time, it may be due to your PR. The occurrence you flagged was the first time I'd seen it fail on one of my PRs.

sharnoff commented 3 months ago

A couple recent runs:

Both cases look like they timed out after 5 minutes on downscaling.

petuhovskiy commented 2 months ago

Reproduces locally in about ~2 hours. From the logs it seems like the issue is that VM uses too much memory sometimes and it prevents downscaling. I think the next step for debugging would be to print memory usage in logs periodically. For now I'll put this issue in selected, but want to get back to it later.

My logs ``` 2024/08/21 02:49:07 running without a 'kuttl-test.yaml' configuration 2024/08/21 02:49:07 kutt-test config testdirs is overridden with args: [ tests/e2e ] === RUN kuttl harness.go:465: starting setup harness.go:255: running tests using configured kubeconfig. harness.go:278: Successful connection to cluster at: https://127.0.0.1:33073 harness.go:363: running tests harness.go:75: going to run test suite with timeout of 30 seconds for each step harness.go:375: testsuite: tests/e2e has 13 tests === RUN kuttl/harness === RUN kuttl/harness/autoscaling.virtio-mem === PAUSE kuttl/harness/autoscaling.virtio-mem === CONT kuttl/harness/autoscaling.virtio-mem logger.go:42: 02:49:07 | autoscaling.virtio-mem | Creating namespace: kuttl-test-helpful-dolphin logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | starting test step 0-create-vm logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | Service:kuttl-test-helpful-dolphin/example created logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | VirtualMachine:kuttl-test-helpful-dolphin/example created logger.go:42: 02:49:07 | autoscaling.virtio-mem/0-create-vm | ConfigMap:kuttl-test-helpful-dolphin/example-config created logger.go:42: 02:49:15 | autoscaling.virtio-mem/0-create-vm | test step completed 0-create-vm logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | starting test step 1-upscale logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | Pod:kuttl-test-helpful-dolphin/workload created logger.go:42: 02:49:15 | autoscaling.virtio-mem/1-upscale | ConfigMap:kuttl-test-helpful-dolphin/query created logger.go:42: 02:49:26 | autoscaling.virtio-mem/1-upscale | test step completed 1-upscale logger.go:42: 02:49:26 | autoscaling.virtio-mem/2-downscale | starting test step 2-downscale logger.go:42: 02:54:28 | autoscaling.virtio-mem/2-downscale | test step failed 2-downscale case.go:366: failed in step 2-downscale case.go:368: --- VirtualMachine:kuttl-test-helpful-dolphin/example +++ VirtualMachine:kuttl-test-helpful-dolphin/example @@ -1,14 +1,37 @@ apiVersion: vm.neon.tech/v1 kind: VirtualMachine metadata: + annotations: + autoscaling.neon.tech/bounds: '{ "min": { "cpu": "250m", "mem": "1Gi" }, "max": + {"cpu": 1, "mem": "4Gi" } }' + finalizers: + - vm.neon.tech/finalizer + labels: + autoscaling.neon.tech/enabled: "true" name: example namespace: kuttl-test-helpful-dolphin +spec: '[... elided field over 10 lines long ...]' status: conditions: - - status: "True" + - lastTransitionTime: "2024-08-21T00:49:10Z" + message: Pod (example-jxxtr) for VirtualMachine (example) created successfully + reason: Reconciling + status: "True" type: Available - cpus: 250m - memorySize: 1Gi + cpus: 750m + currentRevision: + revision: + flags: 1 + value: 21 + updatedAt: "2024-08-21T00:54:20Z" + extraNetIP: 10.100.128.1 + extraNetMask: 255.255.0.0 + memoryProvider: VirtioMem + memorySize: 3Gi + node: neonvm-arthur-worker phase: Running + podIP: 10.244.2.134 + podName: example-jxxtr restartCount: 0 + sshSecretName: ssh-neonvm-example case.go:368: resource VirtualMachine:kuttl-test-helpful-dolphin/example: .status.memorySize: value mismatch, expected: 1Gi != actual: 3Gi logger.go:42: 02:54:28 | autoscaling.virtio-mem | autoscaling.virtio-mem events from ns kuttl-test-helpful-dolphin: logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST NormalPod example-jxxtr Binding Scheduled Successfully assigned kuttl-test-helpful-dolphin/example-jxxtr to neonvm-arthur-worker autoscale-scheduler logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST NormalPod example-jxxtr AddedInterface Add eth0 [10.244.2.134/24] from kindnet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST NormalVirtualMachine.vm.neon.tech example OverlayNet Acquired IP 10.100.128.1/16 for overlay network interface virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:07 +0200 CEST NormalVirtualMachine.vm.neon.tech example Created VirtualMachine example created, Pod example-jxxtr, SSH Secret ssh-neonvm-example virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST NormalPod example-jxxtr AddedInterface Add net1 [] from neonvm-system/neonvm-overlay-for-vms logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST NormalPod example-jxxtr.spec.initContainers{init} Pulled Container image "vm-postgres:15-bullseye" already present on machine kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST NormalPod example-jxxtr.spec.initContainers{init} Created Created container init kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:08 +0200 CEST NormalPod example-jxxtr.spec.initContainers{init} Started Started container init kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-runner} Pulled Container image "runner:dev" already present on machine kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-runner} Created Created container neonvm-runner kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-runner} Started Started container neonvm-runner kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-container-mgr} Pulled Container image "runner:dev" already present on machine kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:09 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-container-mgr} Created Created container neonvm-container-mgr kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST NormalPod example-jxxtr.spec.containers{neonvm-container-mgr} Started Started container neonvm-container-mgr kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST NormalVirtualMachine.vm.neon.tech example CpuInfo VirtualMachine example uses 0.5 cpu cores virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:10 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 1Gi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST NormalVirtualMachine.vm.neon.tech example ScaleDown Runner pod cgroups was updated on VM example virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST NormalVirtualMachine.vm.neon.tech example CpuInfo VirtualMachine example uses 0.25 cpu cores virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST NormalPod workload Scheduled Successfully assigned kuttl-test-helpful-dolphin/workload to neonvm-arthur-worker default-scheduler logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:15 +0200 CEST NormalPod workload AddedInterface Add eth0 [10.244.2.135/24] from kindnet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.initContainers{wait-for-pg} Pulled Container image "postgres:15-bullseye" already present on machine kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.initContainers{wait-for-pg} Created Created container wait-for-pgkubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.initContainers{wait-for-pg} Started Started container wait-for-pgkubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.containers{pgbench} Pulled Container image "postgres:15-bullseye" already present on machine kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.containers{pgbench} Created Created container pgbench kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:16 +0200 CEST NormalPod workload.spec.containers{pgbench} Started Started container pgbench kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST NormalVirtualMachine.vm.neon.tech example ScaleUp Runner pod cgroups was updated on VM example virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST NormalVirtualMachine.vm.neon.tech example ScaleUp Set virtio-mem size for 4Gi total memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST NormalVirtualMachine.vm.neon.tech example CpuInfo VirtualMachine example uses 1 cpu cores virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:24 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 2Gi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:25 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 4Gi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:49:26 +0200 CEST NormalPod workload.spec.containers{pgbench} Killing Stopping container pgbench kubelet logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example ScaleDown Set virtio-mem size for 3Gi total memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example CpuInfo VirtualMachine example uses 0.75 cpu cores virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3840Mi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3712Mi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3456Mi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3328Mi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3200Mi memory virtualmachine-controller logger.go:42: 02:54:28 | autoscaling.virtio-mem | 2024-08-21 02:50:55 +0200 CEST NormalVirtualMachine.vm.neon.tech example MemoryInfo VirtualMachine example uses 3Gi memory virtualmachine-controller logger.go:42: 02:54:29 | autoscaling.virtio-mem | Deleting namespace: kuttl-test-helpful-dolphin === NAME kuttl harness.go:408: run tests finished harness.go:516: cleaning up harness.go:573: removing temp folder: "" --- FAIL: kuttl (327.52s) --- FAIL: kuttl/harness (0.00s) --- FAIL: kuttl/harness/autoscaling.virtio-mem (327.48s) FAIL ```
stradig commented 1 month ago

This might be fixed by now. Arthur will reproduce and see if it is fixed.

petuhovskiy commented 4 weeks ago

https://gist.github.com/petuhovskiy/226522b34bd85a3a8d2d8ee88fa43dbd

Tried to reproduce, it failed only once in 360 runs. Let's consider it fixed for now.