neondatabase / autoscaling

Postgres vertical autoscaling in k8s

Bug: Kernel panic on memory plug #450

Open vadim2404 opened 1 year ago

vadim2404 commented 1 year ago

Environment

Steps to reproduce

Expected result

Actual result

Other logs, links

vadim2404 commented 1 year ago

@knizhnik created a branch with in-memory cache neondatabase/neon#4891

sharnoff commented 1 year ago

ref https://neondb.slack.com/archives/C03TN5G758R/p1692042048475039?thread_ts=1690842628.716389

I've been able to reproduce the kernel panic with a relatively simple setup (at least, not requiring Neon's entire stack :sweat_smile:). It looks like the panic requires tmpfs usage to occur; it does not trigger with equivalent load from anonymous mmap usage.
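
For anyone re-checking this, a quick way to see which kind of memory is in play inside the guest is to watch `/proc/meminfo` (a minimal sketch; tmpfs-backed pages are accounted under `Shmem`, anonymous mmap pages under `AnonPages`):

```sh
# tmpfs usage shows up as Shmem, anonymous mmap usage as AnonPages.
# Watching these while running the load confirms which path is stressing the kernel.
grep -E 'MemTotal|Shmem:|AnonPages' /proc/meminfo
```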

Below are the notes I wrote for myself while trying to debug this, along with the various files referenced there.

Debugging log
**tl;dr: reproduction steps towards the bottom**

## 2023-08-01

**Version B**

Using `repro-loop.sh`. Version "B" seems most promising.

Using this command inside the VM:

```sh
while true; do allocate-loop 1024 1536; done
```

with `repro-loop.sh` MIN=2, MAX=3, and VM spec min=1, max=10.

Initially looks kind of promising because it's able to produce pathological behavior — VM stalling for a significant amount of time, and page migration failure.

**Version C**

Trying roughly the same thing as above, but seeing if CPU makes a difference. So: VM spec now has max CPUs = max memSlots = 7.

CPU + 7 slots actually makes it more stable. Going back to just memory, see if 7 slots is the difference.

**Back to version B**

Trying again with 1 CPU, max memSlots = 7. Looks like that's *also* stable.

So: try with large max memory slots? (we use 28 on prod!). Result: Also stable. Weird.

Trying the original max=10. Result: Also stable. Maybe it was because of 0.25 CPU? Trying that now.

0.25 CPU is *not* stable (!!) - results in lots of page migration failures, plus killing processes when they're not actually using much memory.

-> Actually, it's now killing before 1024, when there should be >= 2 memory slots.
-> Checking /proc/meminfo shows MemTotal = 946136 kB, or ~1 GiB.

Next up: 0.05 CPU, see if that's even more volatile. Result: *maybe*? Hard to say. It's very slow.

Trying again with 0.1 CPU. Result: Similar instability; had to stop once the familiar "attempt to add duplicate property" error showed up.

Modifying command to:

```sh
while true; do cat /proc/meminfo | grep MemTotal; allocate-loop 1024 1536; done
```

Now trying with a larger difference between scaling values. Say spec max = 10 still, but repro-loop min=2 / max=6. Bumping DELAY from 5 -> 10s to give it more time.

Result: About the same as before. Same instability, but still no panic, and eventually got

On retry: Got the VM to completely stall for ~1 minute (maybe 2?). Obviously... not great. Needs more investigation. Also got it to have the "MemTotal reset back to 1 GiB" issue.

New strategy: Exit early if we detect that the total reset, so maybe we can see what's up with that?

```sh
while true; do
    total="$(cat /proc/meminfo | grep MemTotal | grep -oE [[:digit:]]+)"
    echo "total = $total KiB"
    if [ "$total" = "946136" ]; then exit 1; fi
    allocate-loop 1024 1536
done
```

It *did* work, and correctly reproduces the "MemTotal reset" issue, but didn't repro the kernel panic.

A new attempt: Try to run many of these at once:

```sh
# Expanded:
loop_loop () (
    id="$1"
    while true; do
        total="$(cat /proc/meminfo | grep MemTotal | grep -oE [[:digit:]]+)"
        echo "id=$id: total = $total KiB"
        if [ "$total" = 946136 ]; then exit 1; fi
        allocate-loop 256 384 # 1024/4, 1536/4
    done
)
for i in $(seq 4); do loop_loop $i & done

# Collapsed:
loop_loop () ( id="$1"; while true; do total="$(cat /proc/meminfo | grep MemTotal | grep -oE [[:digit:]]+)"; echo "id=$id: total = $total KiB"; if [ "$total" = 946136 ]; then exit 1; fi ; allocate-loop 128 384; done )
for i in $(seq 8); do loop_loop $i & done
```

Broadly: doesn't work, unfortunately. Very good at triggering OOM, and same "MemTotal reset".

Trying with better neonvm-controller debugging; maybe QEMU has more output to give us. Result: Opened #451 from the analysis. Didn't get much else.

New plan: Try with *many* more processes, more memory slots, and *somewhat* more CPUs. Maybe this will be closer to prod.

```sh
loop_loop () ( id="$1"; while true; do total="$(cat /proc/meminfo | grep MemTotal | grep -oE [[:digit:]]+)"; echo "id=$id: total = $total KiB"; if [ "$total" = 946136 ]; then exit 1; fi ; allocate-loop 128 256; done )
for i in $(seq 32); do loop_loop $i & done
```

Setting cpus = { min: 1, use: 1.25, max: 2 } (I suspect that fractional CPU has an effect). And memorySlots = { min: 1, use: 1, max: 20 }, with repro-loop MIN/MAX = 16/20.

## 2023-08-02

Got an idea from Vadim to try with a tmpfs. We'll see if that works.

So: Using `repro-interactive.sh`. The general idea is to fill up a tmpfs, and then try to downscale below the usage.

To create the tmpfs:

```sh
mkdir tmp && mount -t tmpfs -o size=10G none tmp
```

To fill it up to a certain number of gibibytes:

```sh
fill_tmpfs () (
    rm -f tmp/file
    touch tmp/file
    c="$(echo "( $1 ) * 1024^2" | bc | grep -oE '^[[:digit:]]+')"
    echo "copying $c KiB"
    dd if=/dev/zero of=tmp/file bs=1K count="$c"
)
```

Initial notes: Filling a 1 GiB tmpfs with less than 1 GiB total memory is... not good. The kernel will start OOM-looping, and there's really just no way to recover.

First test: Scale up to 6 GiB, fill tmpfs to 3 GiB, scale down to 4 GiB. Result: ok. Continuing...

Scale back up to 6 GiB, fill to 5 GiB, reduce tmpfs to 3.5 GiB, scale down to 4 GiB. Result: Kernel panic!

Ok, now let's see if we can make the reproduction smaller.

Trying: Scale up to 6 GiB, fill tmpfs to 4 GiB, remove tmpfs *file*, scale down to 2 GiB. Result: Did *not* kernel panic.

Trying: Scale up to 6 GiB, fill tmpfs to 4 GiB, reduce tmpfs to 2 GiB, scale down to 4 GiB. Result: Did not kernel panic.

Wondering if there's some "down-up-down" pattern requirement?

Trying: Scale 6 GiB, tmpfs 3 GiB, scale 4 GiB, scale 6 GiB, tmpfs 5 GiB, tmpfs 3.5 GiB, scale 4 GiB. Result: Did not panic (??)

Tried again: *Did* panic; maybe the previous run was a fluke?

Tried again: Also panicked again. Guessing that the earlier non-panic was a fluke. Yay!

Now: trying with an updated kernel to see if that helps. Result: It didn't. :'(

---

## 2023-08-13

Wrote alloc-helper.c, hoping to retry to see if Konstantin's patch should fix the kernel panic issue. Going to try it two ways: with changes to size going through 0 first (which is _basically_ what the tmpfs thing did), or directly.

First issue: OOM when trying to make the mmap. Need to tweak the memory overcommit setting, I think.

## 2023-08-14

Ran out of time yesterday, continuing.

It looks like `overcommit_ratio` is 50, for some reason, which explains why the mmap calls were failing. To fix, need to `echo 1 > /proc/sys/vm/overcommit_memory` on startup.

With spec.memorySlots.max = 10:

Trying: Scale 6 GiB, mmap 3 GiB, scale 4 GiB, scale 6 GiB, mmap 3→0→5 GiB, mmap 5→0→3.5 GiB, scale 4 GiB. Result: OOM-killed, but no kernel panic!

Trying exactly the same again. Result: Same.

Trying the same, but without transit through 0 for mmap. Result: Same.

And, just to be sure, retry the original tmpfs reproduction with vm.overcommit_memory = 1. Result: Kernel panic!

```
$ kubectl get -w neonvm
postgres14-disk-test   100m   1Gi   postgres14-disk-test-295fb   Running   26s
postgres14-disk-test   100m   1Gi   postgres14-disk-test-295fb   Running   54s
postgres14-disk-test   100m   1Gi   postgres14-disk-test-295fb   Scaling   54s
postgres14-disk-test   100m   1Gi   postgres14-disk-test-295fb   Running   54s
postgres14-disk-test   100m   2Gi   postgres14-disk-test-295fb   Running   54s
```
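
For reference, here is the shortest sequence from the log above that reliably panicked, condensed into one guest-side sketch. It assumes the `fill_tmpfs` helper defined above, and that the memory-slot changes are driven from the host with `repro-interactive.sh` (or `kubectl patch`, as in the attached scripts); the "GiB" values assume one memory slot per GiB, as in the log.

```sh
# Guest: set up the tmpfs once.
mkdir tmp && mount -t tmpfs -o size=10G none tmp

# Then interleave host-side scaling and guest-side tmpfs fills, following the log:
#   1. scale to 6 GiB       (host: set /spec/guest/memorySlots/use to 6)
#   2. fill_tmpfs 3         (guest)
#   3. scale to 4 GiB       (host)
#   4. scale to 6 GiB       (host)
#   5. fill_tmpfs 5         (guest)
#   6. fill_tmpfs 3.5       (guest)
#   7. scale to 4 GiB       (host)  <- kernel panic observed at this step
```
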
repro-loop.sh (unused in final reproduction steps)

```bash
#!/usr/bin/env bash
#
# Given a local cluster, try to repeatedly change memory usage for postgres14-disk-test, in order to
# (hopefully) trigger a kernel panic.

VM_NAME=postgres14-disk-test

MIN=16
MAX=20
DELAY='15s'

submit_patch_mem () {
    slots="$1"
    patch='[{ "op": "replace", "path": "/spec/guest/memorySlots/use", "value": '"$slots"' }]'
    echo -n "patching to $slots memory slots... "
    kubectl patch neonvm "$VM_NAME" --type=json --patch "$patch"
}

submit_patch_both () {
    cu="$1"
    patch='[
        { "op": "replace", "path": "/spec/guest/cpus/use", "value": '"$cu"' },
        { "op": "replace", "path": "/spec/guest/memorySlots/use", "value": '"$cu"' }
    ]'
    echo -n "patching to $cu CU... "
    kubectl patch neonvm "$VM_NAME" --type=json --patch "$patch"
}

# VERSION A: Didn't work.
# while true; do
#     slots=$(( $RANDOM % ($MAX - $MIN + 1) + $MIN ))
#     submit_patch_mem "$slots"
#
#     sleep "$DELAY"
# done

# VERSION B: Maybe closer? Easier to get bad behavior.
while true; do
    submit_patch_mem "$MIN"
    sleep "$DELAY"
    submit_patch_mem "$MAX"
    sleep "$DELAY"
done

# VERSION C:
# while true; do
#     submit_patch_both "$MIN"
#     sleep "$DELAY"
#     submit_patch_both "$MAX"
#     sleep "$DELAY"
# done
```
repro-interactive.sh

```bash
#!/usr/bin/env bash

VM_NAME=postgres14-disk-test

submit_patch_mem () {
    slots="$1"
    patch='[{ "op": "replace", "path": "/spec/guest/memorySlots/use", "value": '"$slots"' }]'
    echo -n "$(date --utc --iso-8601=seconds) :: Patching to $slots memory slots... "
    kubectl patch neonvm "$VM_NAME" --type=json --patch "$patch"
}

while true; do
    echo -n 'Memory slots: '
    read -r slots
    if [ ! -z "$slots" ]; then
        submit_patch_mem "$slots"
    fi
done
```
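
When using either script, it can help to confirm that a patch actually landed. A small sketch, reading back the same field the scripts modify:

```sh
# Read back the requested slot count after a patch:
kubectl get neonvm postgres14-disk-test -o jsonpath='{.spec.guest.memorySlots.use}{"\n"}'

# Or watch the VM as it scales (as in the output captured in the log above):
kubectl get -w neonvm postgres14-disk-test
```
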
alloc-helper.c

```c
#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <err.h>
#include <sys/mman.h>

size_t KIB = (1 << 10);
size_t MIB = (1 << 20);
size_t GIB = (1 << 30);

// https://github.com/neondatabase/postgres/blob/026d6b093d49e25cec44dd04598152329ceac027/src/include/portability/mem.h#L41
#define PG_MMAP_FLAGS (MAP_SHARED|MAP_ANONYMOUS)

char* initial_alloc(size_t size) {
    // Roughly matches what postgres's mmap-based shmem allocation uses, with the notable difference
    // that we're not trying to do huge pages here, whereas postgres does.
    //
    // ref https://github.com/neondatabase/postgres/blob/026d6b093d49e25cec44dd04598152329ceac027/src/backend/port/sysv_shmem.c#L624-L625
    return (char*)mmap(NULL, size, PROT_READ | PROT_WRITE, PG_MMAP_FLAGS, -1, 0);
}

void set_size(char* base, size_t total, size_t current, size_t desired) {
    if (desired > total) {
        fprintf(stderr, "desired size greater than total: %zu > %zu\n", desired, total);
        exit(1);
    }

    // if we need to allocate more, just fill in the allocation:
    if (current < desired) {
        for (size_t i = current; i < desired; i++) {
            base[i] = i;
        }
    }
    // otherwise, free existing memory.
    else if (current > desired) {
        if (madvise(base + desired, current - desired, MADV_REMOVE) == -1)
            err(errno, "madvise");
    }
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        fprintf(stderr, "bad arguments\n");
        return 1;
    }

    double total_size_gib;
    if (sscanf(argv[1], "%lf", &total_size_gib) != 1) {
        fprintf(stderr, "bad argument: must be a floating-point number of GiB\n");
        return 1;
    }
    size_t total_size = (size_t)(total_size_gib * (double)GIB);

    printf("creating initial allocation with total size %zu MiB\n", total_size / MIB);
    char* base_alloc = initial_alloc(total_size);
    if (base_alloc == (void *)-1) {
        err(errno, "mmap");
    }

    size_t current_size = 0;

    while (1) {
        printf("new desired size in GiB: ");
        fflush(stdout);

        char line[256];
        if (fgets(line, sizeof line, stdin) == NULL) {
            printf("\n");
            return 0;
        }

        double desired_size_gib;
        if (sscanf(&line[0], "%lf", &desired_size_gib) != 1) {
            fprintf(stderr, "bad input: expected floating-point number of GiB\n");
            continue;
        }
        size_t desired_size = (size_t)(desired_size_gib * (double)GIB);

        printf("updating current size from %zu MiB to %zu MiB\n", current_size / MIB, desired_size / MIB);
        set_size(base_alloc, total_size, current_size, desired_size);
        current_size = desired_size;
    }
}
```
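
A brief usage sketch for the helper; the compile command is an assumption (any C compiler available in the guest image works), while the sysctl is the one noted in the log above:

```sh
# Build the helper inside the guest:
gcc -O2 -o alloc-helper alloc-helper.c

# The log notes that mmap fails under the default overcommit policy, so allow overcommit first:
echo 1 > /proc/sys/vm/overcommit_memory

# Reserve a 6 GiB mapping, then type new sizes (in GiB) at the prompt to grow or shrink it:
./alloc-helper 6
```
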
vadim2404 commented 1 year ago

Should we merge the mmap version of the cache and remove our tmpfs hacks?

sharnoff commented 1 year ago

Yes, once reviewed & passing tests, etc. I'm not familiar enough with C to review it.

sharnoff commented 1 year ago

This has been mitigated by moving the file cache to disk, but the underlying kernel issue is still present, so: leaving this issue open, but reducing priority.

sharnoff commented 10 months ago

We recently updated the kernel (#638), and should check whether this issue is now fixed because of it.