Closed: arssher closed this issue 10 months ago
The OOM logs are interesting. I've duplicated them here with Grafana's timestamps for posterity:
Things that jump out:
I'm not sure I'm reading that correctly, especially given that 1.95 + 7.21 adds up to more than 8.99 (the components shouldn't exceed the total), but broadly pagecache is very high. (I think that matches @knizhnik's analysis here?)
Some other information:
Memory usage, according to the metrics that the autoscaler-agent was receiving, looked consistent:
So: even though memory was behaving in roughly the same way each time (and the VM did scale up from 8 → 9 GiB RAM [agent request at 18:19:01]), presumably we just crossed some threshold? The pattern of "gradual increase, then drop-off to the baseline" occurred every 3.5 minutes, and the peak memory we observed was trending slightly higher each time. That said, this would mean that the VM's memory usage sharply increased at the end of the spike and we never observed it, which... may be true? I'm not sure.
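In case it helps future debugging, here is a minimal sketch of how to check the pagecache theory from inside the VM. The cgroup paths are an assumption (cgroup v2 layout) and may differ in our images:

```sh
# System-wide view: how much memory is pagecache ("Cached") vs. still available?
grep -E '^(MemTotal|MemAvailable|Cached):' /proc/meminfo

# Per-cgroup view, assuming a cgroup v2 layout (adjust the path to the
# compute's cgroup; the root cgroup may not expose these files).
# "anon" is heap/stack memory, "file" is pagecache. If "file" dominates,
# the kernel should normally be able to reclaim it before OOM-killing anything,
# which is what makes these OOMs surprising.
grep -E '^(anon|file) ' /sys/fs/cgroup/memory.stat
```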
Getting OOM on a simple pgbench init workload: https://neonprod.grafana.net/goto/gCtQGCeVR?orgId=1
It worked before.
command used: pgbench -s300 -i -I dtGvp
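For reference, the same init invocation with the -I steps spelled out (the connection string is a placeholder):

```sh
# -s 300:   scale factor 300, roughly 4-5 GB of pgbench data
# -i:       initialize only (no benchmark run)
# -I dtGvp: d = drop existing pgbench tables
#           t = create the tables
#           G = generate the data server-side (single INSERT ... generate_series)
#           v = vacuum
#           p = create primary keys
pgbench -s 300 -i -I dtGvp "$CONNECTION_STRING"
```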
I think this issue was due to OOM issues that have since been fixed (iirc, with neondatabase/neon#5333). In light of that, closing this — we can reopen if it persists.
The user (and I) consistently get OOM when running pgvector CREATE INDEX ... USING ivfflat on a 15 GB table. The OOM logs show that the compute wasn't upscaled to the max CU. Example (here the max CU is 4, i.e. 16 GB, but the OOM happened at 9 GB): https://neonprod.grafana.net/goto/EAgSWo3Vg?orgId=1
Endpoint: https://console.neon.tech/admin/endpoints/ep-sweet-feather-898825
A bit more context: https://neondb.slack.com/archives/C03TN5G758R/p1690319779712199?thread_ts=1690313755.185089&cid=C03TN5G758R
I have access to the user's DB and can pass it to someone to repro.
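For whoever picks up the repro, a minimal sketch of the failing operation; the table and column names, the lists value, and the connection string are placeholders, not the user's actual schema:

```sh
psql "$CONNECTION_STRING" <<'SQL'
-- The ivfflat build does its k-means sampling and sorting in memory, so
-- maintenance_work_mem is one knob that directly affects peak memory
-- during CREATE INDEX.
SET maintenance_work_mem = '2GB';

-- Placeholder table/column; the real index is on the user's ~15 GB table.
CREATE INDEX ON items USING ivfflat (embedding vector_l2_ops) WITH (lists = 1000);
SQL
```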