Closed: koivunej closed this issue 10 months ago.
Increasing `oom_score_adj` for each `postgres --wal-redo` process would probably have a negative effect: if the memory usage of one of them spiked, it would become a very likely candidate to be killed, and we would just retry.
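For context, a minimal sketch of what the `oom_score_adj` route would look like from the parent's side, assuming the parent writes the child's `/proc` entry right after spawning it. This is not the pageserver's actual code, and the adjustment value is made up:

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Sketch only: after spawning a `postgres --wal-redo` child, the parent
 * could raise its oom_score_adj so the OOM killer prefers it over the
 * pageserver itself. Valid values are -1000..1000; the caller picks one.
 */
static int
set_oom_score_adj(pid_t child_pid, int adj)
{
    char    path[64];
    FILE   *f;

    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int) child_pid);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%d\n", adj);
    return fclose(f);
}
```

Raising the value makes the process more likely to be chosen by the OOM killer; whether that kill-and-retry cycle is acceptable is exactly the trade-off discussed above.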
Adding relevant discussion from Slack: https://neondb.slack.com/archives/C03H1K0PGKH/p1678107912052379
Because #3739 was merged, all that remains is dying on a read timeout iff we have completely handled sending the page, i.e. we must not die between the start...walrecords messages and getpage.
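As a sketch of the shape this takes (not the actual walredoproc.c code; the timeout value and the helper name are made up), the idle timeout only applies while waiting for the next request, never in the middle of handling one:

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

#define IDLE_TIMEOUT_MS (5 * 60 * 1000)   /* made-up value */

/* Hypothetical stand-in: read one full redo request and write the page back. */
static void
handle_one_request(void)
{
    /* ... consume all records of the request, then send the result ... */
}

static void
request_loop(void)
{
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };

    for (;;)
    {
        /* Between requests: safe to exit if nothing arrives in time. */
        int rc = poll(&pfd, 1, IDLE_TIMEOUT_MS);

        if (rc == 0)
            exit(0);            /* idle timeout, no request in flight */
        if (rc < 0)
            exit(1);

        /*
         * A request has started: handle it to completion without the
         * timeout, so we never die after receiving records but before
         * the page has been fully sent.
         */
        handle_one_request();
    }
}
```

The point is that the only timeout exit is the `rc == 0` branch, which can only be reached between fully handled requests.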
@koivunej any objections to closing this one? Or do we have follow-ups? (I think this is a dup of #3687)
> this is dup of #3687
Well, it cannot duplicate a later issue now, can it? :)
This became relevant for the `choom` parts. Looking around, I still see an open Chromium bug about this, so I'm unsure whether it is doable: https://bugs.chromium.org/p/chromium/issues/detail?id=333617 -- especially given the RSS differences between the pageserver and the walredo processes, but there might be spikes we do not know about.
Discussed in the 2023-11-06 meeting; not going to be worked on in the near future.
Noted #5877 in the issue description. I don't think we need the `choom` route, at least currently, and it is the only part left unimplemented.
OOMs have been observed in production; related Slack threads:
My understanding of the root cause is that while the pageserver uses a somewhat conservative amount of RAM, for each been-active tenant we have a `postgres --wal-redo` process, which is about 22MB RSS when idle.

Possible solutions iterated in the threads:
- adjusting the `oom_score_adj` of `postgres --wal-redo`, via `choom` for example
- exiting `postgres --wal-redo` on a timeout from https://github.com/neondatabase/neon/blob/7991bd3b6921ccdd13f0f38085127bbe282d4f26/pgxn/neon_walredo/walredoproc.c#L823
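As a side note, the per-process figure is straightforward to check. A minimal sketch (not part of the codebase) that reads VmRSS for a given walredo pid from procfs; the 22MB above is an observed idle value, not something this code asserts:

```c
#include <stdio.h>
#include <string.h>

/* Return the resident set size of `pid` in kB, or -1 on error. */
static long
vm_rss_kb(int pid)
{
    char    path[64];
    char    line[256];
    long    rss_kb = -1;
    FILE   *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
        {
            sscanf(line + 6, "%ld", &rss_kb);
            break;
        }
    }
    fclose(f);
    return rss_kb;
}
```

Multiplied across every been-active tenant, even a modest idle RSS per process adds up to the aggregate memory pressure described above.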