Closed: koivunej closed this issue 10 months ago.
Increasing `oom_score_adj` for each `postgres --wal-redo` process would probably have a negative effect: if the memory usage of one of them spiked, it would become a very likely candidate to be killed, and we would just retry.
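For context, a minimal sketch of what the `oom_score_adj` route would look like from the parent's side, assuming the parent writes the child's `/proc` entry right after spawning it. This is not the pageserver's actual code, and the adjustment value is made up:

```c
#include <stdio.h>
#include <sys/types.h>

/*
 * Sketch only: after spawning a `postgres --wal-redo` child, the parent
 * could raise its oom_score_adj so the OOM killer prefers it over the
 * pageserver itself. Valid values are -1000..1000; the caller picks one.
 */
static int
set_oom_score_adj(pid_t child_pid, int adj)
{
    char    path[64];
    FILE   *f;

    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int) child_pid);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%d\n", adj);
    return fclose(f);
}
```

Raising the value makes the process more likely to be chosen by the OOM killer; whether that kill-and-retry cycle is acceptable is exactly the trade-off discussed above.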
Adding relevant discussion from Slack: https://neondb.slack.com/archives/C03H1K0PGKH/p1678107912052379
Because #3739 was merged, all that remains is dying on a read timeout iff we have completely handled sending the page, i.e. we must not die between the start...walrecords messages and getpage.
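As a sketch of the shape this takes (not the actual walredoproc.c code; the timeout value and the helper name are made up), the idle timeout only applies while waiting for the next request, never in the middle of handling one:

```c
#include <poll.h>
#include <stdlib.h>
#include <unistd.h>

#define IDLE_TIMEOUT_MS (5 * 60 * 1000)   /* made-up value */

/* Hypothetical stand-in: read one full redo request and write the page back. */
static void
handle_one_request(void)
{
    /* ... consume all records of the request, then send the result ... */
}

static void
request_loop(void)
{
    struct pollfd pfd = { .fd = STDIN_FILENO, .events = POLLIN };

    for (;;)
    {
        /* Between requests: safe to exit if nothing arrives in time. */
        int rc = poll(&pfd, 1, IDLE_TIMEOUT_MS);

        if (rc == 0)
            exit(0);            /* idle timeout, no request in flight */
        if (rc < 0)
            exit(1);

        /*
         * A request has started: handle it to completion without the
         * timeout, so we never die after receiving records but before
         * the page has been fully sent.
         */
        handle_one_request();
    }
}
```

The point is that the only timeout exit is the `rc == 0` branch, which can only be reached between fully handled requests.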
@koivunej any objections to closing this one? Or do we have follow-ups? (I think this is a dup of #3687)
> this is dup of #3687
Well, it cannot duplicate a later issue now, can it? :)
This became relevant for the `choom` parts. Looking around, I still see an open Chromium bug about this, so I'm unsure whether it is doable: https://bugs.chromium.org/p/chromium/issues/detail?id=333617 -- especially given the RSS differences between the pageserver and the walredo processes, but there might be spikes we do not know about.
Discussed in the 2023-11-06 meeting; not going to be worked on in the near future.
Noted #5877 in the issue description. I don't think we need the `choom` route, at least currently, and it is the only part left unimplemented.
OOMs have been observed in production; related Slack threads:
My understanding of the root cause is that while the pageserver uses a somewhat conservative amount of RAM, for each been-active tenant we have a `postgres --wal-redo` process, which is about 22MB RSS when idle.

Possible solutions iterated in the threads:
- adjusting the `oom_score_adj` of `postgres --wal-redo`, via `choom` for example
- exiting `postgres --wal-redo` on a timeout from https://github.com/neondatabase/neon/blob/7991bd3b6921ccdd13f0f38085127bbe282d4f26/pgxn/neon_walredo/walredoproc.c#L823
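As a side note, the per-process figure is straightforward to check. A minimal sketch (not part of the codebase) that reads VmRSS for a given walredo pid from procfs; the 22MB above is an observed idle value, not something this code asserts:

```c
#include <stdio.h>
#include <string.h>

/* Return the resident set size of `pid` in kB, or -1 on error. */
static long
vm_rss_kb(int pid)
{
    char    path[64];
    char    line[256];
    long    rss_kb = -1;
    FILE   *f;

    snprintf(path, sizeof(path), "/proc/%d/status", pid);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (strncmp(line, "VmRSS:", 6) == 0)
        {
            sscanf(line + 6, "%ld", &rss_kb);
            break;
        }
    }
    fclose(f);
    return rss_kb;
}
```

Multiplied across every been-active tenant, even a modest idle RSS per process adds up to the aggregate memory pressure described above.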