The pageserver currently does not limit the user's write flow. Note that the pageserver runs both foreground jobs (e.g., safekeeper WAL ingest and page reads) and background jobs (compaction, GC). If we don't apply backpressure, background jobs get no resources to run, which in turn slows down the foreground jobs, creating a vicious cycle. The long-term goal is to ensure that the pageserver only accepts as much work as it can actually handle.
A quick idea is to borrow RocksDB's backpressure mechanism, which stalls writes when the number of L0 SSTs exceeds a threshold.
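A minimal sketch of that idea in Rust, mirroring RocksDB's slowdown/stop trigger pair. The threshold values, the `ingest_delay` helper, and the assumption that the caller already knows the current L0 layer count are all illustrative, not the pageserver's actual API:

```rust
use std::time::Duration;

/// Illustrative thresholds (not real pageserver config).
const L0_SLOWDOWN_THRESHOLD: usize = 20; // start delaying ingest
const L0_STALL_THRESHOLD: usize = 40; // stop ingest until compaction catches up

/// Decide how long to delay an incoming WAL ingest batch based on L0 pressure.
/// Returns `None` when ingest should be blocked until the L0 count drops.
fn ingest_delay(l0_count: usize) -> Option<Duration> {
    if l0_count >= L0_STALL_THRESHOLD {
        // Full stall: caller waits for a compaction-completed notification.
        None
    } else if l0_count >= L0_SLOWDOWN_THRESHOLD {
        // Scale the delay linearly between the slowdown and stall thresholds.
        let over = (l0_count - L0_SLOWDOWN_THRESHOLD) as u64;
        let range = (L0_STALL_THRESHOLD - L0_SLOWDOWN_THRESHOLD) as u64;
        Some(Duration::from_millis(over * 100 / range))
    } else {
        // Below the slowdown threshold: no backpressure.
        Some(Duration::ZERO)
    }
}
```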
For the L0 stacking problem: this may depend on other compaction design decisions. One option is to trigger compaction on LSN advance rather than on time -- that way, faster-writing tenants get compacted more often (see the sketch below).
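A rough sketch of LSN-driven triggering, assuming a hypothetical per-timeline record of the last-compacted LSN; the struct and field names are made up for illustration:

```rust
/// Hypothetical per-timeline state for LSN-driven compaction scheduling.
struct CompactionState {
    /// LSN at which the last compaction ran.
    last_compacted_lsn: u64,
    /// Trigger compaction after this many bytes of WAL have been ingested
    /// since the last compaction, instead of on a fixed timer.
    compaction_lsn_interval: u64,
}

impl CompactionState {
    /// Faster-writing tenants advance their LSN more quickly and therefore
    /// hit this condition sooner, so they get compacted more frequently.
    fn should_compact(&self, current_lsn: u64) -> bool {
        current_lsn.saturating_sub(self.last_compacted_lsn) >= self.compaction_lsn_interval
    }
}
```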
Our existing mitigation for L0 compaction (compacting only 10 L0 layers at a time) keeps us safe for now.
Follow-up on https://neondb.slack.com/archives/C03F5SM1N02/p1721058880447979