Problem

On some pageservers, spawning the walredo process takes more than 1s.

Investigation Results

- Customer investigation: https://neondb.slack.com/archives/C033RQ5SPDH/p1706787518630459?thread_ts=1706774416.482029&cid=C033RQ5SPDH
- walredo spawn latency is bimodal on most pageservers: some spawns are fast, taking tens of milliseconds; others are slow, taking multiple seconds.
- Although the Rust stdlib uses the efficient posix_spawn by default, pageservers do not take that path: close_fds() uses pre_exec(), which forces the stdlib onto its fork()+exec() fallback.
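
To make the last point concrete, here is a minimal sketch (not the pageserver code) that times a spawn with and without a `pre_exec` hook. The helper names `timed_spawn`, `spawn_plain`, and `spawn_with_pre_exec` are hypothetical, and the no-op closure only stands in for whatever `close_fds()` registers. Registering any `pre_exec` closure is enough for `std::process::Command` to skip `posix_spawn`.

```rust
use std::io;
use std::os::unix::process::CommandExt;
use std::process::{Command, ExitStatus};
use std::time::{Duration, Instant};

/// Spawn the command, measure only the time spent in `spawn()`, then reap the child.
fn timed_spawn(cmd: &mut Command) -> io::Result<(ExitStatus, Duration)> {
    let start = Instant::now();
    let mut child = cmd.spawn()?;
    let spawn_latency = start.elapsed();
    Ok((child.wait()?, spawn_latency))
}

/// No hooks registered: the stdlib may use its posix_spawn fast path.
fn spawn_plain(program: &str) -> io::Result<(ExitStatus, Duration)> {
    timed_spawn(&mut Command::new(program))
}

/// A pre_exec hook is registered: the stdlib falls back to fork() + exec().
fn spawn_with_pre_exec(program: &str) -> io::Result<(ExitStatus, Duration)> {
    let mut cmd = Command::new(program);
    // SAFETY: the closure runs in the forked child before exec and does nothing.
    // It is a placeholder for the real fd-closing hook (an assumption of this sketch).
    unsafe {
        cmd.pre_exec(|| Ok(()));
    }
    timed_spawn(&mut cmd)
}
```

In a small test binary the two latencies will likely look similar. The gap is expected to show up only in a process with a large address space, such as a pageserver, so any comparison should be run there.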

DoD

walredo process spawning latency is predictable
acquisition of a walredo process for page reconstruction is < XXX milliseconds

Plan

Explore whether we can use posix_spawn; if so, ship to staging and observe whether it is a sufficient improvement. We can move the close_fds work into walredo startup, where we still trust the process.
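
As a rough illustration of moving the fd cleanup into the child, the sketch below closes inherited descriptors from inside the process itself at startup. It is written in Rust for consistency with the pageserver, although the real walredo startup code is C. It assumes Linux (`/proc/self/fd`), and the names `open_fds` and `close_fds_except` are hypothetical.

```rust
use std::fs;
use std::io;
use std::os::fd::{FromRawFd, OwnedFd, RawFd};

/// List the file descriptors currently open in this process (Linux only).
fn open_fds() -> io::Result<Vec<RawFd>> {
    let mut fds = Vec::new();
    for entry in fs::read_dir("/proc/self/fd")? {
        let name = entry?.file_name();
        if let Some(fd) = name.to_str().and_then(|s| s.parse::<RawFd>().ok()) {
            fds.push(fd);
        }
    }
    // The directory handle used for the listing is closed by now; drop its stale entry.
    fds.retain(|fd| fs::symlink_metadata(format!("/proc/self/fd/{fd}")).is_ok());
    fds.sort_unstable();
    Ok(fds)
}

/// Close every open descriptor not listed in `keep`; returns how many were closed.
///
/// # Safety
/// Nothing else in the process may still own or use the descriptors being closed.
/// That holds at the very start of a freshly spawned child, before it opens anything.
unsafe fn close_fds_except(keep: &[RawFd]) -> io::Result<usize> {
    let mut closed = 0;
    for fd in open_fds()? {
        if !keep.contains(&fd) {
            // Taking ownership and dropping immediately closes the descriptor.
            drop(unsafe { OwnedFd::from_raw_fd(fd) });
            closed += 1;
        }
    }
    Ok(closed)
}
```

A child would call something like `close_fds_except(&[0, 1, 2])` as its first action, so the parent no longer needs a `pre_exec` hook and can stay on the posix_spawn path.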

If posix_spawn can't be used, implement a sidecar "spawner" process that pageserver asks to spawn walredo processes.

Option 1: extend the existing walredo C code to enter "template" mode.
Option 2: fork off a pageserver child process that will act as the spawner process.

NB: we decided against a pool of pre-spawned walredo processes, as the amount of CPU wasted on the inefficient fork() call is significant.

Background Reading

Work

### Solve The Issue
- [ ] https://github.com/neondatabase/neon/pull/6573
- [ ] https://github.com/neondatabase/neon/pull/6574
- [ ] https://github.com/neondatabase/neon/issues/6630
- [x] measure impact in staging & prod => merge above preliminary work to get better observability
- [x] it's good, we wrote a blog post about it

### Follow-Ups
- [ ] https://github.com/neondatabase/neon/issues/6580

pageserver: spawning walredo process is slow #6565