neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Roll out safekeeper eviction with `--delete-offloaded-wal` #6338

Open petuhovskiy opened 8 months ago

petuhovskiy commented 8 months ago

Rough roll-out plan:

  1. Switch on --enable-offload in staging regions, observe for about a week
  2. Switch on --enable-offload in prod regions one by one, observe for about a week
  3. Switch on --delete-offloaded-wal in staging regions, manually trigger uneviction
  4. Switch on --delete-offloaded-wal in prod regions, manually trigger uneviction

petuhovskiy commented 1 month ago

--enable-offload was enabled in one staging region for a week. It helped uncover some issues (https://neondb.slack.com/archives/C033RQ5SPDH/p1720601531744029); the fix PRs are waiting to be merged. The main issue seems to be resource overload, so it would be good to limit offloading to a lower rate to reduce the load it causes.
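To make the "limit offloading to a lower rate" idea concrete, here is a minimal token-bucket sketch. It is only an illustration of the general technique, not how safekeepers implement or plan to implement throttling; `TokenBucket`, `try_acquire`, and the idea that one token corresponds to one offloaded segment are all assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical limiter: bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start full, so an initial burst is allowed
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; return False instead of blocking."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

An offload loop would call `try_acquire()` before each upload (assumed here to be one WAL segment) and postpone the upload when it returns `False`, which caps the sustained rate at `rate` uploads per second while still allowing short bursts.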

Once the fixes are merged, we can deploy --enable-offload to a single prod region and verify it there. After verification, --delete-offloaded-wal can be rolled out in staging, and then in all prod regions.

I'd say that, without any rush, we can expect this to be rolled out everywhere in ~3 weeks.

petuhovskiy commented 1 month ago

Just talked with @arssher; it turns out pull_timeline interferes with eviction and needs more fixes. So my estimate probably slips by 1+ weeks.