neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

figure out why "use a single tokio runtime" PR had problems under prod-like cloudbench #7312

Open problame opened 6 months ago

problame commented 6 months ago

https://github.com/neondatabase/neon/issues/6628#issuecomment-2025015263

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/7331

problame commented 6 months ago

spent 2h trying to assess the data we have

=> product

https://neonprod.grafana.net/d/cdhqnjut6dgqoe/2024-04-04-single-tokio-runtime-postmortem?orgId=1&from=1711386353000&to=1711470610000

snapshot as pdf for posterity

backup dashboard 2024-04-04 21-05 berlin.pdf

problame commented 6 months ago

Had an extensive session with @Bodobolero today, going over the results.

Update in this analysis dashboard panel

Summary:

Decision: we're going to do runs with a lower number of projects involved until we find a point where the pageserver isn't overloaded. "Not overloaded" is defined as:

  • the compaction iterations do not stall, i.e., they complete on time, no "took too long" log messages
  • latencies look more like in prod (<1ms), not like the 130ms we get right now
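
The two criteria above can be written down as a simple predicate, which may help when comparing runs with different project counts. This is only a sketch: the `RunMetrics` type, its field names, and the helper `largest_non_overloaded` are hypothetical, and the thread does not say which latency metric or percentile is meant, so `latency_ms` stands in for whatever the dashboard shows. The 1 ms threshold is taken from the "<1ms" figure above.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RunMetrics:
    # Hypothetical summary of one benchmark run; field names are assumptions.
    compaction_stall_warnings: int  # number of "took too long"-style log messages
    latency_ms: float               # observed latency in ms (metric unspecified in the thread)


# "<1ms" from the criteria above; treated here as a strict upper bound.
PROD_LIKE_LATENCY_MS = 1.0


def is_overloaded(m: RunMetrics) -> bool:
    """A run counts as overloaded unless both criteria hold."""
    compaction_on_time = m.compaction_stall_warnings == 0
    latency_prod_like = m.latency_ms < PROD_LIKE_LATENCY_MS
    return not (compaction_on_time and latency_prod_like)


def largest_non_overloaded(runs: Dict[int, RunMetrics]) -> Optional[int]:
    """Given project count -> metrics, return the largest count that is not overloaded."""
    ok = [count for count, m in runs.items() if not is_overloaded(m)]
    return max(ok) if ok else None
```

With something like this, each reduced-size run yields a yes/no answer, and the largest project count that still passes is a candidate for a right-sized benchmark.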

problame commented 6 months ago

Implemented an env-var configurable variant of the single tokio runtime patch for easier experimentation: https://github.com/neondatabase/neon/pull/7331

jcsp commented 6 months ago

Note: Peter's benchmark overloaded compaction even without the single runtime change. Christian is working with Peter to right-size the benchmark to reflect realistic load and will create a new ticket for that.

problame commented 6 months ago

For the record, that issue is https://github.com/neondatabase/cloud/issues/12335

jcsp commented 3 weeks ago

Next actions when we pick this up: