neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.

Epic: per-tenant read path throttling #5899

Status: Closed (problame closed this issue 2 weeks ago)

problame commented 9 months ago

Motivation

See #5648

tl;dr: we currently serve reads (and writes, see #7564) as fast as possible. This sets the wrong incentives and poses operational and economic problems.

DoD (Definition of Done)

Pageserver artificially caps the per-tenant throughput on the read path. I.e., to all upstream Neon components, this cap appears as the maximum read performance available per tenant per pageserver.

The limits will be chosen such that a TBD (small single-digit) number of tenants can run at the limit. Discovery of the limit values happens through gradual rollout and conservative experimentation, informed by benchmarks.

The upstream (compute) responds to the limit-induced backpressure efficiently, gracefully, and without risk of starvation.

There is enough observability to clearly disambiguate slowness induced by the limiting from slowness caused by an otherwise slow pageserver. This disambiguation must be possible at per-tenant (better: per-timeline) granularity.

The throttle is on by default and cannot be permanently overridden on a per-tenant basis. I.e., the implementation need not be suitable for productization as a "performance tier" or "QoS" feature.
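
To make the DoD more concrete: the actual mechanism is specified in RFC #5648 and, per the plan below, sits inside Timeline::get. Purely as an illustration, a per-tenant token-bucket throttle that also reports how long each request was delayed (the signal the observability point above needs) could look like the following sketch. The type and method names (`ReadThrottle`, `acquire`) and the concrete rate/burst numbers are hypothetical and not the pageserver's actual API.

```rust
use std::time::{Duration, Instant};

/// Minimal token-bucket throttle: `rate` requests per second with a burst
/// capacity of `burst` requests (hypothetical sketch, not the real
/// pageserver implementation).
struct ReadThrottle {
    rate: f64,            // tokens (requests) refilled per second
    burst: f64,           // maximum number of accumulated tokens
    tokens: f64,          // currently available tokens
    last_refill: Instant, // last time `tokens` was updated
}

impl ReadThrottle {
    fn new(rate: f64, burst: f64) -> Self {
        Self { rate, burst, tokens: burst, last_refill: Instant::now() }
    }

    /// Take one token, sleeping until one is available.
    /// Returns the time this request spent waiting on the throttle,
    /// so callers can expose it as a per-tenant/per-timeline metric.
    fn acquire(&mut self) -> Duration {
        // Refill tokens based on elapsed wall-clock time, capped at `burst`.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            return Duration::ZERO;
        }
        // Not enough tokens: wait until the deficit has been refilled.
        let deficit = 1.0 - self.tokens;
        let wait = Duration::from_secs_f64(deficit / self.rate);
        std::thread::sleep(wait);
        self.tokens = 0.0; // the refilled token is consumed by this request
        self.last_refill = Instant::now();
        wait
    }
}

fn main() {
    // Hypothetical limit: 10_000 get-page requests/s with a burst of 1_000.
    let mut throttle = ReadThrottle::new(10_000.0, 1_000.0);
    let mut total_wait = Duration::ZERO;
    for _ in 0..5_000 {
        total_wait += throttle.acquire();
        // ... serve the get-page request here ...
    }
    println!("time spent throttled: {total_wait:?}");
}
```

Returning the wait duration from `acquire` keeps the "was this request slow because of the throttle?" question answerable per request, which is what the per-tenant disambiguation requirement above relies on.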

Interactions

Sharding: with sharding, the above limits will be per shard instead of per tenant. We may still need to (re-)introduce per-tenant limits within a single pageserver process to incentivize placing shards across different nodes for increased performance and load spreading, but that is subject to future work.

High-Level Plan

### High Level
- [x] implement get-page benchmark (#5771)
- [x] implement get-page throttling mechanism (RFC: #5648)
- [x] ship code to staging & prod (disabled at runtime, but, minimal overhead)
- [x] enable throughout staging with values slightly below the limit, see how the nightly benchmarks & Peter's benchmarks react
- [x] decision on the whole metrics situation; goes hand in hand with the decision on where the throttle should be applied: inside Timeline::get or higher up the stack? (see the metrics sketch after this plan)
- [x] ~~selective enablement for the problematic 20k rps tenants that should just use neonvm, gain more experience~~
- [x] decision on the location of the throttle: inside Timeline::get as it is now, or one throttle per page_service endpoint; DECISION: PR #6953 declares this future work, the throttle remains inside Timeline::get for now
- [x] Pageserver Operations Page: https://www.notion.so/neondatabase/Pageserver-Per-Tenant-Read-Throttling-2b941a3e46234285949ee4a10366fbbc?pvs=4
- [ ] enable gradually, starting with a high default, then lowering it to the value where we want to be
### Impl
- [ ] https://github.com/neondatabase/neon/pull/6640
- [ ] https://github.com/neondatabase/neon/pull/6706
- [ ] https://github.com/neondatabase/aws/pull/1048
- [ ] https://github.com/neondatabase/neon/pull/6869
- [ ] https://github.com/neondatabase/aws/pull/1054
- [ ] https://github.com/neondatabase/neon/pull/6953
- [ ] https://github.com/neondatabase/aws/pull/1124
- [ ] https://github.com/neondatabase/neon/pull/7072
- [ ] https://github.com/neondatabase/aws/pull/1125
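
For the metrics item in the plan above: the DoD asks that throttle-induced slowness be distinguishable from a genuinely slow pageserver at per-tenant (better: per-timeline) granularity. Purely as a hedged illustration of one way to expose that signal, the sketch below registers a counter of throttled requests and of accumulated wait seconds, labeled by tenant and timeline, using the prometheus crate. The metric names, label set, and wiring are assumptions for illustration only; they are not the metrics the pageserver actually exports.

```rust
// Cargo.toml (assumption): prometheus = "0.13"
use prometheus::{register_counter_vec, register_int_counter_vec, Encoder, TextEncoder};

fn main() {
    // Count of get-page requests that were delayed by the per-tenant throttle.
    let throttled_total = register_int_counter_vec!(
        "pageserver_getpage_throttled_total", // hypothetical metric name
        "get-page requests delayed by the per-tenant read throttle",
        &["tenant_id", "timeline_id"]
    )
    .unwrap();

    // Accumulated wall-clock time the throttle added, in seconds.
    let throttled_seconds = register_counter_vec!(
        "pageserver_getpage_throttled_seconds_total", // hypothetical metric name
        "seconds spent waiting on the per-tenant read throttle",
        &["tenant_id", "timeline_id"]
    )
    .unwrap();

    // In the request path, after the throttle reports how long it delayed us:
    let wait_secs = 0.0042; // e.g. the Duration returned by the throttle sketch above
    if wait_secs > 0.0 {
        throttled_total
            .with_label_values(&["tenant-a", "timeline-1"])
            .inc();
        throttled_seconds
            .with_label_values(&["tenant-a", "timeline-1"])
            .inc_by(wait_secs);
    }

    // Render the default registry, as a /metrics endpoint would.
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .unwrap();
    println!("{}", String::from_utf8(buf).unwrap());
}
```

With something along these lines, a dashboard can divide throttled seconds by request count per tenant and immediately tell whether observed latency comes from the limiter or from the pageserver itself.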
jcsp commented 7 months ago

Status:

jcsp commented 7 months ago

Initial draft PR is up for review -- could land this week.

Testing:

jcsp commented 6 months ago

Status:

jcsp commented 6 months ago

This week:

problame commented 6 months ago
jcsp commented 6 months ago

Status:

problame commented 5 months ago

Apart from

Reconciling with the vectored get changes to ensure we aren't double throttling.

nothing happened last week.

This week:

problame commented 5 months ago

Status update:

This week:

problame commented 5 months ago

Status update:

problame commented 4 months ago

I split off the write throttling aspect of this epic into a separate draft epic: https://github.com/neondatabase/neon/issues/7564

(We do not expect to work on write throttling this quarter)

problame commented 2 weeks ago

Closing this epic; the development work finished long ago.

The last item

enable gradually, starting with a high default, then lowering it to the value where we want to be

was, and still is, dependent on the sharding + sharded ingest rollout, so that users who hit the throttle have the option to acquire more IOPS through sharding as needed.