
bypass PageCache for `compact_level0_phase1` #8184

Open · problame opened this issue 4 months ago

problame commented 4 months ago
### Tasks
- [ ] https://github.com/neondatabase/neon/pull/8543
- [ ] https://github.com/neondatabase/aws/pull/1666
- [x] create staging log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/cdtf6y667hatcc/view
- [x] create prod log scraping alert for the warnings added: https://neonprod.grafana.net/alerting/grafana/edtf7wry2qzggb/view
- [ ] https://github.com/neondatabase/aws/pull/1663
- [ ] https://github.com/neondatabase/aws/pull/1724
- [ ] https://github.com/neondatabase/aws/pull/1743
- [ ] https://github.com/neondatabase/infra/pull/1745
- [ ] https://github.com/neondatabase/neon/pull/8769
- [ ] https://github.com/neondatabase/infra/pull/1827
- [x] pre-prod perf evaluation: https://github.com/neondatabase/neon/issues/8184#issuecomment-2315672624
- [x] rollout: first regions
- [ ] https://github.com/neondatabase/infra/pull/1883
- [x] rollout: wait for deploy & inspect results
- [ ] https://github.com/neondatabase/infra/pull/1905
- [ ] https://github.com/neondatabase/neon/pull/8933
- [ ] https://github.com/neondatabase/neon/pull/8934
- [x] wait for deploy
- [ ] https://github.com/neondatabase/infra/pull/1903
- [ ] https://github.com/neondatabase/neon/pull/8935
- [ ] wait for deploy

`compact_level0_phase1` currently uses `ValueRef::load` here, which internally uses `read_blob` with the `FileBlockReader` against the delta layer's `VirtualFile`s. This still goes through the `PageCache` for the data pages.

(We do use vectored get for `create_image_layers`, which also happens during compaction. But I missed `compact_level0_phase1`.)

### Complete PageCache Bypass

We can extend the `load_keys` step here to also load the length of each blob into memory (instead of just its offset):

https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L498-L503

This allows us to go directly to the `VirtualFile` when we use the `ValueRef` here:

https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L623
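To illustrate the idea, here is a heavily simplified sketch (not the actual pageserver types): if `load_keys` records each blob's length alongside its offset, loading the value becomes a single positioned read against the underlying file, with no PageCache involved. `ValueRef` and `load_direct` here are hypothetical stand-ins; a plain `std::fs::File` stands in for the `VirtualFile`.

```rust
use std::fs::File;
use std::io::Result;
use std::os::unix::fs::FileExt; // for read_exact_at (Unix-only)

/// Hypothetical simplified ValueRef: offset *and* length are known up
/// front, so the read does not need the PageCache to find the blob end.
struct ValueRef {
    off: u64,
    len: usize,
}

impl ValueRef {
    /// Read the blob directly from the file, bypassing any page cache.
    fn load_direct(&self, file: &File) -> Result<Vec<u8>> {
        let mut buf = vec![0u8; self.len];
        file.read_exact_at(&mut buf, self.off)?;
        Ok(buf)
    }
}
```

The key point is that knowing `len` up front removes the need to parse the blob header out of a cached page before reading the body.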

The problem with this approach: we'd lose the hypothetical benefit of the PageCache when multiple `ValueRef`s point into the same page.

Do we rely on the PageCache for performance in this case?

Yes, production shows we do have a >80% hit rate for compaction, even on very busy pageservers. One instance as an example:

(screenshot: PageCache compaction hit rate on a busy pageserver)

### Quick Fix 1: RequestContext-scoped mini page cache

In earlier experiments, I used a RequestContext-scoped mini page cache for this.

The problem with this is that if more layers need to be compacted than the mini cache has pages, it starts thrashing.
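A minimal sketch of such a mini cache (hypothetical, not the pageserver's actual implementation): a bounded map from (file id, block number) to block contents with a crude FIFO eviction policy. The thrashing failure mode is visible directly in the code path: once more distinct blocks are in flight than `capacity`, every access evicts a block that will be needed again.

```rust
use std::collections::HashMap;

/// Hypothetical RequestContext-scoped mini page cache, keyed by
/// (file id, block number). FIFO eviction keeps the sketch short;
/// a real cache would use LRU or CLOCK.
struct MiniPageCache {
    capacity: usize,
    blocks: HashMap<(u64, u32), Vec<u8>>,
    order: Vec<(u64, u32)>, // insertion order, for FIFO eviction
}

impl MiniPageCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new(), order: Vec::new() }
    }

    /// Return the cached block, reading it via `read_block` on a miss.
    /// On a full cache, the oldest block is evicted first.
    fn get_or_insert_with(
        &mut self,
        key: (u64, u32),
        read_block: impl FnOnce() -> Vec<u8>,
    ) -> &Vec<u8> {
        if !self.blocks.contains_key(&key) {
            if self.order.len() == self.capacity {
                let evicted = self.order.remove(0);
                self.blocks.remove(&evicted);
            }
            self.order.push(key);
            self.blocks.insert(key, read_block());
        }
        &self.blocks[&key]
    }
}
```

With `capacity` smaller than the working set of a compaction, each key's block is re-read on every access, which is exactly the thrashing described above.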

### Proper Fix

Use streaming compaction with iterators where each iterator caches the current block.

We do have the disk-btree async stream now.

We could wrap that stream to provide a cache for the last-read block.
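The wrapper idea can be sketched like this (hypothetical types, synchronous for brevity; the real version would wrap the async disk-btree stream): each iterator privately caches the block it last read, so consecutive values that land on the same block cost only one read, with no shared cache and no thrashing across iterators.

```rust
/// Hypothetical per-iterator block cache: remembers the single
/// last-read block. Since streaming compaction reads each layer's
/// values in order, adjacent values usually hit the same block.
struct CachingBlockReader<R> {
    read_block: R,                // underlying block-read function
    last: Option<(u32, Vec<u8>)>, // (block number, block contents)
}

impl<R: FnMut(u32) -> Vec<u8>> CachingBlockReader<R> {
    fn new(read_block: R) -> Self {
        Self { read_block, last: None }
    }

    /// Return the requested block, reading it only if it differs from
    /// the block returned by the previous call.
    fn block(&mut self, blknum: u32) -> &Vec<u8> {
        let hit = matches!(&self.last, Some((n, _)) if *n == blknum);
        if !hit {
            let data = (self.read_block)(blknum);
            self.last = Some((blknum, data));
        }
        &self.last.as_ref().unwrap().1
    }
}
```

Because the cache is owned by the iterator, its memory footprint is one block per open iterator, independent of how many layers a compaction touches.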

problame commented 3 months ago

Did some initial scouting work on this:

problame commented 3 months ago

Status update:

problame commented 2 months ago

Status update:

Plan / needs decision:

problame commented 2 months ago

Status update: validation mode enabled in pre-prod

### Pre-Prod Analysis

The first night's prod-like cloudbench run had concurrent activity from another benchmark, which smeared the results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199

However, here's the list of dashboards I looked at:

Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens):

Screenshot from the log scraping query, which I found quite insightful:

(screenshot: log scraping query results)

### Can we enable it in prod?

What's the practical impact? Compactions that are 2x slower in wall-clock time double the time each compaction holds a permit of the global compaction semaphore (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other words, we would only achieve half the usual compaction throughput.

So, is prod compaction throughput bottlenecked on the global semaphore?

We can use the following query to approximate the busyness of the semaphore (percentage of tenants waiting for a permit):

```promql
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) pageserver_tenant_states_count{state="Active"}
```

There are some places where we have sampling skew, so we clamp:

```promql
clamp(
  (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
  /on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
```

(plot: clamped semaphore busyness per instance)

The p99.9 instance in that plot looks like this:

```promql
quantile(0.999,
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    /on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```

(plot: p99.9 of clamped semaphore busyness)

The average looks like this:

```promql
avg(
  clamp(
    (pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
    /on(instance) sum by (instance) (pageserver_tenant_states_count)
  , 0, 1)
)
```

(plot: average of clamped semaphore busyness)

problame commented 2 months ago

For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849

problame commented 2 months ago

Decision from today's sync meeting:

  1. https://github.com/neondatabase/infra/pull/1745
  2. Create metric to measure semaphore contention.
  3. Table decision for remaining regions until EOW / next week.
problame commented 2 months ago

This week, as per discussion thread:

problame commented 2 months ago

Results from pre-prod are looking good.

(screenshot: pre-prod results)

problame commented 2 months ago

Plan:

problame commented 2 months ago

Results from the rollout were shared in this Slack thread.

tl;dr:

```promql
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)
```

(Note: the `/ 1e6` scaling must sit outside the `rate()` call; `rate()` only accepts a range vector selector.)