Did some initial scouting work on this:
The holes functionality of compaction (introduced in https://github.com/neondatabase/neon/pull/3597) requires scanning all keys before scanning all values
Status update:
Plan / needs decision:
Status update: validation mode enabled in pre-prod
First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199
However, here's the list of dashboards I looked at:
Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens)
Screenshot from the log scraping query, which I found quite insightful
What's the practical impact? Compactions that are 2x slower in wall-clock time mean double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other words, we only achieve half the usual compaction throughput.
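To spell out the arithmetic (my own framing, assuming the semaphore stays saturated, i.e. every permit is always held by some tenant's compaction):

$$ \text{compaction throughput} \approx \frac{N_{\text{permits}}}{T_{\text{compaction}}} $$

so at a fixed number of permits, doubling the wall-clock time per compaction halves the number of compactions we can complete per unit time.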
So, is prod compaction throughput bottlenecked on the global semaphore?
We can use the following query to approximate the busyness of the semaphore (the percentage of tenants currently waiting for a permit; wait_start_count minus wait_finish_count is the number of current waiters):
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) pageserver_tenant_states_count{state="Active"}
There are some places where we have sampling skew, so we clamp:
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
The p99.9 instance in that plot looks like this:
quantile(0.999,
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
The average looks like this:
avg(
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849
Decision from today's sync meeting:
This week, as per discussion thread:
Results from pre-prod are looking good.
Plan:
Results from rollout shared in this Slack thread
tl;dr:
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)
The compact_level0_phase1 currently uses ValueRef::load here, which internally uses read_blob with the FileBlockReader against the delta layer's VirtualFiles. This still goes through the PageCache for the data pages. (We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)
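For illustration, a heavily simplified sketch of that read path (all types and signatures below are stand-ins I made up, not the actual pageserver API); the point is that every blob read is funneled through page-sized reads that go via the shared PageCache:

```rust
const PAGE_SZ: u64 = 8192;

/// Stand-in for the shared, process-wide PageCache.
struct PageCache;

impl PageCache {
    /// Return one 8KiB page; on a cache miss, read the whole page from the file.
    fn read_page(&self, file: &std::fs::File, blk: u64) -> std::io::Result<Vec<u8>> {
        // (cache lookup elided) -- on a miss, read the full page:
        use std::os::unix::fs::FileExt;
        let mut page = vec![0u8; PAGE_SZ as usize];
        file.read_exact_at(&mut page, blk * PAGE_SZ)?;
        Ok(page)
    }
}

/// Stand-in for ValueRef::load -> read_blob -> FileBlockReader:
/// even though we only want one blob, the read goes through a
/// page-sized read that is served via the shared PageCache.
fn load_value(file: &std::fs::File, cache: &PageCache, blob_offset: u64) -> std::io::Result<Vec<u8>> {
    let blk = blob_offset / PAGE_SZ;
    let page = cache.read_page(file, blk)?;
    let off_in_page = (blob_offset % PAGE_SZ) as usize;
    // (blob header parsing elided) -- return the page tail starting at the blob
    Ok(page[off_in_page..].to_vec())
}
```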
Complete PageCache Bypass
We can extend the load_keys step here to also load the lengths of each blob into memory (instead of just the offset): https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L498-L503
This allows us to go directly to the VirtualFile when we use the ValueRef here: https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L623
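A minimal sketch of that bypass, using std::fs::File and read_exact_at as stand-ins for the pageserver's VirtualFile API (BlobRef and load_value_direct are hypothetical names for this sketch): once (offset, len) is known up front, a single exact-sized read suffices and the PageCache is never touched.

```rust
use std::os::unix::fs::FileExt;

/// Stand-in for the per-blob metadata load_keys would collect.
/// The real code stores just the offset; the idea is to also record the length.
struct BlobRef {
    offset: u64,
    len: usize,
}

/// Read the value directly from the layer file: with (offset, len) known,
/// issue one exact-sized read instead of going through page-sized,
/// PageCache-mediated reads.
fn load_value_direct(file: &std::fs::File, blob: &BlobRef) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; blob.len];
    file.read_exact_at(&mut buf, blob.offset)?;
    Ok(buf)
}
```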
The problem with this: we'd lose the hypothetical benefits of PageCache'ing the data block if multiple ValueRefs are on the same page.
Do we rely on the PageCache for performance in this case?
Yes, production shows we have a >80% hit rate for compaction, even on very busy pageservers. One instance as an example:
Quick Fix 1: RequestContext-scoped mini page cache.
In earlier experiments, I used a RequestContext-scoped mini page cache for this.
The problem with this: if more layers need to be compacted than there are pages in the mini page cache, it will start thrashing.
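A rough sketch of what such a RequestContext-scoped mini cache could look like (hypothetical names, simple FIFO eviction just for illustration):

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical sketch of a RequestContext-scoped mini page cache:
/// a small, fixed-capacity block cache owned by one compaction request,
/// so it never competes with other tenants for the global PageCache.
struct MiniPageCache {
    capacity: usize,
    blocks: HashMap<(u64, u32), Vec<u8>>, // (file id, block number) -> block bytes
    order: VecDeque<(u64, u32)>,          // insertion order, for simple FIFO eviction
}

impl MiniPageCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new(), order: VecDeque::new() }
    }

    /// Return the cached block, or read it via `read` and cache it.
    /// If the working set (roughly one hot block per input layer) exceeds
    /// `capacity`, entries get evicted and re-read constantly -- the
    /// thrashing described above.
    fn get_or_read(
        &mut self,
        key: (u64, u32),
        read: impl FnOnce() -> std::io::Result<Vec<u8>>,
    ) -> std::io::Result<&[u8]> {
        if !self.blocks.contains_key(&key) {
            if self.blocks.len() >= self.capacity {
                if let Some(old) = self.order.pop_front() {
                    self.blocks.remove(&old);
                }
            }
            self.blocks.insert(key, read()?);
            self.order.push_back(key);
        }
        Ok(self.blocks.get(&key).unwrap().as_slice())
    }
}
```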
Proper Fix
Use streaming compaction with iterators where each iterator caches the current block.
We do have the diskbtree async stream now.
We could wrap that stream to provide a cache for the last-read block.
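A minimal synchronous sketch of that wrapper idea (the real thing would wrap the async disk-btree stream; the iterator shape, item type, and read_block callback below are made up for illustration):

```rust
/// Wraps an iterator of (block number, offset-in-block) entries and remembers
/// the last block it read, so consecutive entries on the same block are served
/// without another read and without the shared PageCache.
struct BlockCachingIter<I, F>
where
    I: Iterator<Item = (u32, usize)>,
    F: FnMut(u32) -> Vec<u8>, // reads one block from the layer's VirtualFile
{
    inner: I,
    read_block: F,
    last: Option<(u32, Vec<u8>)>, // last-read block number and its bytes
}

impl<I, F> Iterator for BlockCachingIter<I, F>
where
    I: Iterator<Item = (u32, usize)>,
    F: FnMut(u32) -> Vec<u8>,
{
    type Item = Vec<u8>; // value bytes for one entry

    fn next(&mut self) -> Option<Self::Item> {
        let (blk, off) = self.inner.next()?;
        // Re-read only if this entry lives on a different block than the last one.
        if self.last.as_ref().map(|(b, _)| *b) != Some(blk) {
            self.last = Some((blk, (self.read_block)(blk)));
        }
        let (_, block) = self.last.as_ref().unwrap();
        // For simplicity, return the block tail from `off`; real code would
        // parse the blob that starts there.
        Some(block[off..].to_vec())
    }
}
```

Since compaction reads values in key/LSN order, consecutive entries often sit on the same block, which is presumably where the >80% PageCache hit rate observed above comes from; a one-block cache per iterator should capture most of that locality.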