Did some initial scouting work on this:
The holes functionality of compaction (introduced in https://github.com/neondatabase/neon/pull/3597) requires scanning all keys before scanning all values
Status update:
Plan / needs decision:
Status update: validation mode enabled in pre-prod
First night's prodlike cloudbench run had concurrent activity from another benchmark, smearing results: https://neondb.slack.com/archives/C06K38EB05D/p1723797560693199
However, here's the list of dashboards I looked at:
Preliminary interpretation (compare the time range from 0:00 to 8:00; that's where the load happens)
Screenshot from the log scraping query, which I found quite insightful
What's the practical impact? Compactions that are 2x slower in wall-clock time mean double the wait time on the global semaphore for compactions (assuming that semaphore is the practical throughput bottleneck, which I believe is the case). In other words, we only achieve half the usual compaction throughput.
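To spell out the arithmetic (my own framing, assuming the semaphore stays saturated, i.e. every permit is always held by some tenant's compaction):

$$ \text{compaction throughput} \approx \frac{N_{\text{permits}}}{T_{\text{compaction}}} $$

so at a fixed number of permits, doubling the wall-clock time per compaction halves the number of compactions we can complete per unit time.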
So, is prod compaction throughput bottlenecked on the global semaphore?
We can use the following query to approximate the busyness of the semaphore (the percentage of tenants currently waiting for a permit; wait_start_count minus wait_finish_count is the number of current waiters):
(pageserver_background_loop_semaphore_wait_start_count{instance="pageserver-8.eu-west-1.aws.neon.build",task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) pageserver_tenant_states_count{state="Active"}
There are some places where we have sampling skew, so we clamp:
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
The p99.9 instance in that plot looks like this:
quantile(0.999,
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
The average looks like this:
avg(
clamp(
(pageserver_background_loop_semaphore_wait_start_count{task="compaction"} - pageserver_background_loop_semaphore_wait_finish_count)
/on(instance) sum by (instance) (pageserver_tenant_states_count)
, 0, 1)
)
For posterity, there was a Slack thread discussing these results / next steps: https://neondb.slack.com/archives/C033RQ5SPDH/p1723810312846849
Decision from today's sync meeting:
This week, as per discussion thread:
Results from pre-prod are looking good.
Plan:
Results from rollout shared in this Slack thread
tl;dr:
sum by (neon_region) (rate(pageserver_storage_operations_seconds_global_sum{operation="compact",neon_region=~"$neon_region"}[$__rate_interval]))
/
sum by (neon_region) (rate(pageserver_wal_ingest_bytes_received[$__rate_interval]) / 1e6)
The compact_level0_phase1 currently uses ValueRef::load here, which internally uses read_blob with the FileBlockReader against the delta layer's VirtualFiles. This still goes through the PageCache for the data pages. (We do use vectored get for create_image_layers, which also happens during compaction. But I missed the compact_level0_phase1.)
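For illustration, a heavily simplified sketch of that read path (all types and signatures below are stand-ins I made up, not the actual pageserver API); the point is that every blob read is funneled through page-sized reads that go via the shared PageCache:

```rust
const PAGE_SZ: u64 = 8192;

/// Stand-in for the shared, process-wide PageCache.
struct PageCache;

impl PageCache {
    /// Return one 8KiB page; on a cache miss, read the whole page from the file.
    fn read_page(&self, file: &std::fs::File, blk: u64) -> std::io::Result<Vec<u8>> {
        // (cache lookup elided) -- on a miss, read the full page:
        use std::os::unix::fs::FileExt;
        let mut page = vec![0u8; PAGE_SZ as usize];
        file.read_exact_at(&mut page, blk * PAGE_SZ)?;
        Ok(page)
    }
}

/// Stand-in for ValueRef::load -> read_blob -> FileBlockReader:
/// even though we only want one blob, the read goes through a
/// page-sized read that is served via the shared PageCache.
fn load_value(file: &std::fs::File, cache: &PageCache, blob_offset: u64) -> std::io::Result<Vec<u8>> {
    let blk = blob_offset / PAGE_SZ;
    let page = cache.read_page(file, blk)?;
    let off_in_page = (blob_offset % PAGE_SZ) as usize;
    // (blob header parsing elided) -- return the page tail starting at the blob
    Ok(page[off_in_page..].to_vec())
}
```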
Complete PageCache Bypass
We can extend the load_keys step here to also load the lengths of each blob into memory (instead of just the offset): https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L498-L503
This allows us to go directly to the VirtualFile when we use the ValueRef here: https://github.com/neondatabase/neon/blob/9b98823d615c991422b6edd3ec3197192f763cf2/pageserver/src/tenant/timeline/compaction.rs#L623
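A minimal sketch of that bypass, using std::fs::File and read_exact_at as stand-ins for the pageserver's VirtualFile API (BlobRef and load_value_direct are hypothetical names for this sketch): once (offset, len) is known up front, a single exact-sized read suffices and the PageCache is never touched.

```rust
use std::os::unix::fs::FileExt;

/// Stand-in for the per-blob metadata load_keys would collect.
/// The real code stores just the offset; the idea is to also record the length.
struct BlobRef {
    offset: u64,
    len: usize,
}

/// Read the value directly from the layer file: with (offset, len) known,
/// issue one exact-sized read instead of going through page-sized,
/// PageCache-mediated reads.
fn load_value_direct(file: &std::fs::File, blob: &BlobRef) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; blob.len];
    file.read_exact_at(&mut buf, blob.offset)?;
    Ok(buf)
}
```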
The problem with this: we'd lose the hypothetical benefits of PageCache'ing the data block if multiple ValueRefs are on the same page.
Do we rely on the PageCache for performance in this case?
Yes, production shows we have a >80% hit rate for compaction, even on very busy pageservers. One instance as an example:
Quick Fix 1: RequestContext-scoped mini page cache.
In earlier experiments, I used a RequestContext-scoped mini page cache for this.
The problem with this: if more layers need to be compacted than there are pages in the mini page cache, it will start thrashing.
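A rough sketch of what such a RequestContext-scoped mini cache could look like (hypothetical names, simple FIFO eviction just for illustration):

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical sketch of a RequestContext-scoped mini page cache:
/// a small, fixed-capacity block cache owned by one compaction request,
/// so it never competes with other tenants for the global PageCache.
struct MiniPageCache {
    capacity: usize,
    blocks: HashMap<(u64, u32), Vec<u8>>, // (file id, block number) -> block bytes
    order: VecDeque<(u64, u32)>,          // insertion order, for simple FIFO eviction
}

impl MiniPageCache {
    fn new(capacity: usize) -> Self {
        Self { capacity, blocks: HashMap::new(), order: VecDeque::new() }
    }

    /// Return the cached block, or read it via `read` and cache it.
    /// If the working set (roughly one hot block per input layer) exceeds
    /// `capacity`, entries get evicted and re-read constantly -- the
    /// thrashing described above.
    fn get_or_read(
        &mut self,
        key: (u64, u32),
        read: impl FnOnce() -> std::io::Result<Vec<u8>>,
    ) -> std::io::Result<&[u8]> {
        if !self.blocks.contains_key(&key) {
            if self.blocks.len() >= self.capacity {
                if let Some(old) = self.order.pop_front() {
                    self.blocks.remove(&old);
                }
            }
            self.blocks.insert(key, read()?);
            self.order.push_back(key);
        }
        Ok(self.blocks.get(&key).unwrap().as_slice())
    }
}
```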
Proper Fix
Use streaming compaction with iterators where each iterator caches the current block.
We do have the diskbtree async stream now.
We could wrap that stream to provide a cache for the last-read block.
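A minimal synchronous sketch of that wrapper idea (the real thing would wrap the async disk-btree stream; the iterator shape, item type, and read_block callback below are made up for illustration):

```rust
/// Wraps an iterator of (block number, offset-in-block) entries and remembers
/// the last block it read, so consecutive entries on the same block are served
/// without another read and without the shared PageCache.
struct BlockCachingIter<I, F>
where
    I: Iterator<Item = (u32, usize)>,
    F: FnMut(u32) -> Vec<u8>, // reads one block from the layer's VirtualFile
{
    inner: I,
    read_block: F,
    last: Option<(u32, Vec<u8>)>, // last-read block number and its bytes
}

impl<I, F> Iterator for BlockCachingIter<I, F>
where
    I: Iterator<Item = (u32, usize)>,
    F: FnMut(u32) -> Vec<u8>,
{
    type Item = Vec<u8>; // value bytes for one entry

    fn next(&mut self) -> Option<Self::Item> {
        let (blk, off) = self.inner.next()?;
        // Re-read only if this entry lives on a different block than the last one.
        if self.last.as_ref().map(|(b, _)| *b) != Some(blk) {
            self.last = Some((blk, (self.read_block)(blk)));
        }
        let (_, block) = self.last.as_ref().unwrap();
        // For simplicity, return the block tail from `off`; real code would
        // parse the blob that starts there.
        Some(block[off..].to_vec())
    }
}
```

Since compaction reads values in key/LSN order, consecutive entries often sit on the same block, which is presumably where the >80% PageCache hit rate observed above comes from; a one-block cache per iterator should capture most of that locality.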