# Problem
The timeout-based batching adds latency to unbatchable workloads.
We can choose a short batching timeout (e.g. 10us) but that requires high-resolution timers, which tokio doesn't have. I thoroughly explored options to use OS timers (see this abandoned PR). In short, it's not an attractive option because any timer implementation adds non-trivial overheads.
# Solution

The insight is that, in the steady state of a batchable workload, the time we spend in `get_vectored` will be hundreds of microseconds anyway. If we prepare the next batch concurrently with `get_vectored`, we will have a sizeable batch ready once the `get_vectored` of the current batch is done, and we do not need an explicit timeout.

This can reasonably be described as pipelining of the protocol handler.
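To illustrate the idea (this is not the pageserver code; `Request`, the channel size, and the 500µs sleep are made up), here is a minimal two-stage sketch in which the next batch accumulates while the current one is being executed:

```rust
// Minimal illustration, not the pageserver code: while the executor awaits the
// (simulated) get_vectored of batch N, the reader keeps filling the channel,
// so batch N+1 is already sizeable when the executor comes back -- no timeout.
// Requires: tokio = { version = "1", features = ["full"] }
use tokio::sync::mpsc;

#[derive(Debug)]
struct Request(u64); // stand-in for a getpage request

async fn simulated_get_vectored(batch: Vec<Request>) {
    // Stand-in for the real get_vectored: takes hundreds of microseconds.
    println!("executing batch of {} requests, last = {:?}", batch.len(), batch.last());
    tokio::time::sleep(std::time::Duration::from_micros(500)).await;
}

#[tokio::main]
async fn main() {
    let (tx, mut rx) = mpsc::channel::<Request>(64);

    // "Reading" stage: produce requests (here: synthetically).
    let reader = tokio::spawn(async move {
        for i in 0..1000u64 {
            tx.send(Request(i)).await.unwrap();
        }
    });

    // Batching + Execution collapsed into one loop for the sketch: drain
    // whatever is currently queued into one batch, then execute it.
    let executor = tokio::spawn(async move {
        while let Some(first) = rx.recv().await {
            let mut batch = vec![first];
            while let Ok(req) = rx.try_recv() {
                batch.push(req);
            }
            simulated_get_vectored(batch).await;
        }
    });

    reader.await.unwrap();
    executor.await.unwrap();
}
```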
# Implementation

We model the sub-protocol handler for pagestream requests (`handle_pagerequests`) as three futures that form a pipeline:

1. Reading: read requests from the connection (`pgb`).
2. Batching: fill the current batch.
3. Execution: take the current batch, execute it using `get_vectored`, and send the response.

The Reading and Batching stages are connected through an `mpsc` channel. The Batching and Execution stages use a quirky construct to coordinate:

- an `Arc<std::sync::Mutex<Option<Box<BatchedFeMessage>>>>` that represents the current batch,
- a `watch` channel around it to notify Execution about new data,
- a `Notify` to notify Batching about data consumed.

This construct allows the Execution stage to, at any time, steal the current batch from Batching using `lock().unwrap().take()`.
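A hedged, self-contained sketch of that construct (simplified types, illustrative `MAX_BATCH_SIZE` and page values; the real pageserver code differs in the details):

```rust
// Hedged sketch of the coordination construct, with simplified types and an
// illustrative MAX_BATCH_SIZE; the real pageserver code differs in the details.
// Requires: tokio = { version = "1", features = ["full"] }
use std::sync::{Arc, Mutex};
use tokio::sync::{watch, Notify};

#[derive(Debug)]
struct BatchedFeMessage {
    pages: Vec<u64>, // simplified: just page numbers
}

const MAX_BATCH_SIZE: usize = 4; // illustrative

#[tokio::main]
async fn main() {
    // The current batch, shared between the Batching and Execution stages.
    let slot: Arc<Mutex<Option<Box<BatchedFeMessage>>>> = Arc::new(Mutex::new(None));
    // watch: Batching -> Execution, "there is new data in the slot".
    let (data_tx, mut data_rx) = watch::channel(());
    // Notify: Execution -> Batching, "the slot was consumed".
    let consumed = Arc::new(Notify::new());

    // Batching stage: append each incoming page request to the pending batch,
    // or start a new batch if the slot is empty; wait if the batch is full.
    let batching = {
        let slot = Arc::clone(&slot);
        let consumed = Arc::clone(&consumed);
        tokio::spawn(async move {
            for page in 0..16u64 {
                loop {
                    let appended = {
                        let mut guard = slot.lock().unwrap();
                        match guard.as_mut() {
                            Some(batch) if batch.pages.len() >= MAX_BATCH_SIZE => false,
                            Some(batch) => {
                                batch.pages.push(page);
                                true
                            }
                            None => {
                                *guard = Some(Box::new(BatchedFeMessage { pages: vec![page] }));
                                true
                            }
                        }
                    };
                    if appended {
                        let _ = data_tx.send(()); // wake Execution
                        break;
                    }
                    consumed.notified().await; // batch full: wait until it is stolen
                }
            }
            // data_tx dropped here; Execution observes the closed watch and exits.
        })
    };

    // Execution stage: steal the current batch whenever there is one.
    let execution = {
        let slot = Arc::clone(&slot);
        let consumed = Arc::clone(&consumed);
        tokio::spawn(async move {
            loop {
                let stolen = slot.lock().unwrap().take();
                if let Some(batch) = stolen {
                    consumed.notify_one(); // let Batching start the next batch
                    // Stand-in for get_vectored + sending the responses.
                    println!("executing batch: {:?}", batch.pages);
                    continue;
                }
                // Nothing pending: wait for Batching to put data into the slot.
                if data_rx.changed().await.is_err() {
                    break; // Batching is gone and the slot is empty
                }
            }
        })
    };

    batching.await.unwrap();
    execution.await.unwrap();
}
```

The point of a mutex-guarded slot rather than a plain channel is that Batching can keep growing the pending batch in place until Execution steals it.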
# Changes
- Rewrite `handle_pagerequests` as the three-stage pipeline described above.
- A helper that builds a `BatchedFeMessage` with just one page request in it.
- A helper that merges an incoming `BatchedFeMessage` into an existing `BatchedFeMessage`; it returns `None` on success and returns back the incoming message in case merging isn't possible.
- Remove the `batch_timeout` parametrization.
- Rename `test_getpage_merge_smoke` to `test_throughput` and `test_timer_precision` to `test_latency` in `test_page_service_batching.py`.
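For illustration, a sketch of the shape these two helpers could take (names, fields, and the size limit are hypothetical, not the actual pageserver signatures):

```rust
// Hedged sketch of the shape of these helpers; names, fields, and the size
// limit are illustrative, not the actual pageserver signatures.
#[derive(Debug)]
struct PageRequest {
    rel: u32,
    blkno: u32,
}

#[derive(Debug)]
struct BatchedFeMessage {
    pages: Vec<PageRequest>,
}

const MAX_BATCH_SIZE: usize = 32; // illustrative limit

/// Wrap a single page request into a batch containing just that request.
fn single_request_batch(req: PageRequest) -> BatchedFeMessage {
    BatchedFeMessage { pages: vec![req] }
}

/// Try to merge `incoming` into `existing`. Returns `None` on success and
/// returns the incoming message back if merging isn't possible (here: the
/// combined batch would exceed the maximum size).
fn try_merge(
    existing: &mut BatchedFeMessage,
    incoming: BatchedFeMessage,
) -> Option<BatchedFeMessage> {
    if existing.pages.len() + incoming.pages.len() > MAX_BATCH_SIZE {
        return Some(incoming);
    }
    existing.pages.extend(incoming.pages);
    None
}

fn main() {
    let mut batch = single_request_batch(PageRequest { rel: 1, blkno: 0 });
    let rejected = try_merge(&mut batch, single_request_batch(PageRequest { rel: 1, blkno: 1 }));
    assert!(rejected.is_none());
    println!("batch now holds {} requests", batch.pages.len());
}
```

Handing the message back on failure (rather than dropping it) lets the Batching stage turn it into the start of the next batch.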
# On holding the `TimelineHandle` in the pending batch

While batching, we hold the `TimelineHandle` in the pending batch. Therefore, the timeline will not finish shutting down while we're batching.

This is not a problem: once the timeline starts shutting down, the `get_vectored` call will fail with an error indicating that the timeline is shutting down. This results in the Execution stage returning a `QueryError::Shutdown`, which causes the pipeline / entire page service connection to shut down. This drops all references to the `Arc<Mutex<Option<Box<BatchedFeMessage>>>>` object, thereby dropping the contained `TimelineHandle`s.
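A toy sketch of that error flow (the names and the boolean flag are made up; only the role of `QueryError::Shutdown` comes from the description above):

```rust
// Hedged sketch of the shutdown path; everything except the role of
// QueryError::Shutdown is illustrative, not the actual pageserver error types.
#[derive(Debug)]
enum QueryError {
    Shutdown,
    Other(String),
}

// Stand-in for get_vectored failing because the timeline is shutting down.
fn get_vectored(timeline_shutting_down: bool) -> Result<Vec<u8>, QueryError> {
    if timeline_shutting_down {
        Err(QueryError::Shutdown)
    } else {
        Ok(vec![0u8; 8192])
    }
}

// The Execution stage just propagates the error upward with `?` ...
fn execution_stage() -> Result<(), QueryError> {
    let _page = get_vectored(true)?;
    Ok(())
}

fn main() {
    // ... and the connection handler reacts by shutting the pipeline down,
    // which drops the shared batch slot and the TimelineHandles inside it.
    match execution_stage() {
        Err(QueryError::Shutdown) => println!("shutting down page service connection"),
        other => println!("unexpected: {:?}", other),
    }
}
```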
# Performance
Local run of the benchmarks; results are in this empty commit in the PR branch.
Use commands like this to compare a particular metric in different configurations.
Key take-aways:

- `concurrent-futures` delivers a higher `batching_factor` than `tasks`.
- `concurrent-futures` has lower CPU usage.
- The `time` metric is better with `concurrent-futures`, except in the case of the unbatchable workload with max batch size 1; in that case, `tasks` is 6% better but consumes more CPU time for the same work (… (`concurrent-futures`) => 127us (`task`)).
- `concurrent-futures` is consistently slightly better than `tasks`; the difference is negligible.

# Refs