Closed: twmb closed this issue 1 year ago
We do have a limit on the number of concurrent readers, but this is designed to accommodate wide parallel reads.
In Redpanda >=23.1, tiered storage disk I/O runs at a lower priority than Raft disk I/O. This should help somewhat, although it was done as a defensive measure and has not been tested empirically.
In Redpanda >=23.2, tiered storage reads work at a finer granularity and only promote 16 MiB "chunks" of data, so a read no longer promotes an entire segment at once.
The trouble with throttling by bandwidth is the impact on consumers: for a ListOffsets request to complete within a 5 second timeout on a topic with 100 partitions and 16 MiB chunks, we need to be able to promote 320 MiB/s, and that requirement scales linearly with the partition count in the query.
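For concreteness, the requirement above can be reproduced with a quick back-of-the-envelope calculation; nothing below is Redpanda code, and the chunk size and timeout are just the values from this comment:

```cpp
// Back-of-the-envelope only: hydration bandwidth needed for a ListOffsets
// over N partitions to finish within the timeout, assuming one 16 MiB chunk
// must be promoted per partition.
#include <cstdio>

int main() {
    const double chunk_mib = 16.0; // chunk size promoted per partition
    const double timeout_s = 5.0;  // client-side request timeout
    const int partition_counts[] = {100, 500, 1000};
    for (int partitions : partition_counts) {
        double required_mib_per_s = partitions * chunk_mib / timeout_s;
        std::printf("%4d partitions -> %4.0f MiB/s\n", partitions, required_mib_per_s);
    }
    return 0;
}
```

At 100 partitions this gives the 320 MiB/s figure above; at 1000 partitions a fixed per-broker cap would have to allow 3.2 GiB/s to meet the same timeout.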
I think we have to be a bit more creative:
In some recent OMB runs it was noticed that cloud storage reads impacted the performance of the Raft append path. In brief, the cloud storage read path consumed too much disk I/O, which resulted in high produce latency and leadership instability as appends started timing out.
In practice, a broker can hydrate a lot of data from cloud storage very quickly with the default configuration. All of this hydrated data needs to land on disk first, so there's a fair bit of write amplification. One solution is to switch to streaming reads when we detect that the local disk is at capacity (these streaming reads should not touch the disk). To determine whether the disk is at capacity, we'd have to expose some stats from within Seastar's disk scheduler. Regardless of the approach we take, it needs to be adaptive to some degree: as John points out in the comment above, a "simple" bandwidth restriction runs into issues with wide reads.
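As a rough illustration of the adaptive approach described above, here is a hypothetical sketch of the per-read decision: hydrate to disk normally, but fall back to streaming straight from cloud storage when a disk-backlog signal (which would have to come from Seastar's disk scheduler) says the disk is at capacity. All names, stats, and thresholds below are invented for illustration; they are not Redpanda or Seastar APIs.

```cpp
// Hypothetical sketch, not the actual implementation.
enum class read_mode { hydrate_to_disk, stream_from_cloud };

struct disk_scheduler_stats {
    double write_queue_depth;   // pending write requests (assumed stat)
    double write_bytes_per_sec; // recent write throughput (assumed stat)
};

read_mode choose_read_mode(
  const disk_scheduler_stats& stats,
  double queue_depth_limit,
  double write_bw_limit) {
    // If the disk is already saturated with writes (Raft appends plus the
    // write amplification from hydration), avoid adding more write load and
    // serve the consumer directly from the network instead.
    if (stats.write_queue_depth > queue_depth_limit
        || stats.write_bytes_per_sec > write_bw_limit) {
        return read_mode::stream_from_cloud;
    }
    return read_mode::hydrate_to_disk;
}
```

Because the decision is made per read against live disk stats rather than against a static bandwidth number, wide reads are not penalized when the disk has headroom.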
We found that the tiered storage impact on producer latency is not high. The high impact we saw previously was caused by a bug in OMB.
Using the branch that contained the fix improved producer stability. Once the producer crashes were fixed, subsequent tests were able to run for long periods of time with produce and consume rates and latency stable throughout the test duration.
With a workload of 1,200,000 msg/sec and 128 producers, the typical produce and consume rates were 800 to 900 MiB/s and 1600 to 1700 MiB/s with the fixed OMB version. Throughout the runs, write ops were stable at around 200K ops/sec. The p95 produce latency was close to 900 ms while reading from tiered storage.
The full document is in our internal wiki: Improving system performance when reading from cloud storage. The "wide" aspect of this problem is handled in a different issue, so I'm closing this one.
This issue is addressed in this merged PR. The change will be released in v23.3 and will not be back-ported, due to the behavioral changes it introduces.
There is still impact on produce latency when reading historical data regardless of whether tiered storage is in use. See https://github.com/redpanda-data/core-internal/issues/896 and linked Slack thread.
Who is this for and what problem do they have today?
When a client is consuming the latest data, the consumer only reads as fast as the producer produces. When a client wants to read historical data that has already been evicted to tiered storage, Redpanda reads in segments as fast as the network and disk allow, as long as the consumer keeps up. Because Redpanda currently has no throttling on tiered storage rehydration, it can rehydrate so quickly that it enters a badly overutilized state.
What are the success criteria?
With a configuration setting, we should be able to ensure that Redpanda does not consume more than X MB/s from TS.
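A minimal sketch of what such a cap could look like, assuming a token-bucket limiter sized by the configured MB/s value; the class and its use are illustrative only, not Redpanda's actual configuration or implementation:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Illustrative token bucket: limits how many bytes per second may be
// hydrated from tiered storage.
class hydration_rate_limiter {
    using clock = std::chrono::steady_clock;

public:
    explicit hydration_rate_limiter(uint64_t max_bytes_per_sec)
      : _rate(max_bytes_per_sec)
      , _tokens(static_cast<double>(max_bytes_per_sec))
      , _last(clock::now()) {}

    // Returns true if `bytes` may be downloaded now. Tokens refill with
    // elapsed time and are capped at one second's worth of budget.
    bool try_acquire(uint64_t bytes) {
        auto now = clock::now();
        double elapsed = std::chrono::duration<double>(now - _last).count();
        _last = now;
        _tokens = std::min(static_cast<double>(_rate), _tokens + elapsed * _rate);
        if (_tokens >= static_cast<double>(bytes)) {
            _tokens -= static_cast<double>(bytes);
            return true;
        }
        return false; // caller delays the download (or falls back to streaming)
    }

private:
    uint64_t _rate;
    double _tokens;
    clock::time_point _last;
};
```

Consulting a limiter like this before every chunk download enforces the X MB/s ceiling, but, as discussed earlier in the thread, delaying downloads is exactly what hurts wide reads, so the cap would likely need to be adaptive rather than fixed.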
Why is solving this problem impactful?
Solving it lets us better protect the cluster and keep it healthy.