redpanda-data / redpanda


Make it possible to set `append_chunk_size` on a per topic basis #3776

Open rkruze opened 2 years ago

rkruze commented 2 years ago

Who is this for, and what problem do they have today?

The default chunk size for Redpanda is 16 KB. This means that for every IO performed by Redpanda, 16 KB of data is written to disk. This default works very well for low-latency cases where the hardware is not constrained by available IOPS.

However, if IOPS are constrained and latency is not a top priority, a larger chunk size can result in higher throughput. For example, suppose we have a machine with 16k IOPS available (the maximum per GCP instance using pd-standard disks):

16k IOPS × 16 KB = 256 MB/s
16k IOPS × 32 KB = 512 MB/s
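For concreteness, the arithmetic here is just IOPS budget × chunk size = throughput ceiling. A trivial sketch of that relationship (the 16k figure is the GCP pd-standard example above, not anything Redpanda-specific):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // IOPS budget from the example above (GCP pd-standard instance cap).
    constexpr std::uint64_t iops = 16'000;

    // Throughput ceiling when every IO carries a full chunk: IOPS * chunk size.
    for (std::uint64_t chunk_kb : {16, 32, 64}) {
        std::uint64_t mb_per_s = iops * chunk_kb / 1'000; // KB -> MB (decimal units)
        std::printf("%2llu KB chunks -> %4llu MB/s\n",
                    static_cast<unsigned long long>(chunk_kb),
                    static_cast<unsigned long long>(mb_per_s));
    }
    return 0;
}
```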

What are the success criteria?

Users can set `append_chunk_size` per topic via rpk.

Why is solving this problem impactful?

With the current config option, you must set this at the node level, which affects all topics/partitions on the cluster. Ideally, this would be set on a per-topic basis, since you may have some topics where low latency is critical and others that are focused on higher throughput. It would also make testing different options much more straightforward than changing the redpanda configuration and restarting the nodes.

Additional notes

JIRA Link: CORE-842

emaxerrno commented 2 years ago

I think this should be dynamic, and the engine should optimize it for the user. I think minimizing IO is a good objective. Static chunk sizes are, I think, too hard to manage. The system needs some control-loop-style logic that learns on a per-file-handle basis (segment_appender.cc).
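Not Redpanda code, but roughly the shape of the control-loop idea: grow the chunk size while IOs back up behind the appender, shrink it when the backlog clears. Every name and threshold below is made up for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical per-file-handle controller sketching the control-loop idea:
// grow the chunk size while writes queue up behind the appender, shrink it
// when the backlog drains. None of these names exist in Redpanda.
class chunk_size_controller {
public:
    std::size_t current() const { return _chunk_size; }

    // Called once per completed write with the number of IOs that were
    // waiting behind it (a crude backpressure signal).
    void observe(std::size_t queued_ios) {
        if (queued_ios > high_watermark) {
            _chunk_size = std::min(_chunk_size * 2, max_chunk);
        } else if (queued_ios == 0) {
            _chunk_size = std::max(_chunk_size / 2, min_chunk);
        }
    }

private:
    static constexpr std::size_t min_chunk = 16 * 1024;  // current default
    static constexpr std::size_t max_chunk = 128 * 1024; // arbitrary cap
    static constexpr std::size_t high_watermark = 4;     // arbitrary threshold
    std::size_t _chunk_size = min_chunk;
};
```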

emaxerrno commented 2 years ago

cc: @travisdowns

travisdowns commented 2 years ago

Yeah, it does seem like a dynamic chunk size would subsume most of the "set per topic" cases: if you were throughput limited, the chunk size would rise naturally, and that would avoid the need for per-topic tuning, which I'd guess would have limited uptake anyway.

It is also worth noting that our default chunk size of 16K doesn't mean all our IOs will be 16K: many can be smaller. As I understand it, when a batch is produced to a topic the leader will generally flush immediately, so as little as 4096 bytes will be written if the batch size is small. To average close to 16K you need fairly large batch sizes so that the write for a single produce batch is either just above a multiple of 16K or large enough that the "waste" is small.
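To make the rounding concrete, here is a back-of-the-envelope model of the behaviour described above (my own sketch, not Redpanda code): a flushed batch goes out in up-to-16 KB chunk writes, with the tail write rounded up to the 4 KB block size, so small batches pay a full 4 KB IO and the relative "waste" shrinks as batches grow.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Sketch of the rounding described above: up-to-16 KB chunk writes, with
    // the tail rounded up to the 4 KB block size (not actual Redpanda logic).
    constexpr std::uint64_t chunk = 16 * 1024;
    constexpr std::uint64_t block = 4 * 1024;

    for (std::uint64_t batch : {512ULL, 4'000ULL, 20'000ULL, 131'072ULL}) {
        std::uint64_t full_chunks = batch / chunk;
        std::uint64_t tail = batch % chunk;
        std::uint64_t tail_io = (tail + block - 1) / block * block; // round up
        std::uint64_t written = full_chunks * chunk + tail_io;
        std::printf("batch %7llu B -> %7llu B written (%.0f%% overhead)\n",
                    static_cast<unsigned long long>(batch),
                    static_cast<unsigned long long>(written),
                    100.0 * static_cast<double>(written - batch)
                          / static_cast<double>(batch));
    }
    return 0;
}
```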

rkruze commented 2 years ago

> It is also worth noting that our default chunk size of 16K doesn't mean all our IOs will be 16K: many can be smaller. As I understand it, when a batch is produced to a topic the leader will generally flush immediately, so as little as 4096 bytes will be written if the batch size is small. To average close to 16K you need fairly large batch sizes so that the write for a single produce batch is either just above a multiple of 16K or large enough that the "waste" is small.

Fully understood. Our batch sizes are often around 128 KB and can go up to 1 MB, so it would be pretty rare to see smaller batch sizes.

jcsp commented 2 years ago

> when a batch is produced to a topic the leader will generally flush immediately, so as little as 4096 bytes will be written if the batch size is small

Yep -- the timeout for writing smaller than the chunk size is 1ms by default (segment_appender_flush_timeout_ms). I'm not sure our timeout logic is exactly right; we might be issuing more smaller-than-chunk IOs than needed. But for a small, intermittent workload, we'll be issuing these smaller-than-chunk IOs frequently.

On dynamic adjustment of chunk size in segment_appender: I don't think this will work when thousands of segments are competing for the same drive's performance. Measuring what a "good" chunk size is at that locality will be hard or impossible. The individual segment can't distinguish between "IOs blocked because my chunk size is too small" and "IOs blocked because the drive is at bandwidth saturation" -- you have to take the whole-drive view for that.

Anyway, I'm not sure that the append size is the right setting to expose to users for their per-topic latency control. Here's an alternative suggestion:

travisdowns commented 2 years ago

> Yep -- the timeout for writing smaller than the chunk size is 1ms by default (segment_appender_flush_timeout_ms). I'm not sure our timeout logic is exactly right; we might be issuing more smaller-than-chunk IOs than needed. But for a small, intermittent workload, we'll be issuing these smaller-than-chunk IOs frequently.

I think it is 1000ms, not 1ms? See here (note the s in 1s). I was confused by this because empirically the flushes happen almost immediately if you trickle in requests.

After asking around, this is because acks=-1 will trigger an immediate flush, regardless of the segment appender flush timeout. So most acks=-1 requests will probably go to disk by themselves, at least on the leader. That's why typical latency is very good on an unloaded system, on the order of 1 ms, versus the 500 ms you'd expect if this flush interval was the main driver of the flush frequency.

Larger writes come from acks=1 or acks=0 produce requests, or batches that are large enough to exceed 4K in one shot, and possibly all writes on replicas (I am not sure if replicas also flush immediately with acks=-1).

Given this, I think a lot of the possible benefit for acks=-1, which I think is the most important case, can be obtained by the much simpler approach of just writing out big batches in larger chunks. E.g., if we get a 100K batch, don't chunk it into 16K chunks, but rather write it out in bigger chunks. There's no latency/bandwidth tradeoff here really because the entire write is available, so no need to guess about "linger time" etc.
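A minimal sketch of that suggestion (hypothetical code, not the segment_appender implementation): when the whole batch is already in hand, split it by a larger maximum IO size instead of the fixed 16 KB chunk size, since there is nothing to wait for.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the idea above, not segment_appender code: when an
// entire batch is available up front, cut it into large IOs instead of fixed
// 16 KB chunks. Returns the sizes of the writes that would be issued.
std::vector<std::size_t> plan_writes(std::size_t batch_size,
                                     std::size_t max_io = 128 * 1024) {
    std::vector<std::size_t> ios;
    while (batch_size > 0) {
        std::size_t io = std::min(batch_size, max_io);
        ios.push_back(io);
        batch_size -= io;
    }
    return ios;
}

// e.g. plan_writes(100 * 1024) yields a single 100 KB write instead of six
// full 16 KB chunks plus a tail, with no linger-time guesswork needed.
```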

dotnwat commented 2 years ago

> I think it is 1000ms, not 1ms? See here (note the s in 1s). I was confused by this because empirically the flushes happen almost immediately if you trickle in requests.

Yes, I think the name of the timeout is misleading. The purpose of the timeout is to flush the write buffer to disk if the appender stops receiving IOs (e.g. the client workload went idle). Note that flush in this case isn't referring to an fsync; it merely writes the data into the file to make it visible to readers.
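A rough illustration of those semantics (a hedged sketch with made-up names, not the actual appender): the timer only matters when appends stop arriving, and "flush" here means handing buffered bytes to the file, not an fsync.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Made-up sketch of the idle-flush behaviour described above, not the real
// appender: if no append arrives within the timeout, buffered bytes are
// written into the file so readers can see them. No fsync is involved;
// durability is handled separately.
class idle_flush_buffer {
    using clock = std::chrono::steady_clock;

public:
    explicit idle_flush_buffer(std::chrono::milliseconds timeout)
      : _timeout(timeout) {}

    void append(const char* data, std::size_t len) {
        _buf.insert(_buf.end(), data, data + len);
        _last_append = clock::now();
    }

    // Called periodically (e.g. from a timer): write out the buffer only if
    // the workload has gone idle for longer than the timeout.
    void maybe_flush() {
        if (!_buf.empty() && clock::now() - _last_append >= _timeout) {
            write_to_file(_buf); // a plain write(); data becomes visible to readers
            _buf.clear();
        }
    }

private:
    void write_to_file(const std::vector<char>&) { /* stand-in for the real file IO */ }

    std::chrono::milliseconds _timeout;
    clock::time_point _last_append = clock::now();
    std::vector<char> _buf;
};
```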