Tune zfs parameters for improved IO performance

faithanalog commented 3 months ago

We tune three zfs parameters:

zfs:zfs_vdev_aggregation_limit - set to 0 (disables aggregation)
zfs:zfs_sync_taskq_batch_pct - reduce to 5%
zfs:zio_taskq_batch_pct - reduce to 5%
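
For reference, the `module:variable` names above match the form used by `set` directives in illumos `/etc/system`. A minimal sketch of what such entries might look like follows. That helios applies the change through exactly these lines, rather than through some other file or mechanism, is an assumption; the values are the ones listed above.

```
* Sketch only; where these lines live in the helios image is an assumption.
* Disable vdev I/O aggregation.
set zfs:zfs_vdev_aggregation_limit = 0
* Reduce the sync and zio taskq batch percentages to 5%.
set zfs:zfs_sync_taskq_batch_pct = 5
set zfs:zio_taskq_batch_pct = 5
```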

Like previous zfs tuning, these are about bringing zfs parameters in line with the reality of a modern system like the gimlet, away from defaults that were good on a system with few CPU threads and high-latency spinning disks.

For crucible with a 4k rand write workload, we see an 18% reduction in CPU usage and a 15% increase in throughput. We see no significant change to our 4k rand read or our 4k sequential workloads.

The crucible 4k rand write workload is the most pathological workload we have for zfs on the rack today. Therefore, I do not expect this change to have a negative impact on cockroach/clickhouse (if anything, it might help them out). I have not verified this experimentally.

I built a helios image with this change and booted it on a bench gimlet. I confirmed that the parameters have been changed as we expect:

EVT22200004 # mdb -ke 'zfs_vdev_aggregation_limit/D'
zfs_vdev_aggregation_limit:
zfs_vdev_aggregation_limit:     0
EVT22200004 # mdb -ke 'zfs_sync_taskq_batch_pct/D'
zfs_sync_taskq_batch_pct:
zfs_sync_taskq_batch_pct:       5
EVT22200004 # mdb -ke 'zio_taskq_batch_pct/D'
zio_taskq_batch_pct:
zio_taskq_batch_pct:            5
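
The manual check above could be scripted for repeated use, for example when comparing several machines. This is only a sketch: the helper names `parse_tunable` and `check_tunable` are hypothetical, and it assumes `mdb -ke 'NAME/D'` prints a symbol header line followed by a `NAME: VALUE` line, as in the output shown above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of helios.

# Read `mdb -ke 'NAME/D'` output on stdin and print the decimal value.
# mdb prints a symbol header line ("NAME:") and then "NAME:   VALUE";
# only the second form has two fields.
parse_tunable() {
    awk -v name="$1" '$1 == (name ":") && NF == 2 { print $2 }'
}

# Compare the live kernel value of a tunable against the expected one.
# Needs enough privilege to run `mdb -k`.
check_tunable() {
    name=$1
    want=$2
    got=$(mdb -ke "$name/D" | parse_tunable "$name")
    if [ "$got" = "$want" ]; then
        echo "ok: $name = $got"
    else
        echo "MISMATCH: $name = ${got:-unknown}, expected $want" >&2
        return 1
    fi
}

# Example invocation (run on the target machine):
#   check_tunable zfs_vdev_aggregation_limit 0
#   check_tunable zfs_sync_taskq_batch_pct 5
#   check_tunable zio_taskq_batch_pct 5
```

Each call returns non-zero on a mismatch, so the three checks can be chained with `&&` in a post-boot smoke test.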

faithanalog commented 3 months ago

> Therefore, I do not expect this change to have a negative impact on cockroach/clickhouse (if anything, it might help them out). I have not verified this experimentally.

I think, in general, we definitely want to look at all the other known I/O workloads we have before we make a change like this.

I think cockroach in particular would be nice to benchmark. I think we have some people who know how to do that.

jclulow commented 1 month ago

To return to this, I wanted to make sure it was clear that I wasn't looking for a deep new performance investigation, so much as a few smoke tests just to make sure that we're not going to have to revert it inside of a week. If you can get the change tested on london or madrid it seems fine to proceed.

I have filed two follow-up issues so that we don't lose track of things:

oxidecomputer/stlouis#622, to cover future work on the taskq bit, which I believe is important
16813 "large zfs_vdev_aggregation_limit does not make sense for SSDs", for the rotational vs SSD aggregation thing

faithanalog commented 1 month ago

Anything in particular you're looking for as far as that testing goes?

The VM load testing was already done on london/madrid. As far as the API goes, that testing also exercises the control plane, in that my script is not very smart and frequently queries the IPs of 64 VMs with 64 parallel API requests, though it's not going out of its way to load down the API. The API is always a bit sluggish during the crucible load test, about equally so before and after this change, which I have been chalking up to system CPU/network load so far. I have not noticed any changes there.

faithanalog commented 1 month ago

Half-serious: we could default rackletteadm to always apply these tunables to everyone's setups and see if anyone complains.

jclulow commented 1 month ago

No, I think that sounds like enough testing. If it was not great under load to begin with and it's still not great now, it seems fine to move forward, though I think that's worth another ticket somewhere else.

oxidecomputer / helios

Tune zfs parameters for improved IO performance #170