Open alitvak69 opened 9 years ago
With primarycache=metadata you might suffer read/modify/write issues. How do you write to the dataset (block size)?
Could you try with a fresh dataset using the defaults for: primarycache (all) recordsize (128K) checksum (on) logbias (latency)
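For example, something along these lines (pool, dataset, and device names are placeholders):

zpool create testpool /dev/rbd0
zfs create testpool/fresh
# confirm the defaults are in effect
zfs get primarycache,recordsize,checksum,logbias testpool/fresh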
This was tried and didn't make a difference. However, our engineer spent 12 hours researching the topic. Bumping the parameters below helped on 0.6.4. I think he increased them 10x, but one needs to play with it as mileage varies.
zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active
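For illustration, a 10x bump over the defaults can be applied at runtime like this (a sketch only; the exact values need experimentation):

echo 10000 > /sys/module/zfs/parameters/zfs_vdev_max_active
echo 100 > /sys/module/zfs/parameters/zfs_vdev_sync_write_min_active
echo 100 > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
echo 20 > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 100 > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active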
I hope it helps someone.
Returning to the issue in the hope that someone will have time to respond. With 0.6.5.5, the settings below speed up writing, but reading is horribly slow over a 10 Gb network:
zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active
It looks like the drop in read speed corresponds to the rbd block device utilization going to 100%. I am only doing a copy or rsync to a local drive; nothing else is accessing the partition with zfs on top of the rbd block device.
Does anyone have a clue on where to start looking?
Closing as stale. If it's still relevant, feel free to reopen.
Looks like the problem persists (http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-February/030543.html), so I've reopened it.
It seems something went wrong during my previous benchmarks, because bumping the parameters below (to 10x their default values) actually improves sequential workloads significantly.
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_max_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
The thing is, I don't have any visibility into the potential downsides of bumping those parameters, so YMMV.
@DaFresh the default values were experimentally determined to give good performance for pools constructed from HDD/SSD devices. It's entirely possible that these aren't the best values for RBD devices, which may have very different performance characteristics. It would be great if you could post what values work well for you so we could document those recommended tunings. The default values are:
/*
* The maximum number of i/os active to each device. Ideally, this will be >=
* the sum of each queue's max_active. It must be at least the sum of each
* queue's min_active.
*/
uint32_t zfs_vdev_max_active = 1000;
/*
* Per-queue limits on the number of i/os active to each device. If the
* number of active i/os is < zfs_vdev_max_active, then the min_active comes
* into play. We will send min_active from each queue, and then select from
* queues in the order defined by zio_priority_t.
*
* In general, smaller max_active's will lead to lower latency of synchronous
* operations. Larger max_active's may lead to higher overall throughput,
* depending on underlying storage.
*
* The ratio of the queues' max_actives determines the balance of performance
* between reads, writes, and scrubs. E.g., increasing
* zfs_vdev_scrub_max_active will cause the scrub or resilver to complete
* more quickly, but reads and writes to have higher latency and lower
* throughput.
*/
uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 2;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;
I've had some success with these values. At least, they were an improvement over the defaults. I haven't spent a lot of time tuning or testing.
# defaults given at the end of the line
options zfs zfs_max_recordsize=4194304           # 1048576
options zfs zfs_vdev_async_read_max_active=18    # 3
options zfs zfs_vdev_async_write_max_active=60   # 10
options zfs zfs_vdev_scrub_max_active=12         # 2
options zfs zfs_vdev_sync_read_max_active=60     # 10
options zfs zfs_vdev_sync_write_max_active=60    # 10
The above are (mostly) ZIO scheduler tunables. More likely you need to adjust the write throttle tunables, as documented in the zfs-module-parameters man page section ZFS TRANSACTION DELAY.
But first... check the latency distribution from zpool iostat -w and see if there are outliers in the high-latency buckets. If you don't have a version with the -w option, then you might try an external tool for measuring I/O latency, such as iolatency (https://github.com/brendangregg/perf-tools/blob/master/examples/iolatency_example.txt).
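For example (pool name is a placeholder):

zpool iostat -w tank 5    # per-vdev latency histograms, 5-second intervals
# or, on versions without -w, perf-tools' iolatency gives block-layer latency histograms:
./iolatency 5 12          # 5-second interval, 12 samples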
@behlendorf RBD volume latency behaves differently from traditional hard drives, so this is what I ended up thinking too. Please find below the parameters and values that give me much better performance:
options zfs zfs_vdev_max_active=10000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=100
options zfs zfs_vdev_async_write_min_active=10
I will try more benchmarks, also with @tdb's values, when I get some time. Note that I am not a big fan of diverging from default settings, especially with ZFS, which has some internal magic, but it seems unavoidable in this context.
@richardelling, reading the ZFS TRANSACTION DELAY section, if my understanding is correct, would "zfs_delay_scale" alone suffice to achieve the same behavior (except maybe for zfs_vdev_max_active) as the previous tuning?
To determine if the zfs_vdev_*_max_active values are appropriately set, use zpool iostat -q [interval] during your workload. If the number of "activ" I/Os is capped at the corresponding _max_active and "pend" > "activ" for extended periods of time, then increasing _max_active allows more I/Os to be in flight to the device.
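For example (pool name is a placeholder):

zpool iostat -q tank 1    # compare the pend and activ columns per queue each second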
Ultimately, performance is determined by the latency of the I/Os. So if the bandwidth-delay product of the network (to/from RBD) is low, then injecting more concurrent I/Os (increasing *_max_active) can make more efficient use of the network. But this typically has no effect on the internal latency of the RBD device. This is why we use zpool iostat -w to see the latency distribution first, then look at the efficiency of the network with zpool iostat -q second.
For the write throttle, as described in ZFS TRANSACTION DELAY, a good starting point is zfs_dirty_data_max. If increasing this value doesn't improve write performance, then the throttle likely isn't the issue. If increasing it does improve performance, then the other related tunables come into play. The goal of the write throttle is to prevent performance from "falling off the cliff" when the device cannot quickly absorb the writes. For example, a device with a small write cache can perform very slowly once the write cache fills, so the write throttle tries to apply back pressure on the write workload generator to prevent cliffhangers.
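A quick way to test that at runtime (the 4 GiB value is only an example; the parameter is in bytes):

cat /sys/module/zfs/parameters/zfs_dirty_data_max    # current value
echo 4294967296 > /sys/module/zfs/parameters/zfs_dirty_data_max
# re-run the write workload and compare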
So the basic approach is to determine if you need more or less concurrent I/O in the pipeline (*_max_active) vs more or less write throttling. Sometimes you need to tune both.
@richardelling, on the same systems with RBD vdevs described by @DaFresh, we're still getting disappointing performance for a 99% read workload, despite all attempts at tuning the zfs_vdev_*_active parameters. Your procedure for determining appropriate values using zpool iostat makes much sense, and I understand the logic of it, but looking at all the metrics it appears something else in ZFS is preventing concurrent I/Os. zpool iostat -q reveals we barely register any "pend"ing requests with a sub-second interval. Going up to a longer interval will even hide active reads:
# zpool iostat -q 0.2
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
pool1 357T 56.7T 297 4 33.5M 19.8K 0 24 0 0 0 0 0 0 0 0
pool1 357T 56.7T 421 4 49.7M 19.8K 0 64 0 0 0 0 0 0 0 0
pool1 357T 56.7T 426 4 50.4M 19.8K 0 34 0 0 0 0 0 0 0 0
# zpool iostat -q 2
capacity operations bandwidth syncq_read syncq_write asyncq_read asyncq_write scrubq_read
pool alloc free read write read write pend activ pend activ pend activ pend activ pend activ
---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
pool1 357T 56.7T 220 32 22.8M 838K 0 0 0 0 0 0 0 0 0 0
pool1 357T 56.7T 244 91 26.8M 735K 0 2 0 0 0 0 0 0 0 0
pool1 357T 56.7T 213 4 22.6M 20.0K 0 0 0 0 0 0 0 0 0 0
iostat is consistent with that lack of concurrency. As you can see, avgqu-sz is kept low by ZFS while handling lots of concurrent random reads.
# iostat -xm 2 vdb vdc
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vdb 0.00 0.00 102.00 1.50 9812.00 6.00 189.72 2.19 21.16 21.47 0.00 7.38 76.40
vdc 0.00 0.00 112.00 1.00 10882.00 4.00 192.67 2.35 20.99 21.18 0.00 7.45 84.20
You may find await and %util quite high here, but this Ceph backend can deliver good throughput with more in-flight requests. This was confirmed with fio benchmarks and other filesystems.
So, because zfs_vdev_*_active didn't increase concurrency over the block device, I'm left with two leads:
1. According to zpool iostat -q, all I/O is synchronous. It's not caused by the workload (an NFS async export).
2. I've found https://github.com/zfsonlinux/zfs/commit/b39c22b73c0e8016381057c2240570f7af992def and #3833 from @behlendorf (https://github.com/zfsonlinux/zfs/pull/3833/commits/393ee23e40b89932a80f08592518d2df1de1b01e), which suggest READ_SYNC is enforced for non-rotational devices. RBD/virtio devices do have the queue/rotational flag set to 1, so I believe it's not affecting them. Could be wrong also.
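For reference, the flag in question can be read directly from sysfs (the device name is an example):

cat /sys/block/rbd0/queue/rotational    # 1 = rotational, 0 = non-rotational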
Do you have any insight on what can be changed to push as many read requests as possible to RBD? It's all about read performance for files in the range of 0.5 to 10 times the recordsize (128k).
Cheers
Adding a few printk calls confirms that vdev_nonrot=0 on all RBD devices.
Also, I didn't mention that sync=disabled does not change the behaviour: copying a file issues sync reads only, although the writes from that copy are async.
I managed to get excellent sequential write performance by setting the following module parameters:
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304
For these to work, the large_blocks zpool feature must be enabled. This improves sequential reads and writes on Ceph-backed block devices by coalescing reads and writes into 4 MB blocks, which is the Ceph block size. This avoids a lot of read/modify/write ops on the Ceph side.
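A sketch of the whole recipe (pool/dataset names are placeholders; the module options above go in /etc/modprobe.d/zfs.conf and need a module reload or reboot):

zpool set feature@large_blocks=enabled tank
zfs set recordsize=4M tank/data    # only accepted once zfs_max_recordsize is raised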
Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values:
/etc/modprobe.d/zfs.conf
options zfs zfs_vdev_max_active=40000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=20000
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=20000
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
No news about this?
Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values:
/etc/modprobe.d/zfs.conf
options zfs zfs_vdev_max_active=40000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=20000
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=20000
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304
These options doubled the performance: pending operations and latency decreased.
TEST: 4 GB tarball write operation:
ZFS RAID 1 SATA SSD: 0m9.9s
TUNED ZFS RBD: 0m17.7s
NON-TUNED ZFS RBD: 0m34.5s
But these tunings messed up random RW performance. I'm still trying to improve things further. Do you have any advice?
We're still using these values back from 2019 without any changes :/
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
We have abandoned ZFS+RBD for now, changing our hosts to XFS. But from time to time we come back to check on updates.
@ozkangoksu , do you have a script to make these tests? I can replicate here and compare.
@arthurd2 Unfortunately it has been a long time and I don't keep fio scripts; I write them per use case. I don't remember whether I shared them with the mailing list when sharing the benchmark results. I will check when I get some free time.
We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and checksumming costs speed. RBD is not that efficient either; it's good for most cases, but its response times make things even harder for ZFS.
I used to get ~20-30 MB/s throughput on Mimic. Now we have several Pacific/Quincy clusters and I'm at about 1/10 of the former throughput.
3 MB/s is awful... I tried to tune everything according to the manual: https://openzfs.org/wiki/ZFS_on_high_latency_devices
It looks like I have working aggregation, but throughput is PAINFULLY slow :/
Btw: I'm talking about saving ZFS snapshots (=backups) on RBD devices...
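For what it's worth, aggregation can be verified with the request-size histograms (pool name is an example):

zpool iostat -r rbdpool 5    # the "ind" vs "agg" columns show how much I/O is being aggregated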
@behlendorf
Hello. I've encountered the same issue as mentioned above and tried the parameters discussed in this thread, but without success. Additionally, while monitoring zpool iostat -q 1, I noticed that the number of syncq_read operations during random reads never exceeds 1. In other words, whether testing with a depth of 1 or a depth of 32, I was getting around 1000-1500 IOPS, and in zfs iostat I observed a comparable latency of ~1ms-500us, corresponding to the depth-1 tests.
When using a depth of 1 and 32 threads, the number of requests matched the number of active synchronous read operations in zfs iostat -r, and it was 32. To address this issue, I experimented with a zvol, and it worked: when testing with a single thread and a depth of 32 on a raw zvol, I achieved performance similar to testing ZFS with 32 threads and a depth of 1 - approximately 10K IOPS.
Next, I tested creating a zpool over a zvol, and I encountered the same depth-of-1 limitation. Then I formatted the zvol with XFS, created and populated a test file, and ran random read tests. The results matched the previous test with 32 threads and a depth of 1, totaling around 10K IOPS.
In conclusion, using a zvol resolves the problem for RBD because the zvol facilitates request aggregation. All tests were conducted with primarycache=metadata.
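For anyone wanting to reproduce the test, a fio invocation along these lines matches what's described (device path, block size, and runtime are examples):

# single-thread, depth-32 random read against a raw zvol
fio --name=randread --filename=/dev/zvol/tank/vol1 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=32 --numjobs=1 --runtime=60 --time_based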
OS: Ubuntu 22.04; Ceph: Quincy 17.2.6; ZFS: 2.1.5
How can this experience be applied to ZFS over RBD without using a zvol? If that's not possible, how safe is it to use large zvols, for example, 10-20-30 TB? And which file system is best for a large zvol?
In general, if anyone is having problems with poor performance in ZFS over RBD, then you most likely get one outstanding I/O per thread (regardless of the queue depth you set). Performance in such a configuration scales with threads. To address this issue, you either need to use more threads or aggregate the depth of the requests. I found two solutions for aggregating depth:
1. zvol: the depth is regulated by the number of threads in /sys/module/zfs/parameters/zvol_threads (default is 32).
2. nfs server: by default, the number of nfsd processes in most distributions is 8. When you raise the number of nfsd processes, IOPS increases with the number of nfsd threads. For local use, mount NFS via the loopback interface (see the sketch after the parameter block below). For most distributions, the nfs server process count is configured in /etc/default; for Ubuntu 22.04 and higher, read this manual.
For a zvol with the default 32 threads and an nfs server with 32 processes, I got maximum performance with these parameters:
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_aggregation_limit=1048576
options zfs zfs_vdev_aggregation_limit_non_rotating=1048576
options zfs zfs_dirty_data_max=1342177280
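For the nfsd side, a sketch on a Debian-family system (paths and variable names differ per distro; newer releases configure the thread count in /etc/nfs.conf under [nfsd] instead):

# raise the nfsd thread count in /etc/default/nfs-kernel-server
sudo sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=32/' /etc/default/nfs-kernel-server
sudo systemctl restart nfs-kernel-server
# for local use, mount the export over the loopback interface
sudo mount -t nfs localhost:/tank/data /mnt/data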
This issue seems to be lingering... Has anyone done structured testing to determine appropriate settings for sensible performance of an RBD-backed ZFS setup? The features of such a combo are just too perfect to leave it in this state.
With that many tunables, I am sure there must be a good set to make these setups work nicely. Anyone interested in a concerted effort to get the performance issues defined and sorted?
Maybe writing a script that automates the approach defined in https://openzfs.org/wiki/ZFS_on_high_latency_devices would be a solution...
Sorry for the double post. I know fixing 0.6.4 takes priority over everything else; however, I decided to post this question in the hope that one of the developers will give me a hint where to start.
While testing our Ceph cluster I found a very strange problem. When I create a zfs file system on top of the rbd device /dev/rbd0, no matter what tweaks I do I cannot exceed 30 MB/s on a 1 Gbit pipe. Setting sync=disabled has no effect. When I use xfs on the same device I come close to saturating 1 Gbit, i.e. I am writing at 109 MB/s.
I don't have compression enabled on zfs, so I can see the real throughput.
Can someone help explain this?
zfs get all rbdlog2/cephlogs
NAME              PROPERTY              VALUE                  SOURCE
rbdlog2/cephlogs  type                  filesystem             -
rbdlog2/cephlogs  creation              Sun Apr 19 9:46 2015   -
rbdlog2/cephlogs  used                  4.62G                  -
rbdlog2/cephlogs  available             995G                   -
rbdlog2/cephlogs  referenced            4.62G                  -
rbdlog2/cephlogs  compressratio         1.00x                  -
rbdlog2/cephlogs  mounted               yes                    -
rbdlog2/cephlogs  quota                 none                   default
rbdlog2/cephlogs  reservation           none                   default
rbdlog2/cephlogs  recordsize            32K                    inherited from rbdlog2
rbdlog2/cephlogs  mountpoint            /cephlogs              local
rbdlog2/cephlogs  sharenfs              off                    default
rbdlog2/cephlogs  checksum              fletcher4              inherited from rbdlog2
rbdlog2/cephlogs  compression           off                    default
rbdlog2/cephlogs  atime                 off                    inherited from rbdlog2
rbdlog2/cephlogs  devices               on                     default
rbdlog2/cephlogs  exec                  on                     default
rbdlog2/cephlogs  setuid                on                     default
rbdlog2/cephlogs  readonly              off                    default
rbdlog2/cephlogs  zoned                 off                    default
rbdlog2/cephlogs  snapdir               hidden                 default
rbdlog2/cephlogs  aclinherit            restricted             default
rbdlog2/cephlogs  canmount              on                     default
rbdlog2/cephlogs  xattr                 sa                     inherited from rbdlog2
rbdlog2/cephlogs  copies                1                      default
rbdlog2/cephlogs  version               5                      -
rbdlog2/cephlogs  utf8only              off                    -
rbdlog2/cephlogs  normalization         none                   -
rbdlog2/cephlogs  casesensitivity       sensitive              -
rbdlog2/cephlogs  vscan                 off                    default
rbdlog2/cephlogs  nbmand                off                    default
rbdlog2/cephlogs  sharesmb              off                    default
rbdlog2/cephlogs  refquota              none                   default
rbdlog2/cephlogs  refreservation        none                   default
rbdlog2/cephlogs  primarycache          metadata               local
rbdlog2/cephlogs  secondarycache        metadata               inherited from rbdlog2
rbdlog2/cephlogs  usedbysnapshots       0                      -
rbdlog2/cephlogs  usedbydataset         4.62G                  -
rbdlog2/cephlogs  usedbychildren        0                      -
rbdlog2/cephlogs  usedbyrefreservation  0                      -
rbdlog2/cephlogs  logbias               throughput             local
rbdlog2/cephlogs  dedup                 off                    default
rbdlog2/cephlogs  mlslabel              none                   default
rbdlog2/cephlogs  sync                  disabled               inherited from rbdlog2
rbdlog2/cephlogs  refcompressratio      1.00x                  -
rbdlog2/cephlogs  written               4.62G                  -
rbdlog2/cephlogs  logicalused           4.62G                  -
rbdlog2/cephlogs  logicalreferenced     4.62G                  -
rbdlog2/cephlogs  snapdev               hidden                 default
rbdlog2/cephlogs  acltype               off                    default
rbdlog2/cephlogs  context               none                   default
rbdlog2/cephlogs  fscontext             none                   default
rbdlog2/cephlogs  defcontext            none                   default
rbdlog2/cephlogs  rootcontext           none                   default
rbdlog2/cephlogs  relatime              off                    default
rbdlog2/cephlogs  redundant_metadata    all                    default
rbdlog2/cephlogs  overlay               off                    default
zpool get all
NAME     PROPERTY                    VALUE                  SOURCE
rbdlog2  size                        1016G                  -
rbdlog2  capacity                    0%                     -
rbdlog2  altroot                     -                      default
rbdlog2  health                      ONLINE                 -
rbdlog2  guid                        12884943537457662683   default
rbdlog2  version                     -                      default
rbdlog2  bootfs                      -                      default
rbdlog2  delegation                  on                     default
rbdlog2  autoreplace                 off                    default
rbdlog2  cachefile                   -                      default
rbdlog2  failmode                    wait                   default
rbdlog2  listsnapshots               off                    default
rbdlog2  autoexpand                  off                    default
rbdlog2  dedupditto                  0                      default
rbdlog2  dedupratio                  1.00x                  -
rbdlog2  free                        1011G                  -
rbdlog2  allocated                   4.63G                  -
rbdlog2  readonly                    off                    -
rbdlog2  ashift                      13                     local
rbdlog2  comment                     -                      default
rbdlog2  expandsize                  -                      -
rbdlog2  freeing                     0                      default
rbdlog2  fragmentation               0%                     -
rbdlog2  leaked                      0                      default
rbdlog2  feature@async_destroy       enabled                local
rbdlog2  feature@empty_bpobj         active                 local
rbdlog2  feature@lz4_compress        active                 local
rbdlog2  feature@spacemap_histogram  active                 local
rbdlog2  feature@enabled_txg         active                 local
rbdlog2  feature@hole_birth          active                 local
rbdlog2  feature@extensible_dataset  enabled                local
rbdlog2  feature@embedded_data       active                 local
rbdlog2  feature@bookmarks           enabled                local
Some settings are the result of my tweaking and can be changed back.