openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

ZFS low throughput on rbd based vdev #3324

Open alitvak69 opened 9 years ago

alitvak69 commented 9 years ago

Sorry for the double post. I know fixing 0.6.4 takes priority over everything else, but I decided to post this question in the hope that one of the developers will give me a hint on where to start.

When testing our Ceph cluster I found a very strange problem. When I create a ZFS file system on top of the rbd device /dev/rbd0, no matter what tweaks I apply I cannot exceed 30 MB/sec over a 1 Gbit pipe. Setting sync=disabled has no effect. When I use XFS on the same device I come close to saturating 1 Gbit, i.e. I am writing at 109 MB/sec.

I don't have compression enabled on ZFS, so I can see the real throughput.

Can someone help explain this?
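For context, a minimal sketch of the kind of setup being compared here; the image/pool names and the dd invocation are assumptions, not taken from the report:

# map the RBD image and build a pool directly on it (illustrative names)
rbd map rbdpool/zfstest                 # exposes the image as /dev/rbd0
zpool create rbdlog2 /dev/rbd0
zfs create -o mountpoint=/cephlogs rbdlog2/cephlogs
# sequential write test; compare against mkfs.xfs + mount on the same device
dd if=/dev/zero of=/cephlogs/testfile bs=1M count=4096 conv=fsync status=progress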

zfs get all rbdlog2/cephlogs
NAME              PROPERTY              VALUE                  SOURCE
rbdlog2/cephlogs  type                  filesystem             -
rbdlog2/cephlogs  creation              Sun Apr 19  9:46 2015  -
rbdlog2/cephlogs  used                  4.62G                  -
rbdlog2/cephlogs  available             995G                   -
rbdlog2/cephlogs  referenced            4.62G                  -
rbdlog2/cephlogs  compressratio         1.00x                  -
rbdlog2/cephlogs  mounted               yes                    -
rbdlog2/cephlogs  quota                 none                   default
rbdlog2/cephlogs  reservation           none                   default
rbdlog2/cephlogs  recordsize            32K                    inherited from rbdlog2
rbdlog2/cephlogs  mountpoint            /cephlogs              local
rbdlog2/cephlogs  sharenfs              off                    default
rbdlog2/cephlogs  checksum              fletcher4              inherited from rbdlog2
rbdlog2/cephlogs  compression           off                    default
rbdlog2/cephlogs  atime                 off                    inherited from rbdlog2
rbdlog2/cephlogs  devices               on                     default
rbdlog2/cephlogs  exec                  on                     default
rbdlog2/cephlogs  setuid                on                     default
rbdlog2/cephlogs  readonly              off                    default
rbdlog2/cephlogs  zoned                 off                    default
rbdlog2/cephlogs  snapdir               hidden                 default
rbdlog2/cephlogs  aclinherit            restricted             default
rbdlog2/cephlogs  canmount              on                     default
rbdlog2/cephlogs  xattr                 sa                     inherited from rbdlog2
rbdlog2/cephlogs  copies                1                      default
rbdlog2/cephlogs  version               5                      -
rbdlog2/cephlogs  utf8only              off                    -
rbdlog2/cephlogs  normalization         none                   -
rbdlog2/cephlogs  casesensitivity       sensitive              -
rbdlog2/cephlogs  vscan                 off                    default
rbdlog2/cephlogs  nbmand                off                    default
rbdlog2/cephlogs  sharesmb              off                    default
rbdlog2/cephlogs  refquota              none                   default
rbdlog2/cephlogs  refreservation        none                   default
rbdlog2/cephlogs  primarycache          metadata               local
rbdlog2/cephlogs  secondarycache        metadata               inherited from rbdlog2
rbdlog2/cephlogs  usedbysnapshots       0                      -
rbdlog2/cephlogs  usedbydataset         4.62G                  -
rbdlog2/cephlogs  usedbychildren        0                      -
rbdlog2/cephlogs  usedbyrefreservation  0                      -
rbdlog2/cephlogs  logbias               throughput             local
rbdlog2/cephlogs  dedup                 off                    default
rbdlog2/cephlogs  mlslabel              none                   default
rbdlog2/cephlogs  sync                  disabled               inherited from rbdlog2
rbdlog2/cephlogs  refcompressratio      1.00x                  -
rbdlog2/cephlogs  written               4.62G                  -
rbdlog2/cephlogs  logicalused           4.62G                  -
rbdlog2/cephlogs  logicalreferenced     4.62G                  -
rbdlog2/cephlogs  snapdev               hidden                 default
rbdlog2/cephlogs  acltype               off                    default
rbdlog2/cephlogs  context               none                   default
rbdlog2/cephlogs  fscontext             none                   default
rbdlog2/cephlogs  defcontext            none                   default
rbdlog2/cephlogs  rootcontext           none                   default
rbdlog2/cephlogs  relatime              off                    default
rbdlog2/cephlogs  redundant_metadata    all                    default
rbdlog2/cephlogs  overlay               off                    default

zpool get all
NAME     PROPERTY                    VALUE                 SOURCE
rbdlog2  size                        1016G                 -
rbdlog2  capacity                    0%                    -
rbdlog2  altroot                     -                     default
rbdlog2  health                      ONLINE                -
rbdlog2  guid                        12884943537457662683  default
rbdlog2  version                     -                     default
rbdlog2  bootfs                      -                     default
rbdlog2  delegation                  on                    default
rbdlog2  autoreplace                 off                   default
rbdlog2  cachefile                   -                     default
rbdlog2  failmode                    wait                  default
rbdlog2  listsnapshots               off                   default
rbdlog2  autoexpand                  off                   default
rbdlog2  dedupditto                  0                     default
rbdlog2  dedupratio                  1.00x                 -
rbdlog2  free                        1011G                 -
rbdlog2  allocated                   4.63G                 -
rbdlog2  readonly                    off                   -
rbdlog2  ashift                      13                    local
rbdlog2  comment                     -                     default
rbdlog2  expandsize                  -                     -
rbdlog2  freeing                     0                     default
rbdlog2  fragmentation               0%                    -
rbdlog2  leaked                      0                     default
rbdlog2  feature@async_destroy       enabled               local
rbdlog2  feature@empty_bpobj         active                local
rbdlog2  feature@lz4_compress        active                local
rbdlog2  feature@spacemap_histogram  active                local
rbdlog2  feature@enabled_txg         active                local
rbdlog2  feature@hole_birth          active                local
rbdlog2  feature@extensible_dataset  enabled               local
rbdlog2  feature@embedded_data       active                local
rbdlog2  feature@bookmarks           enabled               local

Some settings are the result of my tweaking and can be changed back.

GregorKopka commented 9 years ago

With primarycache=metadata you might suffer read/modify/write issues. How do you write to the dataset (block size)?

Could you try with a fresh dataset using the defaults for:
primarycache (all)
recordsize (128K)
checksum (on)
logbias (latency)
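A minimal sketch of setting those defaults explicitly on a fresh test dataset (the dataset name is illustrative):

zfs create rbdlog2/test
zfs set primarycache=all rbdlog2/test
zfs set recordsize=128K rbdlog2/test
zfs set checksum=on rbdlog2/test
zfs set logbias=latency rbdlog2/test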

alitvak69 commented 9 years ago

This was tried and didn't make a difference; however:

Our engineer spent 12 hours researching the topic.

Bumping the parameters below helped on 0.6.4. I think he increased them roughly tenfold, but one needs to play with it as mileage varies.

zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active
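A sketch of bumping these at runtime through the module parameter interface; the values are illustrative (roughly the 10x mentioned above), not recommendations:

echo 10000 > /sys/module/zfs/parameters/zfs_vdev_max_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_sync_write_min_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_sync_write_max_active
echo 20    > /sys/module/zfs/parameters/zfs_vdev_async_write_min_active
echo 100   > /sys/module/zfs/parameters/zfs_vdev_async_write_max_active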

I hope it helps someone.

alitvak69 commented 8 years ago

Returning to the issue in the hope someone will have time to respond. With 0.6.5.5 the settings below speed up writing, but reading is horribly slow over a 10 Gb network.

zfs_vdev_max_active
zfs_vdev_sync_write_min_active
zfs_vdev_async_write_max_active
zfs_vdev_sync_write_max_active
zfs_vdev_async_write_min_active

It looks like the drop in read speed corresponds to the rbd block device utilization going to 100%. I am only doing a copy or rsync to a local drive; nothing else is accessing the partition with ZFS on top of the rbd block device.

Does anyone have a clue on where to start looking?

gmelikov commented 7 years ago

Closing as stale.

If it's still relevant, feel free to reopen.

gmelikov commented 6 years ago

Looks like the problem persists (http://list.zfsonlinux.org/pipermail/zfs-discuss/2018-February/030543.html), reopened.

happycouak commented 6 years ago

It seems something went wrong during the previous benchmarks, as bumping the parameters below (on the order of 10x the default values) actually improves sequential workloads significantly.

zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_max_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active

The thing is, I don't have any visibility into the potential downsides of bumping those parameters, so YMMV.

behlendorf commented 6 years ago

@DaFresh the default values were experimentally determined to give good performance for pools constructed from HDD/SSD devices. It's entirely possible that these aren't the best values for RBD devices, which may have very different performance characteristics. It would be great if you could post what values do work well for you so we could document those recommended tunings. The default values are:

/*
 * The maximum number of i/os active to each device.  Ideally, this will be >=
 * the sum of each queue's max_active.  It must be at least the sum of each
 * queue's min_active.
 */
uint32_t zfs_vdev_max_active = 1000;

/*
 * Per-queue limits on the number of i/os active to each device.  If the
 * number of active i/os is < zfs_vdev_max_active, then the min_active comes
 * into play. We will send min_active from each queue, and then select from
 * queues in the order defined by zio_priority_t.
 *
 * In general, smaller max_active's will lead to lower latency of synchronous
 * operations.  Larger max_active's may lead to higher overall throughput,
 * depending on underlying storage.
 *
 * The ratio of the queues' max_actives determines the balance of performance
 * between reads, writes, and scrubs.  E.g., increasing
 * zfs_vdev_scrub_max_active will cause the scrub or resilver to complete
 * more quickly, but reads and writes to have higher latency and lower
 * throughput.
 */
uint32_t zfs_vdev_sync_read_min_active = 10;
uint32_t zfs_vdev_sync_read_max_active = 10;
uint32_t zfs_vdev_sync_write_min_active = 10;
uint32_t zfs_vdev_sync_write_max_active = 10;
uint32_t zfs_vdev_async_read_min_active = 1;
uint32_t zfs_vdev_async_read_max_active = 3;
uint32_t zfs_vdev_async_write_min_active = 2;
uint32_t zfs_vdev_async_write_max_active = 10;
uint32_t zfs_vdev_scrub_min_active = 1;
uint32_t zfs_vdev_scrub_max_active = 2;

tdb commented 6 years ago

I've had some success with these values. At least, they were an improvement over the defaults. I haven't spent a lot of time tuning or testing.

# defaults given at the end of the line
options zfs zfs_max_recordsize=4194304          # 1048576
options zfs zfs_vdev_async_read_max_active=18   # 3
options zfs zfs_vdev_async_write_max_active=60  # 10
options zfs zfs_vdev_scrub_max_active=12        # 2
options zfs zfs_vdev_sync_read_max_active=60    # 10
options zfs zfs_vdev_sync_write_max_active=60   # 10

richardelling commented 6 years ago

The above are (mostly) ZIO scheduler tunables. More likely you need to adjust the write throttle tunables, as documented in the zfs-module-parameters man page section ZFS TRANSACTION DELAY.

But first... check the latency distribution from zpool iostat -w and see if there are outliers in the high latency buckets. If you don't have a version with the -w option, then you might try an external tool for measuring I/O latency, such as iolatency https://github.com/brendangregg/perf-tools/blob/master/examples/iolatency_example.txt
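For example (pool name and interval are illustrative):

zpool iostat -w rbdlog2 5        # per-vdev latency histograms during the workload
./iolatency 5                    # perf-tools fallback if zpool iostat lacks -w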

happycouak commented 6 years ago

@behlendorf RBD volume latency behaves differently from traditional hard drives, so this is what I ended up thinking too. Please find below the parameters and values that give me much better performance:

options zfs zfs_vdev_max_active=10000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=30
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=100
options zfs zfs_vdev_async_write_min_active=10

I will try more benchmarks, also with @tdb's values, when I get some time. Note that I am not a big fan of diverging from default settings, especially with ZFS, which has some internal magic, but it seems unavoidable in this context.

@richardelling from reading the ZFS TRANSACTION DELAY section, if my understanding is correct, would "zfs_delay_scale" alone suffice to achieve the same behavior (except maybe for zfs_vdev_max_active) as the previous tuning?
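A quick way to inspect the current values of the write-throttle tunables referenced here (standard module parameter paths, per the zfs-module-parameters man page):

grep . /sys/module/zfs/parameters/zfs_delay_scale \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
       /sys/module/zfs/parameters/zfs_dirty_data_max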

richardelling commented 6 years ago

To determine if zfs_vdev_*_max_active is appropriately set, use zpool iostat -q [interval] during your workload. If the number of "activ" I/Os is capped at the _max_active value and "pend" > "activ" for extended periods of time, then increasing _max_active allows more I/Os to be in flight to the device.

Ultimately, performance is determined by the latency of the I/Os. So if the bandwidth-delay product of the network (to/from RBD) is low, then injecting more concurrent I/Os (increasing *_max_active) can make more efficient use of the network. But this typically has no effect on the internal latency of the RBD device. This is why we use zpool iostat -w to see the latency distribution first, then look at the efficiency of the network with zpool iostat -q second.

For the write throttle, as described in ZFS TRANSACTION DELAY, a good starting point is zfs_dirty_data_max. If increasing this value doesn't improve write performance, then the throttle likely isn't an issue. If increasing it does improve performance, then the other related tunables come into play. The goal of the write throttle is to prevent performance from "falling off the cliff" when the device cannot quickly absorb the writes. For example, a device with a small write cache can perform very slowly when the write cache is filled, so the write throttle tries to apply back pressure on the write workload generator to prevent cliffhangers.
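A sketch of that starting point (the 4 GiB figure is an assumption; size it to your RAM and workload):

echo $((4 * 1024 * 1024 * 1024)) > /sys/module/zfs/parameters/zfs_dirty_data_max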

So the basic approach is to determine if you need more or less concurrent I/O in the pipeline (*_max_active) vs more or less write throttling. Sometimes you need to tune both.

npdgm commented 6 years ago

@richardelling, on the same systems with RBD vdevs described by @DaFresh, we're still getting disappointing performance for a 99% read workload, despite all attempts at tuning the zfs_vdev_*_active parameters. Your procedure for determining appropriate values using zpool iostat makes a lot of sense, and I understand the logic of it, but looking at all the metrics it appears something else in ZFS is preventing concurrent I/Os.

zpool iostat -q reveals we barely register any "pending" requests, even with a sub-second interval. Going to a longer interval even hides the active reads:

# zpool iostat -q 0.2
              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
pool1        357T  56.7T    297      4  33.5M  19.8K      0     24      0      0      0      0      0      0      0      0
pool1        357T  56.7T    421      4  49.7M  19.8K      0     64      0      0      0      0      0      0      0      0
pool1        357T  56.7T    426      4  50.4M  19.8K      0     34      0      0      0      0      0      0      0      0

# zpool iostat -q 2
              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
pool1        357T  56.7T    220     32  22.8M   838K      0      0      0      0      0      0      0      0      0      0
pool1        357T  56.7T    244     91  26.8M   735K      0      2      0      0      0      0      0      0      0      0
pool1        357T  56.7T    213      4  22.6M  20.0K      0      0      0      0      0      0      0      0      0      0

iostat is consistent with that lack of concurrency. As you can see, avgqu-sz is kept low by ZFS while handling lots of concurrent random reads.

# iostat -xm 2 vdb vdc
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vdb               0.00     0.00  102.00    1.50  9812.00     6.00   189.72     2.19   21.16   21.47    0.00   7.38  76.40
vdc               0.00     0.00  112.00    1.00 10882.00     4.00   192.67     2.35   20.99   21.18    0.00   7.45  84.20

You may find await and %util quite high here, but this Ceph backend can deliver good throughput with more in-flight requests. This was confirmed with fio benchmarks and other filesystems.
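A sketch of the kind of fio run that demonstrates the backend handles deeper queues (device path, block size, and runtime are assumptions):

fio --name=randread --filename=/dev/vdb --readonly --direct=1 --rw=randread \
    --bs=128k --iodepth=32 --runtime=60 --time_based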

So because zfs_vdev_*_active didn't increase concurrency over the block device, I'm left with two leads:

Do you have any insight into what can be changed to push as many read requests as possible to RBD? It's all about read performance for files in the range of 0.5 to 10 times the recordsize = 128k.

Cheers

npdgm commented 6 years ago

Adding a few printk calls confirms that vdev_nonrot=0 on all RBD devices. Also, I didn't mention that sync=disabled does not change the behaviour. Copying a file will issue sync reads only, although the writes from that copy are async.

peterska commented 5 years ago

I managed to get excellent sequential write performance by setting the following module parameters:

options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304

For these to work, the large_blocks zpool feature must be enabled. This improves sequential reads and writes on Ceph-backed block devices by coalescing reads and writes into 4 MB blocks, which is the Ceph block size. This avoids a lot of read-modify-write ops on the Ceph side.
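A sketch of that setup (pool/dataset names are illustrative; recordsize values above 1M additionally require zfs_max_recordsize to be raised as above):

zpool set feature@large_blocks=enabled rbdlog2
zfs set recordsize=1M rbdlog2/cephlogs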

prghix commented 5 years ago

Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values:

/etc/modprobe.d/zfs.conf

options zfs zfs_vdev_max_active=40000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=20000
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=20000
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

arthurd2 commented 3 years ago

No news about this?

ozkangoksu commented 3 years ago

Useful thread! I got decent throughput in our small setup (three replicas, 7x12TB drives) using these values:

/etc/modprobe.d/zfs.conf

options zfs zfs_vdev_max_active=40000
options zfs zfs_vdev_sync_read_max_active=100
options zfs zfs_vdev_sync_read_min_active=100
options zfs zfs_vdev_sync_write_max_active=100
options zfs zfs_vdev_sync_write_min_active=100
options zfs zfs_vdev_async_read_max_active=20000
options zfs zfs_vdev_async_read_min_active=10
options zfs zfs_vdev_async_write_max_active=20000
options zfs zfs_vdev_async_write_min_active=10
options zfs zfs_max_recordsize=4194304
options zfs zfs_vdev_aggregation_limit=4194304

These options doubled the performance. Pending operations and latency decreased.

TEST: 4 GB tarball write operation
ZFS RAID 1 SATA SSD:  0m9.9s
TUNED ZFS RBD:        0m17.7s
NON-TUNED ZFS RBD:    0m34.5s

But these tunings messed up random R/W performance. I'm still experimenting further. Do you have any advice?

prghix commented 3 years ago

We're still using these values from back in 2019, without any changes :/

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

arthurd2 commented 1 year ago

We have abandoned ZFS+RBD for now, changing our hosts to XFS. But from time to time we come back to check for updates.

@ozkangoksu, do you have a script to run these tests? I can replicate them here and compare.

ozkangoksu commented 1 year ago

@arthurd2 Unfortunately it has been a long time and I don't keep fio scripts; I write them for each use case. I don't remember whether I shared them with the mailing list along with the benchmark results. I will check when I get some free time.

We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and the CRC checking costs speed. RBD is not that efficient either; it's good for most cases, but its response time makes things even harder for ZFS.

prghix commented 1 year ago

We also don't use ZFS+RBD anymore because it's not efficient at all. To be honest, there is no way to make ZFS over RBD efficient. ZFS is not designed for speed; it's designed not to lose any data, and the CRC checking costs speed. RBD is not that efficient either; it's good for most cases, but its response time makes things even harder for ZFS.

I used to have ~20-30 MBps throughput on Mimic.

Now we have several Pacific/Quincy clusters and I'm at about 1/10 of the former throughput.

3 MB/sec is awful... I tried to tune everything according to the manual:

https://openzfs.org/wiki/ZFS_on_high_latency_devices

Looks like I have working aggregation... but throughput is PAINFULLY slow :/
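One way to check aggregation is the request-size histogram, which shows separate columns for individual and aggregated I/Os:

zpool iostat -r 5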

prghix commented 1 year ago

btw: I'm talking about saving ZFS snapshots (=backups) on RBD devices...

serjponomarev commented 9 months ago

@behlendorf

Hello. I've encountered the same issue as mentioned above and tried the parameters discussed in this thread, but without success. Additionally, while monitoring zpool iostat -q 1, I noticed that the number of active syncq_read operations during random reads never exceeds 1. In other words, whether testing with a depth of 1 or a depth of 32, I was getting around 1000-1500 IOPS, and in zfs iostat I observed a comparable latency of ~1ms-500us, matching the depth-1 tests.

When using a depth of 1 and 32 threads, the number of requests matched the number of active synchronous read operations in zfs iostat -r: it was 32. To address this, I experimented with a zvol, and it worked. When testing with a single thread and a depth of 32 on a raw zvol, I achieved performance similar to testing ZFS with 32 threads and a depth of 1 - approximately 10K IOPS.

Next, I tested creating a zpool over a zvol, and I encountered the same issue with a depth limitation of 1. Then, I formatted the zvol to XFS, created and populated a test file, and ran random read tests. The results matched the previous test with 32 threads and a depth of 1, totaling around 10K IOPS.

In conclusion, using zvol resolves the problem for RBD because zvol facilitates request aggregation. All tests were conducted with primarycache=metadata.
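A sketch of the zvol-in-the-middle layering described above (names, size, and volblocksize are assumptions):

zfs create -s -V 10T -o volblocksize=64k rbdlog2/vol1
mkfs.xfs /dev/zvol/rbdlog2/vol1
mount /dev/zvol/rbdlog2/vol1 /mnt/vol1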

OS: Ubuntu 22.04
Ceph: Quincy, 17.2.6
ZFS: 2.1.5

How can this experience be applied to ZFS over RBD without using zvol? If not possible, how safe is it to use large-sized zvols, for example, 10-20-30 TB? Which file system is best for a large zvol?

serjponomarev commented 9 months ago

@behlendorf


In general, if anyone is having problems with poor performance with ZFS over RBD, then you most likely have "1 thread = 1 outstanding I/O" performance (regardless of the queue depth you set).

Performance on such a configuration scales with threads. To address this issue, you either need to use more threads or aggregate the depth of the requests. I found two solutions for aggregating depth:

  1. zvol: the queue depth is regulated by the number of worker threads in /sys/module/zfs/parameters/zvol_threads (default is 32).
  2. NFS server: by default, the number of nfsd processes in most distributions is 8. Increasing the number of nfsd processes increases IOPS, because of the larger number of nfsd threads. For local use, mount NFS via the loopback interface. For most distributions the nfsd process count is configured under /etc/default; for Ubuntu 22.04 and higher, read this manual. (See the configuration sketch after this list.)
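A configuration sketch for both options; paths and values are assumptions and distro/version dependent:

# /etc/modprobe.d/zfs.conf -- more zvol worker threads (reload the module to apply)
options zfs zvol_threads=64

# Ubuntu 22.04+ keeps the nfsd thread count in /etc/nfs.conf:
#   [nfsd]
#   threads=32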

For zvol at its default (32 threads) and an NFS server with 32 processes, combined with these parameters, I got maximum performance:

options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_write_max_active=32
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_aggregation_limit=1048576
options zfs zfs_vdev_aggregation_limit_non_rotating=1048576
options zfs zfs_dirty_data_max=1342177280


oetiker commented 3 months ago

This issue seems to be lingering... Has anyone done structured testing to determine appropriate settings for sensible performance of an RBD-backed ZFS setup? The features of such a combo are just too perfect to leave it in this state.

With that many tunables, I am sure there must be a good set to make these setups work nicely. Anyone interested in a concerted effort to get the performance issues defined and sorted?

Maybe writing a script that automates the approach defined in https://openzfs.org/wiki/ZFS_on_high_latency_devices would be a solution...