openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Prefer read from fast/low-latency device in asymmetric mirror #11234

Open inkdot7 opened 3 years ago

inkdot7 commented 3 years ago

Describe the feature you would like to see added to OpenZFS

When using a low-latency device (e.g. an Optane drive) together with normal flash devices in a mirror, reads should preferentially (almost always) go to the low-latency device, with the other flash devices acting mainly as redundancy.

A typical use case for such a mirror is as the special allocation class vdev, i.e. for metadata and small files.

With #4334 this was handled for asymmetric mirrors consisting of SSD+HDD combinations, by using the hint from the Linux kernel that HDDs are rotational while SSDs are not, and then preferentially using the non-rotational device (SSD). This heuristic however does not work when all involved devices are non-rotational flash.

A workaround is possible by marking the slower devices as rotational in the kernel (echo 1 > /sys/block/sdX/queue/rotational), before importing the pool.

It would be nice if such a hint about individual devices could be provided within OpenZFS, similar to the write-mostly flag in Linux software RAID.

Alternatively, it would perhaps be possible to use the response-time statistics already available in the Linux kernel (shown as r_await by iostat -x) to dynamically choose a preferred device when the response times differ significantly.
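A minimal sketch (in C) of how such a static hint could bias mirror child selection, assuming a hypothetical per-child prefer-read flag; the type and field names are illustrative and not taken from the actual OpenZFS code:

/*
 * Hypothetical sketch only -- not the actual OpenZFS mirror code.
 * Assume each mirror child carries a user-settable prefer-read flag
 * meaning "this is the low-latency device, read from it if possible".
 */
typedef struct mirror_child {
	int	mc_readable;	/* child is healthy and can serve reads */
	int	mc_prefer_read;	/* hypothetical hint: low-latency device */
	int	mc_load;	/* outstanding I/Os (existing notion of load) */
} mirror_child_t;

static int
pick_read_child(mirror_child_t *mc, int children)
{
	int c, best = -1, best_pref = -1;

	for (c = 0; c < children; c++) {
		if (!mc[c].mc_readable)
			continue;
		if (mc[c].mc_prefer_read &&
		    (best_pref == -1 || mc[c].mc_load < mc[best_pref].mc_load))
			best_pref = c;	/* least-loaded preferred child */
		if (best == -1 || mc[c].mc_load < mc[best].mc_load)
			best = c;	/* least-loaded readable child overall */
	}
	/* Fall back to any readable child if no preferred one is usable. */
	return (best_pref != -1 ? best_pref : best);
}

Falling back to the least-loaded non-preferred child when no preferred child is readable would keep the redundancy semantics unchanged.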

How will this feature improve OpenZFS?

Better performance at lower cost (only one low-latency device needed per mirror).

Additional context

SSDs and normal NVMe devices typically have read latencies around 100 microseconds; Optane devices have latencies around 10 microseconds. This can have a considerable impact on read performance:

On one tested filesystem, comparing without and with the above workaround: listing all files (time find . -type f > allfiles.txt) after a fresh pool import dropped from 23.1 to 16.5 seconds, and reading a preselected list of 10000 files (most of them small) with cat allfiles_10000_1.txt | time xargs -n 100 md5sum > /dev/null dropped from 10.9 to 7.1 seconds. Roughly a 29-35 % improvement. (In this test, 5 % of the used space is on the allocation class special device.)

This was tested on Debian buster, amd64, with ZFS 0.8.4-2~bpo10+1. (I have not found any changelog messages or commits which suggest that behaviour has changed since.)

GregorKopka commented 3 years ago

The problem with the workaround is that reads will still be queued onto the slower devices, which will destroy performance (both latency and throughput), especially if metadata demand reads are handled that way.

So +1 to write_mostly (or whatever name fits: write_only, cheap_redundancy, emergency_reads_only, ...) functionality for mirrors, which would enforce that reads from a leaf vdev marked as such only happen in recovery scenarios (resilver, read-error recovery).

inkdot7 commented 3 years ago

The workaround works rather nicely. With only one sequence of accesses, all reads go to the fast device (sdg and sdh marked as rotational):

-------------  -----  -----  -----  -----  -----  -----
                 capacity     operations     bandwidth
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
zfreke2         301G  1,11T  9,07K      0   128M      0
  raidz2        286G  1,03T  3,99K      0   113M      0
    sda4           -      -    553      0  13,4M      0
    sdb1           -      -    884      0  25,5M      0
    sdc1           -      -    630      0  17,9M      0
    sdd1           -      -    659      0  18,2M      0
    sde1           -      -    844      0  25,4M      0
    sdf1           -      -    524      0  13,0M      0
special            -      -      -      -      -      -
  mirror       14,5G  85,0G  5,08K      0  15,2M      0
    nvme0n1p1      -      -  5,08K      0  15,2M      0
    sdg1           -      -      0      0      0      0
    sdh1           -      -      0      0      0      0
-------------  -----  -----  -----  -----  -----  -----

with only some occasional reads from the slower devices in the special vdev:

special            -      -      -      -      -      -
  mirror       14,5G  85,0G  2,49K      0  11,5M      0
    nvme0n1p1      -      -  2,49K      0  11,4M      0
    sdg1           -      -      0      0      0      0
    sdh1           -      -      3      0  90,3K      0
-------------  -----  -----  -----  -----  -----  -----

When running several (here 5) access sequences in parallel, some reads are sent to the slow devices:

-------------  -----  -----  -----  -----  -----  -----
                 capacity     operations     bandwidth
pool           alloc   free   read  write   read  write
-------------  -----  -----  -----  -----  -----  -----
zfreke2         301G  1,11T  13,6K      0   276M      0
  raidz2        286G  1,03T  7,81K      0   221M      0
    sda4           -      -  1,39K      0  36,3M      0
    sdb1           -      -  1,24K      0  28,5M      0
    sdc1           -      -  1,14K      0  32,3M      0
    sdd1           -      -  1,42K      0  41,4M      0
    sde1           -      -  1,38K      0  41,8M      0
    sdf1           -      -  1,24K      0  40,7M      0
special            -      -      -      -      -      -
  mirror       14,5G  85,0G  5,79K      0  54,2M      0
    nvme0n1p1      -      -  5,69K      0  50,8M      0
    sdg1           -      -      3      0  57,4K      0
    sdh1           -      -    102      0  3,36M      0
-------------  -----  -----  -----  -----  -----  -----

That is perhaps not optimal in this case, but the loss is not very large.

If the long-latency devices had also been NVMe-style devices, they would likely have as much bandwidth as the Optane drive. In that case, they should probably take some requests for large chunks of data, where the longer latency matters less, whenever the Optane drive already has many outstanding requests.
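A minimal sketch of such a dispatch heuristic, assuming hypothetical per-child estimates of average latency, bandwidth and queue depth (none of these names come from the OpenZFS code): estimate the completion time on each child as queuing delay plus access latency plus transfer time, and issue the read to the minimum.

#include <stdint.h>

/*
 * Hypothetical sketch only. Estimated completion time of a read of
 * io_size bytes on a mirror child:
 *   pending I/Os * average latency   (rough queuing delay)
 * + average latency                  (this I/O's own access latency)
 * + io_size / bandwidth              (transfer time)
 * A large read can then be cheaper on a slow-but-wide device when the
 * low-latency device already has a deep queue.
 */
typedef struct child_est {
	uint64_t ce_lat_ns;	/* rolling average latency, nanoseconds */
	uint64_t ce_bw_bps;	/* estimated bandwidth, bytes per second */
	uint32_t ce_pending;	/* outstanding I/Os on this child */
} child_est_t;

static uint64_t
est_completion_ns(const child_est_t *ce, uint64_t io_size)
{
	uint64_t xfer_ns = ce->ce_bw_bps == 0 ? 0 :
	    (io_size * 1000000000ULL) / ce->ce_bw_bps;

	return ((uint64_t)(ce->ce_pending + 1) * ce->ce_lat_ns + xfer_ns);
}

static int
pick_child_by_cost(const child_est_t *ce, int children, uint64_t io_size)
{
	int c, best = 0;

	for (c = 1; c < children; c++) {
		if (est_completion_ns(&ce[c], io_size) <
		    est_completion_ns(&ce[best], io_size))
			best = c;
	}
	return (best);
}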

inkdot7 commented 3 years ago

Follow-up: the values below are a 60 s average with 5 parallel access sequences. In the right-most column I have additionally calculated the average operation size (simply bandwidth/operations). The average size read from the long-latency devices (sdg and sdh) is larger than the average size served by the low-latency device.

                 capacity     operations     bandwidth      avg op-size
pool           alloc   free   read  write   read  write     (kB)
-------------  -----  -----  -----  -----  -----  -----
zfreke2         301G  1,11T  16,1K      0   406M      0     30.8
  raidz2        286G  1,03T  9,36K      0   368M      0     39.3
    sda4           -      -  1,61K      0  62,1M      0     38.5
    sdb1           -      -  1,58K      0  65,5M      0     41.5
    sdc1           -      -  1,57K      0  58,2M      0     37.1
    sdd1           -      -  1,45K      0  51,5M      0     35.5
    sde1           -      -  1,54K      0  63,6M      0     41.3
    sdf1           -      -  1,61K      0  66,8M      0     41.5
special            -      -      -      -      -      -
  mirror       14,5G  85,0G  6,77K      0  38,5M      0     5.7
    nvme0n1p1      -      -  6,66K      0  36,0M      0     5.4
    sdg1           -      -     63      0  1,34M      0     21.3
    sdh1           -      -     55      0  1,19M      0     21.6
-------------  -----  -----  -----  -----  -----  -----
richardelling commented 3 years ago

Preferred read for mirrors has been around for a long time (20+ years) on various systems. There are several ZFS forks that implemented various methods. It is easier today because there are per-vdev ZAPs that can store the preferences.

For automated preferences, you can't use IOPS, I/O size, queue depth, or bandwidth for these calculations, because there is no correlation between any of those and latency for modern disks. The good news is that we already measure latency for each operation, so it is a relatively simple matter to keep a weighted, rolling average of the latency to each vdev in the mirror and bias accordingly. Even a simple, inexpensive algorithm like lat = (lat >> 2) + newlat - (newlat >> 2) can work.
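A minimal sketch of such a fixed-point rolling average; the struct and function names are illustrative, not the actual OpenZFS code:

#include <stdint.h>

/*
 * Hypothetical sketch only. Keep a per-child exponentially weighted
 * rolling average of read latency, updated on every read completion.
 * With the shifts below the newest sample gets weight 3/4 and the old
 * average 1/4; the shift amount is a tunable trade-off between how
 * quickly the average reacts and how noisy it is.
 */
typedef struct mirror_child_stats {
	uint64_t mcs_lat_avg_ns;	/* rolling average read latency (ns) */
} mirror_child_stats_t;

static void
mirror_child_latency_update(mirror_child_stats_t *mcs, uint64_t newlat_ns)
{
	uint64_t lat = mcs->mcs_lat_avg_ns;

	if (lat == 0)		/* first sample: take it as-is */
		lat = newlat_ns;
	else
		lat = (lat >> 2) + newlat_ns - (newlat_ns >> 2);

	mcs->mcs_lat_avg_ns = lat;
}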

behlendorf commented 3 years ago

Preferred reads are in fact what ZFS does today: the mirror child with the smallest "load" is selected when reading. Currently we use active queue depth as the measure of load, but as @richardelling noted, that doesn't really correlate with latency on modern disks. It would be interesting to update the code to use a rolling average of latency and see how that performs.
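A minimal sketch of what that could look like, building on the rolling-average update above; this is not the actual vdev_mirror.c selection code, and all names are made up:

#include <stdint.h>

/*
 * Hypothetical sketch only. Instead of ranking mirror children purely
 * by outstanding I/O count ("load"), rank them by expected wait: the
 * rolling average latency scaled by the I/Os already queued there.
 */
typedef struct mirror_child_state {
	int	 mc_readable;		/* healthy, can serve reads */
	uint32_t mc_queued;		/* outstanding I/Os on this child */
	uint64_t mc_lat_avg_ns;		/* rolling average latency (ns) */
} mirror_child_state_t;

static int
mirror_child_select(const mirror_child_state_t *mc, int children)
{
	int c, best = -1;
	uint64_t wait, best_wait = UINT64_MAX;

	for (c = 0; c < children; c++) {
		if (!mc[c].mc_readable)
			continue;
		wait = (uint64_t)(mc[c].mc_queued + 1) * mc[c].mc_lat_avg_ns;
		if (wait < best_wait) {
			best_wait = wait;
			best = c;
		}
	}
	return (best);	/* -1 if no child can serve the read */
}

Under such a scheme a low-latency device that is momentarily saturated would lose its preference naturally, loosely analogous to the zpool iostat output above where the slower special-vdev members pick up a few reads under parallel load.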

complyue commented 2 years ago

Hello, I come from Illumos ZFS (on SmartOS) and have been pondering a similar idea; Illumos ZFS does not even seem to have caught up with the N-way mirror read improvements yet.

See my post to illumos-discuss for the background.

I have an even more radical idea to improve the situation: refactor the I/O queue hierarchy so that every member disk (leaf vdev) claims and performs I/O tasks from a shared queue associated with the mirror, instead of each member disk having its own I/O queue. That way, faster members simply race to process more I/O requests and every device's potential can be fully used, without special-casing their nature, i.e. regardless of whether they are Optane, ordinary SSD, or HDD.

I suppose such an idea is trivial to implement in theory, but refactoring a mature, stable architecture (the I/O scheduler and its data structures) can have unexpected complications; at the very least it would need more effort to validate and battle-test.
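A minimal sketch of the shared-queue idea, assuming one shared FIFO per mirror and one issue loop per member disk (this does not reflect the actual ZIO pipeline or its locking):

#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical sketch only. All pending reads for a mirror sit in one
 * shared FIFO; each member disk's issue loop claims the next request
 * as soon as the disk is idle, so faster devices naturally end up
 * doing more of the work without any rotational/latency hints.
 */
typedef struct read_req {
	struct read_req	*rr_next;
	/* offset, size, buffer, completion callback, ... */
} read_req_t;

typedef struct mirror_queue {
	pthread_mutex_t	  mq_lock;
	read_req_t	 *mq_head;
	read_req_t	**mq_tailp;	/* enqueue side (not shown) appends here */
} mirror_queue_t;

/* Called from a member disk's issue loop whenever that disk is idle. */
static read_req_t *
mirror_queue_claim(mirror_queue_t *mq)
{
	read_req_t *rr;

	pthread_mutex_lock(&mq->mq_lock);
	rr = mq->mq_head;
	if (rr != NULL) {
		mq->mq_head = rr->rr_next;
		if (mq->mq_head == NULL)
			mq->mq_tailp = &mq->mq_head;
	}
	pthread_mutex_unlock(&mq->mq_lock);
	return (rr);	/* NULL: nothing pending for this mirror */
}

One practical caveat with this pull model is that a request claimed by a slow device cannot later be taken over by a fast one, so very slow members could still add tail latency unless claimed requests can be re-queued or duplicated.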

Anyway, what do you think about it?


Actually, I have this idea in mind not only to improve read performance but also write performance, with an opt-in compromise on redundancy at write completion. That is, I would like an option to consider the write of a payload block complete once it is persisted on a single disk, with persisting it to the other member disks handled as lower-priority tasks, ideally postponed to idle times. This strategy could push peak write performance well beyond what a full-redundancy guarantee for each write request allows. But I guess it would need an even bigger refactor of the I/O scheduler on the write side.