openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Compression Results in Read Throttling #3106

Closed - sempervictus closed this 3 years ago

sempervictus commented 9 years ago

With the kmem changes and the updated ABD stack back in play, I've been doing some benchmarking and have come to the conclusion that something in our compressed read pipeline is amiss. When running tiotest with 8 threads of 200 MB each against a dataset compressed with LZJB or LZ4, reads (random and linear) are bound at about the 1480 MB/s mark. With GZIP that number bumps to ~1780 MB/s. Without compression we see read speeds of ~3 GB/s (while writes suffer tremendously).

What's odd is that the write throughput changes considerably based on how well the randomly generated bits/blocks compress, but the read performance stays the same despite more or less data having been compressed per run when it's time to read it back out.

The reason I'm filing an issue on this (as opposed to the write throughput on recent-gen SSDs when uncompressed) is that a year ago we were seeing read throughput at or over the 3 GB/s mark on LZ4. Write throughput was a fraction of what we're seeing now, so my guess is that the kmem or related changes have serialized access to something, or created contention within the read pipeline, which imposes this artificial limit.

It's not a showstopper in most use cases, but when backing big-data or HPC applications it may become a problem, since those systems were built to spec on I/O throughput, and a 50% reduction in compressed read speeds could have a significant impact on them if they were to upgrade for the other functionality coming in 0.6.4 or 0.6.5.

Here's a bit of substantiation:

user@host:/tmp/testing$ zfs list | grep testing
rpool/testing                 1016K   285G   592K  /tmp/testing
vpool/testing                   96K   883G    96K  /tmp/testing/vpool

user@host:/tmp/testing$ zfs set compression=lz4 rpool/testing
user@host:/tmp/testing$ tiotest -t 8 -f 200 -d .
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    3.9 s | 412.808 MB/s | 314.5 %  | 1055.0 % |
| Random Write   31 MBs |    0.6 s |  51.729 MB/s | 148.3 %  | 162.6 % |
| Read         1600 MBs |    1.1 s | 1455.229 MB/s | 780.5 %  | 3956.7 % |
| Random Read    31 MBs |    0.0 s | 1446.960 MB/s | 652.9 %  | 4037.6 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.018 ms |        3.373 ms |  0.00000 |   0.00000 |
| Random Write |        0.042 ms |        2.958 ms |  0.00000 |   0.00000 |
| Read         |        0.020 ms |        0.211 ms |  0.00000 |   0.00000 |
| Random Read  |        0.020 ms |        0.085 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.019 ms |        3.373 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

user@host:/tmp/testing$ zfs set compression=lzjb rpool/testing
user@host:/tmp/testing$ tiotest -t 8 -f 200 -d .
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    1.5 s | 1088.115 MB/s | 970.2 %  | 2970.6 % |
| Random Write   31 MBs |    0.2 s | 163.836 MB/s | 696.2 %  | 314.6 % |
| Read         1600 MBs |    1.1 s | 1474.501 MB/s | 833.6 %  | 4098.3 % |
| Random Read    31 MBs |    0.0 s | 1434.012 MB/s | 807.6 %  | 3923.5 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.019 ms |        2.749 ms |  0.00000 |   0.00000 |
| Random Write |        0.043 ms |        1.714 ms |  0.00000 |   0.00000 |
| Read         |        0.020 ms |        0.188 ms |  0.00000 |   0.00000 |
| Random Read  |        0.020 ms |        0.075 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.020 ms |        2.749 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

user@host:/tmp/testing$ zfs set compression=gzip rpool/testing
user@host:/tmp/testing$ tiotest -t 8 -f 200 -d .
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    1.4 s | 1137.448 MB/s | 1000.7 %  | 3235.5 % |
| Random Write   31 MBs |    0.2 s | 163.138 MB/s | 684.9 %  | 487.6 % |
| Read         1600 MBs |    0.9 s | 1779.692 MB/s | 846.1 %  | 2768.8 % |
| Random Read    31 MBs |    0.0 s | 1741.626 MB/s | 791.4 %  | 2652.8 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.020 ms |        2.677 ms |  0.00000 |   0.00000 |
| Random Write |        0.048 ms |        3.286 ms |  0.00000 |   0.00000 |
| Read         |        0.012 ms |        0.204 ms |  0.00000 |   0.00000 |
| Random Read  |        0.012 ms |        0.054 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.016 ms |        3.286 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

user@host:/tmp/testing$ zfs set compression=off rpool/testing
user@host:/tmp/testing$ tiotest -t 8 -f 200 -d .
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    8.8 s | 181.706 MB/s | 199.4 %  | 518.4 % |
| Random Write   31 MBs |    0.4 s |  88.412 MB/s | 192.4 %  | 531.6 % |
| Read         1600 MBs |    0.5 s | 2968.372 MB/s | 2239.1 %  | 2977.6 % |
| Random Read    31 MBs |    0.0 s | 2798.173 MB/s | 2722.1 %  | 2489.3 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.020 ms |        6.090 ms |  0.00000 |   0.00000 |
| Random Write |        0.046 ms |        7.085 ms |  0.00000 |   0.00000 |
| Read         |        0.008 ms |        0.271 ms |  0.00000 |   0.00000 |
| Random Read  |        0.009 ms |        0.061 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.014 ms |        7.085 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

The host generating these numbers has an E5-2680 v2 and a 12 GB ARC allocation (all reads are served from the ARC given the 1.6 GB of total data volume), with the backing storage being a LUKS/dm-crypt-encrypted Samsung 840 1 TB, not that it matters for the read problem.

Here's the current patch stack in testing (starts at bottom and builds up):

  * origin/pr/3078
  * origin/pr/3074
  * origin/pr/2970
  * origin/pr/2909
  * origin/pr/2890
  * origin/pr/2820
  * origin/pr/2784
  * origin/pr/2012
  * origin/pr/1099
  * origin/pr/2685
  * origin/pr/3063
  * origin/pr/3076
  * origin/pr/2129
  * master @ bf5efb5c66ac30442bded92c3299db36fe21d92f

behlendorf commented 9 years ago

@sempervictus it would be tremendously helpful if you could bisect where this performance regression was introduced, starting with 0.6.3 and including your full patch stack.
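
Roughly, the bisection could look like this (the refs and thresholds below are only illustrative; use whatever tiotest criterion reproduces the throttle on your system):

cd zfs
git bisect start
git bisect bad master           # reads capped near 1.4 GB/s here
git bisect good zfs-0.6.3       # reads were ~3 GB/s on this tag
# for each commit git checks out: rebuild, reload the modules, rerun
#   tiotest -t 8 -f 200 -d /tmp/testing
# and then mark the result:
#   git bisect good             # reads still ~3 GB/s
#   git bisect bad              # reads throttled
git bisect reset                # when done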

sempervictus commented 9 years ago

Timing's not ideal on that, but I'll try to carve out a few hours during the weekend (with Valentine's Day that's going to be less rather than more). The fun part is that pretty much all our VMs sit atop ZFS one way or another, with the regression already present in the underlying storage layers (iSCSI/NFS backing, local disks, etc.), since we're now running the kmem rework on most of our own systems. Systems running off the stable packages in the RHEL and Debian repos are somewhat "untouchable" for testing purposes, though I can tell you that none of our client CentOS/RHEL production boxes (running stable packages, backing multi-TB PostgreSQL clusters with ~4x compression ratio on LZ4) are experiencing this read throttle.

I must admit that I've only recently started tracking my patch stack in the changelog in a consistent manner - I was pretty sloppy in my documentation and git behavior when I was building dozens of different configurations early on (there may have been some anger-related "git branch -D" type stuff involved too).

Based on my notes, this happened sometime after late September, because I do have an annotation of reads going back to the ~3 GB/s mark when ABD was re-applied on Sep. 20th to a Xyratex iSCSI host, though the version number for my build conflicts - it looks like I reused a branch/version tag, or actually pushed an older build in hopes of having stable ZVOLs backing iSCSI (which could mean the regression occurred earlier).

For my own sanity, before I start hurting boxes and running a few dozen build/reboot cycles on hardware to avoid underlying ZFS problems affecting our VM-side ZFS benchmarks (or at least module unloads where we're not using root=ZFS): is anyone else seeing this mess? Yoh in IRC just posted something similar, but this looks to be a problem that profiling would have caught as soon as it came into the stack, presuming it's in master and not one of my many merged PRs (hard to tell, since we run progressive patch stacks where tested changes promote into staging->production builds, meaning there's a bunch of shared PR code among the different internal revs).

I'll keep you posted on what I find, and hopefully I can divine an actual commit hash for you. If anyone else can dive into the testing phase to determine which revision introduced this, it would help a lot, given the time constraints on our end and the need to use a small subset of our resources to avoid tainting the test results.

sempervictus commented 9 years ago

I'm being blocked by #3094 on this, pardon.

behlendorf commented 9 years ago

@kpande if you haven't already done so, you may want to jump back to stock 0.6.3 to get a baseline, just to verify that the test on your system will reproduce any regression. I'm also a bit surprised at the variability of the read rates when compression is disabled; any idea what's going on there?

behlendorf commented 9 years ago

0.6.3-1.2 includes build fixes for newer kernels.

sempervictus commented 9 years ago

@behlendorf: should this be tagged as a requirement for 0.6.4 so we don't throttle users who aren't reading the issues section when they naively upgrade, or when their package manager does it for them?

behlendorf commented 9 years ago

@sempervictus good question, I was just circling back to take a look at this. If I recall correctly, we were attempting to identify exactly where the lz4 read performance regression was introduced. Did we ever manage to narrow it down? Let me add this as an 0.6.4 blocker for now, and I'll try to do some testing myself and see if I can't reproduce the issue. Thanks for the reminder.

behlendorf commented 9 years ago

I was able to easily reproduce this. It was introduced somewhere between spl-0.6.3/zfs-0.6.3 and spl-0.6.3-12-gde2a22f / zfs-0.6.3-37-g5dbd68a. That exonerates the kmem-rework patches (which is a relief to me) and means it was introduced early in the 0.6.4 development cycle. I should be able to run it down the rest of the way this evening.

behlendorf commented 9 years ago

zfs-0.6.3-27-gb0bc7a8 is responsible: b0bc7a84d90dcbf5321d48c5b24ed771c5a128b0. This was a backport from illumos, and it's a pretty sizable change, so it's not immediately obvious what the issue is. Here's my data, which pretty clearly shows the cliff, on a pristine pool created with stock zfs-0.6.3 and no post-0.6.3 features enabled:

============ zfs-0.6.3-26-gfa86b5d (compression=lz4) ===============
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    1.7 s | 918.048 MB/s |  42.5 %  | 3547.9 % |
| Random Write   31 MBs |    0.3 s | 107.283 MB/s |  16.5 %  | 922.3 % |
| Read         1600 MBs |    0.6 s | 2610.561 MB/s |  92.8 %  | 6257.9 % |
| Random Read    31 MBs |    0.0 s | 2442.551 MB/s | 297.0 %  | 6244.2 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.018 ms |        1.499 ms |  0.00000 |   0.00000 |
| Random Write |        0.037 ms |        0.252 ms |  0.00000 |   0.00000 |
| Read         |        0.012 ms |        0.166 ms |  0.00000 |   0.00000 |
| Random Read  |        0.012 ms |        0.026 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.015 ms |        1.499 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

============ zfs-0.6.3-27-gb0bc7a8 (compression=lz4) ===============
Tiotest results for 8 concurrent io threads:
,----------------------------------------------------------------------.
| Item                  | Time     | Rate         | Usr CPU  | Sys CPU |
+-----------------------+----------+--------------+----------+---------+
| Write        1600 MBs |    1.5 s | 1078.935 MB/s |  53.5 %  | 4146.6 % |
| Random Write   31 MBs |    0.2 s | 192.761 MB/s |  54.9 %  | 1686.2 % |
| Read         1600 MBs |    1.2 s | 1358.959 MB/s |  44.6 %  | 6296.1 % |
| Random Read    31 MBs |    0.0 s | 1318.844 MB/s | 160.3 %  | 6148.0 % |
`----------------------------------------------------------------------'
Tiotest latency results:
,-------------------------------------------------------------------------.
| Item         | Average latency | Maximum latency | % >2 sec | % >10 sec |
+--------------+-----------------+-----------------+----------+-----------+
| Write        |        0.018 ms |       10.245 ms |  0.00000 |   0.00000 |
| Random Write |        0.038 ms |        0.393 ms |  0.00000 |   0.00000 |
| Read         |        0.023 ms |        0.235 ms |  0.00000 |   0.00000 |
| Random Read  |        0.023 ms |        0.061 ms |  0.00000 |   0.00000 |
|--------------+-----------------+-----------------+----------+-----------|
| Total        |        0.021 ms |       10.245 ms |  0.00000 |   0.00000 |
`--------------+-----------------+-----------------+----------+-----------'

behlendorf commented 9 years ago

Not yet, but it's a good idea. Maybe tomorrow.

dweeezil commented 9 years ago

@behlendorf Thanks for tracking this down to a particular commit. I've run into a bit of a brick wall while doing benchmarks on 175f4139. In particular, I'm not seeing nearly as good results as I'd expect from cached reads (running fio read tests with 100% cache hit rates and not using any compression at all). I'm now wondering whether whatever is behind this issue may also be causing my poor results relative to the raw memory bandwidth of the machines on which I'm testing, and relative to @prakashsurya's tests.

I'll take a closer look at it tomorrow. Although it's a big commit, a lot of it is clearly not interesting regarding this issue.

dweeezil commented 9 years ago

I might have an answer for some of the poor numbers I've been seeing in high-concurrency cached read tests. As of 0b39b9f, modules built with --enable-debug will have dbuf_verify() enabled. Not only can this kill performance, but it also causes sem->wait_lock to become a major contention point (particularly due to zfsonlinux/spl@46aa7b39 and zfsonlinux/spl@d503b971). As I mentioned in #3143, no modern x86_64 kernel will cause RWSEM_IS_LOCKED_TAKES_WAIT_LOCK to be defined, which means the slow inline version is essentially always used. The big stack of rwsem patches (torvalds/linux@c8de2fa) added in >= 3.10 kernels might play into this. In particular, I'm wondering whether the need for the extra locking may have been eliminated (no, I've not examined any of the rwsem changes yet).

In any case, I wanted to point out that when doing this type of performance testing with an --enable-debug module, it's probably a good idea to set zfs_debug=0 to disable the dbuf_verify().
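
Concretely, that would be something along these lines (assuming zfs_debug is exported as an ordinary module parameter, like the other tunables):

echo 0 > /sys/module/zfs/parameters/zfs_debug    # at runtime, as root
# or set it at module load time:
modprobe zfs zfs_debug=0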

dweeezil commented 9 years ago

@kpande I have a feeling I'm running into something completely unrelated. That said, however, I do intend to try my current testing regimen with and without compression.

behlendorf commented 9 years ago

OK, I have a partial explanation for this. The good news is that this probably doesn't impact many real workloads - at least, not unless your workload happens to depend on having lots of zero-filled ARC buffers cached. The bad news is that tiotest has been lying to you (kinda) and should be enhanced.

Here's the deal. By default, the data buffers written by tiotest are all zero-filled. When ZFS gets its hands on them it detects this and rightly converts them to holes. This significantly improves performance because ZFS includes special code and optimizations for handling these holes. What appears to have happened is that b0bc7a84d90dcbf5321d48c5b24ed771c5a128b0 accidentally dropped/broke some of these optimizations, the result being that reading cached holes is significantly slower than it used to be.
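
You can see the hole conversion for yourself with something like the following rough sketch (the dataset name and mountpoint are made up): with compression enabled, ZFS recognizes the all-zero blocks and allocates nothing for them.

zfs create -o compression=lz4 -o mountpoint=/tmp/holetest rpool/holetest
dd if=/dev/zero    of=/tmp/holetest/zeros  bs=1M count=100
dd if=/dev/urandom of=/tmp/holetest/random bs=1M count=100
sync                         # let the txg commit so space accounting settles
du -h /tmp/holetest/zeros /tmp/holetest/random
# 'zeros' shows almost nothing allocated (it became holes),
# while 'random' occupies the full 100M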

If you tweak tiotest to fill the first byte of each block with a non-zero character, thereby preventing ZFS from converting them to holes, you see more consistent behavior between 0.6.3 and master. There's still a little drop off but nothing nearly as dramatic.

Read Cache Performance (MB/s)      0.6.3      master
compression=off                 2692.243    2868.463
compression=lz4                 2102.624    1907.414

I was able to identify one of the hole optimizations which was dropped in b0bc7a84d90dcbf5321d48c5b24ed771c5a128b0. The dbuf_write_ready() function had a fast path removed, presumably to make the code more readable. However, that results in the caller holding the DB_DNODE_LOCK longer than needed, and it means there's more contention on the db->db_mtx, which hurts performance.

Restoring this optimization improved things but doesn't get the code quite back to what was observed in 0.6.3. A little more leg work is going to be needed but I've opened #3158 which takes this first step for review.

@dweeezil it would be great if you could take a careful look at the patch in question since you're already working on high-concurrency cached reads. That said the regressions here appear to be largely related to hole handling so it may not impact your testing much.

markdesouza commented 9 years ago

Sorry, I know this is not the most appropriate location for this, but isn't compression generally considered to improve performance (i.e. reduction in disk reads + time to decompress < full disk read)?

I assume the difference here is that the data is already cached and that the compressed blocks are cached as-is, so the reduction in read time from the cache < the time to decompress - hence the decrease in performance when reading compressed blocks from the cache.

To maintain the usefulness of compression as a performance option, shouldn't the decompressed blocks be stored in the cache, thereby making it effectively the same as a dataset with no compression?

behlendorf commented 9 years ago

@markdesouza you've struck on the last bit of this which isn't entirely clear to me yet. In fact, the ARC buffers are stored uncompressed in memory, and we see an excellent hit rate. So the compression method shouldn't have any impact on performance, yet it seems to.

What should be limiting performance is lock contention, specifically the dbuf mutex which covers these 128k ARC buffers. The various threads are accessing each buffer concurrently and reading 4k chunks, causing significant contention. Larger block sizes perform much better.
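
As a side note for anyone reproducing this, the hit rate is easy to confirm from the ARC kstats, roughly like so:

awk '/^hits /   {h=$3}
     /^misses / {m=$3}
     END {printf "ARC hit rate: %.2f%%\n", 100*h/(h+m)}' \
    /proc/spl/kstat/zfs/arcstats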

markdesouza commented 9 years ago

@behlendorf so the dbuf mutex locks the entire ARC buffer and not only the portion intended to be read? If it were the latter, I would have thought the contention would be lower as the block size was reduced. I'm also unsure why multiple readers of a buffer would contend at all; after all, multiple threads can read a block of memory, and the writers, or in this case eviction of the buffer, would be the only events that are delayed.

I'm keen to get a handle on the internals so I can start contributing back to the project.

dweeezil commented 9 years ago

@behlendorf I just performed some testing of your patch in #3158 and, although it does reduce the db->db_mtx contention, it makes minimal difference in an 8-thread I/O test. I use fio for my testing, and this was its configuration:

[rand]
        blocksize_range=4k-128k,4k-128k
        zero_buffers            # fill files with zeroes
        scramble_buffers=0      # don't perturb the zeroes
        verify=null             # zeroes survive writes
        size=100m
        numjobs=8
        create_serialize=1
        direct=0
        directory=/tank/fish
        group_reporting=1
        ioengine=psync          # psync,libaio
        rw=randrw               # read,write,trim,randread,randwrite,randtrim,rw,randrw
        thread=1
        time_based=1
        runtime=1m

This sets up a variable blocksize random read/write test with small files so it should be fully cached. I created the filesystem with compression=lz4 for the compressed test and used the default recordsize (128k) as well.
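
(For completeness, the compressed run was driven roughly like this; the dataset and job-file names are just examples matching the directory= line above:)

zfs create -o compression=lz4 tank/fish     # recordsize left at the 128k default
fio rand.fio                                # the [rand] job shown above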

Without the patch there were 2302344 contentions on db->db_mtx, and with the patch there were only 1576221. Strangely enough, however, the patch dropped my read speed from 996 MB/s to 848.3 MB/s and the write speed from 617 MB/s to 515.9 MB/s. I'm not sure how statistically significant those numbers are (I did not repeat the test).

I also performed a baseline measurement letting fio randomize the data and running on a filesystem without compression. The contentions were only slightly lower with the patch in this case and the I/O bandwidths were about identical.

Another interesting note is that since these files are simply one big hole, there are no indirect blocks and that, in and of itself, causes an artificial speedup. If I simply do echo -n x >> /tank/fish/<file> to each of the files used by fio, the performance drops about 10% because there are a bunch of indirect blocks to deal with.
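
(Roughly, that per-file tweak is just the following; the glob assumes fio's default jobname.jobnum.filenum naming:)

for f in /tank/fish/rand.*; do
    echo -n x >> "$f"      # a single non-zero byte forces real indirect blocks
done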

Other than that, while I don't think this patch (#3158) will do much for "real" workloads, it LGTM. As a complete side note, I regret not beefing up the commit comment in b0bc7a8 a bit more to at least mention that it added the hole_birth feature.

Finally, I'll note that I'm seeing a lot more contention on the dbuf_hash_table.hash_mutexes[] in these tests than I'd expect. I'm continuing to investigate whether the numbers are really out of line or whether there is some other regression at play.

behlendorf commented 9 years ago

@dweeezil thanks for taking a look at the patch and testing it out. I completely agree: I don't think it will make any difference for "real" workloads. So in that sense there isn't really a regression here, aside from what this specific benchmark reports, and given that, this really shouldn't be a blocker.

However, I definitely think there are still some mysteries here which need to be explained. For example, why does the compression setting matter at all? And why did you see worse performance in your test when lock contention was reduced? To answer those questions we'll need to dig into this more carefully. Perhaps your work on 175f413964bf1e33dd4b0e5047780b31cd30e046 will shed some light on this.

devsk commented 9 years ago

I have been running master for the last 10 days or so. I find my scrub speeds have decreased somewhat (roughly 3 hours instead of 2.5 hours for one pool, 9 hours instead of 8 for another; both use lz4 compression) compared to 0.6.3. Not sure if this bug is the culprit.

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.