openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs compression slow on nvme #8749

Closed dangeReis closed 3 years ago

dangeReis commented 5 years ago

System information

Type                  Version/Name
Distribution Name     amazon linux
Distribution Version  2
Linux Kernel          4.14.114-105.126
Architecture          x86_64
ZFS Version           0.7.13-1
SPL Version           0.7.13-1

Describe the problem you're observing

I'm seeing an order of magnitude slower write speeds from ZFS with compression enabled versus no compression.

Describe how to reproduce the problem

This machine has two 7.5TB NVMe drives. Here is fio against a single raw drive with a block size of 128k.

# fio --filename=/dev/nvme1n1  --sync=1 --rw=write --bs=128k --numjobs=20 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test --size=1G
journal-test: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 20 processes
Jobs: 20 (f=20): [W(20)] [100.0% done] [0KB/1368MB/0KB /s] [0/10.1K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=20): err= 0: pid=27830: Wed May 15 23:29:27 2019
  write: io=97989MB, bw=1633.2MB/s, iops=13064, runt= 60002msec
    clat (usec): min=139, max=8666, avg=1527.44, stdev=465.46
     lat (usec): min=140, max=8671, avg=1530.05, stdev=465.06
    clat percentiles (usec):
     |  1.00th=[  314],  5.00th=[  516], 10.00th=[  716], 20.00th=[ 1096],
     | 30.00th=[ 1608], 40.00th=[ 1688], 50.00th=[ 1704], 60.00th=[ 1816],
     | 70.00th=[ 1832], 80.00th=[ 1832], 90.00th=[ 1832], 95.00th=[ 1848],
     | 99.00th=[ 2128], 99.50th=[ 2384], 99.90th=[ 3120], 99.95th=[ 3376],
     | 99.99th=[ 6304]
    lat (usec) : 250=0.29%, 500=4.38%, 750=6.16%, 1000=6.66%
    lat (msec) : 2=81.03%, 4=1.45%, 10=0.03%
  cpu          : usr=0.40%, sys=7.44%, ctx=5785512, majf=0, minf=228
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=783915/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=97989MB, aggrb=1633.2MB/s, minb=1633.2MB/s, maxb=1633.2MB/s, mint=60002msec, maxt=60002msec

Disk stats (read/write):
  nvme1n1: ios=193/1569501, merge=0/28375456, ticks=80/2158236, in_queue=2092716, util=100.00%

As you can see, a single drive can do over 1.5GB/s with this configuration.

Let's create a pool with ashift=12

# zpool create tank /dev/nvme1n1 /dev/nvme2n1 -o ashift=12

Now let's test again with the same block size.

# fio --filename=/tank/test  --sync=1 --rw=write --bs=128k --numjobs=20 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test --size=1G
journal-test: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 20 processes
journal-test: Laying out IO file(s) (1 file(s) / 1024MB)
Jobs: 20 (f=20): [W(20)] [100.0% done] [0KB/1722MB/0KB /s] [0/13.8K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=20): err= 0: pid=28857: Wed May 15 23:32:32 2019
  write: io=138055MB, bw=2300.9MB/s, iops=18406, runt= 60003msec
    clat (usec): min=309, max=12987, avg=1084.57, stdev=343.18
     lat (usec): min=310, max=12988, avg=1085.83, stdev=343.20
    clat percentiles (usec):
     |  1.00th=[  660],  5.00th=[  716], 10.00th=[  748], 20.00th=[  796],
     | 30.00th=[  844], 40.00th=[  900], 50.00th=[ 1004], 60.00th=[ 1128],
     | 70.00th=[ 1304], 80.00th=[ 1432], 90.00th=[ 1544], 95.00th=[ 1576],
     | 99.00th=[ 1912], 99.50th=[ 2096], 99.90th=[ 2544], 99.95th=[ 4576],
     | 99.99th=[ 8640]
    lat (usec) : 500=0.01%, 750=9.76%, 1000=39.90%
    lat (msec) : 2=49.61%, 4=0.68%, 10=0.05%, 20=0.01%
  cpu          : usr=0.28%, sys=7.20%, ctx=1677982, majf=0, minf=196
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1104438/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=138055MB, aggrb=2300.9MB/s, minb=2300.9MB/s, maxb=2300.9MB/s, mint=60003msec, maxt=60003msec

2.3GB/s. Not quite double, but very respectable.

Let's try with compression.

# zfs create tank/compressed
# zfs set compression=on tank/compressed
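
One variable worth pinning down: depending on the ZFS version, compression=on may select lzjb rather than lz4, so it can help to set the algorithm explicitly and confirm what is actually in effect, e.g.:

# zfs set compression=lz4 tank/compressed
# zfs get compression,compressratio tank/compressed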

# fio --filename=/tank/compressed/test  --sync=1 --rw=write --bs=128k --numjobs=20 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test --size=1G
journal-test: (g=0): rw=write, bs=128K-128K/128K-128K/128K-128K, ioengine=psync, iodepth=1
...
fio-2.14
Starting 20 processes
Jobs: 20 (f=20): [W(20)] [100.0% done] [0KB/2262MB/0KB /s] [0/18.1K/0 iops] [eta 00m:00s]
journal-test: (groupid=0, jobs=20): err= 0: pid=7063: Wed May 15 23:37:50 2019
  write: io=138791MB, bw=2313.1MB/s, iops=18504, runt= 60002msec
    clat (usec): min=535, max=13064, avg=1078.59, stdev=206.50
     lat (usec): min=536, max=13067, avg=1079.99, stdev=206.53
    clat percentiles (usec):
     |  1.00th=[  956],  5.00th=[  980], 10.00th=[  996], 20.00th=[ 1012],
     | 30.00th=[ 1032], 40.00th=[ 1048], 50.00th=[ 1064], 60.00th=[ 1080],
     | 70.00th=[ 1096], 80.00th=[ 1112], 90.00th=[ 1160], 95.00th=[ 1192],
     | 99.00th=[ 1304], 99.50th=[ 1528], 99.90th=[ 2192], 99.95th=[ 6304],
     | 99.99th=[ 9152]
    lat (usec) : 750=0.01%, 1000=12.51%
    lat (msec) : 2=87.30%, 4=0.12%, 10=0.07%, 20=0.01%
  cpu          : usr=0.29%, sys=7.28%, ctx=1695106, majf=0, minf=205
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=1110324/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: io=138791MB, aggrb=2313.1MB/s, minb=2313.1MB/s, maxb=2313.1MB/s, mint=60002msec, maxt=60002msec

So we're still getting about 2.4GB/s, but iotop says:

Total DISK READ :       0.00 B/s | Total DISK WRITE :       2.34 G/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     822.68 M/s

# zpool iostat -v 1

              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        1.98G  13.6T      0  15.9K      0   896M
  nvme1n1   1.24G  6.81T      0  8.05K      0   455M
  nvme2n1    756M  6.81T      0  7.85K      0   442M
----------  -----  -----  -----  -----  -----  -----

So fio is still pushing about 2.4GB/s of uncompressed data, but once it's compressed, only about 900MB/s actually makes it to the disks.
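
One way to check whether the compression step itself is the bottleneck (assuming the sysstat package is installed) is to watch per-core CPU utilization while the compressed fio run is in flight:

# mpstat -P ALL 1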

# zfs get all tank/compressed
NAME             PROPERTY              VALUE                  SOURCE
tank/compressed  type                  filesystem             -
tank/compressed  creation              Wed May 15 23:34 2019  -
tank/compressed  used                  513M                   -
tank/compressed  available             13.2T                  -
tank/compressed  referenced            513M                   -
tank/compressed  compressratio         1.99x                  -
tank/compressed  mounted               yes                    -
tank/compressed  quota                 none                   default
tank/compressed  reservation           none                   default
tank/compressed  recordsize            128K                   default
tank/compressed  mountpoint            /tank/compressed       default
tank/compressed  sharenfs              off                    default
tank/compressed  checksum              on                     default
tank/compressed  compression           on                     local
tank/compressed  atime                 on                     default
tank/compressed  devices               on                     default
tank/compressed  exec                  on                     default
tank/compressed  setuid                on                     default
tank/compressed  readonly              off                    default
tank/compressed  zoned                 off                    default
tank/compressed  snapdir               hidden                 default
tank/compressed  aclinherit            restricted             default
tank/compressed  createtxg             381                    -
tank/compressed  canmount              on                     default
tank/compressed  xattr                 on                     default
tank/compressed  copies                1                      default
tank/compressed  version               5                      -
tank/compressed  utf8only              off                    -
tank/compressed  normalization         none                   -
tank/compressed  casesensitivity       sensitive              -
tank/compressed  vscan                 off                    default
tank/compressed  nbmand                off                    default
tank/compressed  sharesmb              off                    default
tank/compressed  refquota              none                   default
tank/compressed  refreservation        none                   default
tank/compressed  guid                  94744297835323235      -
tank/compressed  primarycache          all                    default
tank/compressed  secondarycache        all                    default
tank/compressed  usedbysnapshots       0B                     -
tank/compressed  usedbydataset         513M                   -
tank/compressed  usedbychildren        0B                     -
tank/compressed  usedbyrefreservation  0B                     -
tank/compressed  logbias               latency                default
tank/compressed  dedup                 off                    default
tank/compressed  mlslabel              none                   default
tank/compressed  sync                  standard               default
tank/compressed  dnodesize             legacy                 default
tank/compressed  refcompressratio      1.99x                  -
tank/compressed  written               513M                   -
tank/compressed  logicalused           1.00G                  -
tank/compressed  logicalreferenced     1.00G                  -
tank/compressed  volmode               default                default
tank/compressed  filesystem_limit      none                   default
tank/compressed  snapshot_limit        none                   default
tank/compressed  filesystem_count      none                   default
tank/compressed  snapshot_count        none                   default
tank/compressed  snapdev               hidden                 default
tank/compressed  acltype               off                    default
tank/compressed  context               none                   default
tank/compressed  fscontext             none                   default
tank/compressed  defcontext            none                   default
tank/compressed  rootcontext           none                   default
tank/compressed  relatime              off                    default
tank/compressed  redundant_metadata    all                    default
tank/compressed  overlay               off                    default

# zdb
tank:
    version: 5000
    name: 'tank'
    state: 0
    txg: 4
    pool_guid: 9038284836834321361
    errata: 0
    hostname: 'ip-172-31-13-249.ec2.internal'
    com.delphix:has_per_vdev_zaps
    vdev_children: 2
    vdev_tree:
        type: 'root'
        id: 0
        guid: 9038284836834321361
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 3742676939223884714
            path: '/dev/nvme1n1p1'
            devid: 'nvme-Amazon_EC2_NVMe_Instance_Storage_AWS1FB48757DDDADF624-ns-1-part1'
            phys_path: 'pci-0000:00:1e.0-nvme-1'
            whole_disk: 1
            metaslab_array: 259
            metaslab_shift: 36
            ashift: 12
            asize: 7499985453056
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_leaf: 129
            com.delphix:vdev_zap_top: 130
        children[1]:
            type: 'disk'
            id: 1
            guid: 17687303084345738590
            path: '/dev/nvme2n1p1'
            devid: 'nvme-Amazon_EC2_NVMe_Instance_Storage_AWS114DA56DDCAFA96B0-ns-1-part1'
            phys_path: 'pci-0000:00:1f.0-nvme-1'
            whole_disk: 1
            metaslab_array: 256
            metaslab_shift: 36
            ashift: 12
            asize: 7499985453056
            is_log: 0
            create_txg: 4
            com.delphix:vdev_zap_leaf: 131
            com.delphix:vdev_zap_top: 132
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data

# zpool get all tank
NAME  PROPERTY                       VALUE                          SOURCE
tank  size                           13.6T                          -
tank  capacity                       0%                             -
tank  altroot                        -                              default
tank  health                         ONLINE                         -
tank  guid                           9038284836834321361            -
tank  version                        -                              default
tank  bootfs                         -                              default
tank  delegation                     on                             default
tank  autoreplace                    off                            default
tank  cachefile                      -                              default
tank  failmode                       wait                           default
tank  listsnapshots                  off                            default
tank  autoexpand                     off                            default
tank  dedupditto                     0                              default
tank  dedupratio                     1.00x                          -
tank  free                           13.6T                          -
tank  allocated                      2.31G                          -
tank  readonly                       off                            -
tank  ashift                         12                             local
tank  comment                        -                              default
tank  expandsize                     -                              -
tank  freeing                        0                              -
tank  fragmentation                  0%                             -
tank  leaked                         0                              -
tank  multihost                      off                            default
tank  feature@async_destroy          enabled                        local
tank  feature@empty_bpobj            active                         local
tank  feature@lz4_compress           active                         local
tank  feature@multi_vdev_crash_dump  enabled                        local
tank  feature@spacemap_histogram     active                         local
tank  feature@enabled_txg            active                         local
tank  feature@hole_birth             active                         local
tank  feature@extensible_dataset     active                         local
tank  feature@embedded_data          active                         local
tank  feature@bookmarks              enabled                        local
tank  feature@filesystem_limits      enabled                        local
tank  feature@large_blocks           enabled                        local
tank  feature@large_dnode            enabled                        local
tank  feature@sha512                 enabled                        local
tank  feature@skein                  enabled                        local
tank  feature@edonr                  enabled                        local
tank  feature@userobj_accounting     active                         local

With a compression ratio of 1.99x, I would've expected the disks to be writing at around 2GB/s and the data to be written at around 4GB/s.
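
Spelling that expectation out as rough arithmetic, using only the numbers above:

  physical ceiling : ~2.3GB/s    (what the pool sustained with compression off)
  expected logical : ~2GB/s x 1.99 compressratio ≈ 4GB/s
  observed         : ~2.3GB/s logical (fio), ~0.9GB/s physical (iotop / zpool iostat)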

Include any warning/errors/backtraces from the system logs

DanielSmedegaardBuus commented 5 years ago

Hi :)

What's the CPU in this system? I can't find the exact level ZFS uses for lz4 compression, but the references I'm reading seem to agree on roughly 500 MB/s of compression per CPU core. Comparing that to lz4 -b benchmarks on my current box (a somewhat dated 2GHz Core i7), that would put it at either level 1 or 2, since the numbers I get here match (~550 MB/s for both levels), while level 3 takes a nosedive to about a tenth of that speed, which rules it out.
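
(For reference, those lz4 -b figures come from the CLI's built-in benchmark mode; assuming the lz4 command-line tool is installed, a run across levels 1-3 looks something like this, with the file path just a placeholder:)

# lz4 -b1 -e3 /path/to/samplefile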

2.34 GB/s looks like four saturated cores that are slightly more performant than mine. If this is a 4-core setup you're running, I'm guessing that would be on par with expected performance? You mention Amazon Linux; keep in mind that on some AWS instance types a vCPU is a hyperthread on the host CPU, so you may have to lower your performance expectations if your instance type falls into that category.

BTW, I'm not an authority here, I just spotted your headline while looking for something else and found it interesting :D Sorry if I'm not contributing.

dangeReis commented 5 years ago

It's got 24 of these:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
stepping        : 4
microcode       : 0x200005e
cpu MHz         : 3114.812
cache size      : 33792 KB
physical id     : 0
siblings        : 24
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds
bogomips        : 5000.00
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

DanielSmedegaardBuus commented 5 years ago

Haha :D Cool!

Well, that ends my thoughts then, but I'll be following with interest.

Nice rig :)

DanielSmedegaardBuus commented 5 years ago

Wait, just... if it's okay that I ask: I just looked up the CPU. (EDIT) At first, Google made me believe it was a different model, putting your total CPU cost at $314,342.88. But even without that error, why are you throwing 288 CPU cores at a 15TB (or 7.5TB mirrored) zpool? Seems insane to me :/

Also, the only Google results I get for your model are from cloud providers. Are you actually on AWS? Perhaps something is going wrong on their end, then. I mean, 288 cores at 3.2 GHz with the performance you're reporting: even if it seems insane to throw that many cores at ZFS when four or eight would do, something is clearly off :)

EDIT 2: I just realized I'm being one of those people who ask "why'd you wanna do that?" when someone asks a legitimate question. Sorry for that. Just ignore me. I'll still be following with interest :)

dangeReis commented 5 years ago

I only have 24 CPU cores. This is on AWS.

Something somewhere seems to be topping out at 2.4GB/s.

Edit: I see where you're getting 288: you're multiplying 24 x 12. I don't think that's how it works. There are a total of 24 CPUs in /proc/cpuinfo, and in reality only half of those are real cores; the rest is hyperthreading.

DanielSmedegaardBuus commented 5 years ago

Ah. And no, I got the 288 from you saying "It's got 24 of these" and then showing the cpu info of a 12-core CPU :) A single one of those makes a lot more sense. But, I'll shut up now. I've already butted in way too much here. Apologies.

behlendorf commented 5 years ago

@dangeReis I'd suggest trying the same test without --sync=1 and seeing if your results improve. Forcing all the IO to be synchronous means that ZFS must perform the compression before it can return from the system call. Letting ZFS handle it asynchronously should allow it to be processed more efficiently by the IO threads and streamed to disk.
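
(For reference, that would just be the earlier command with the --sync=1 flag dropped, everything else unchanged:)

# fio --filename=/tank/compressed/test --rw=write --bs=128k --numjobs=20 --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test --size=1G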

stale[bot] commented 4 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.