openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zfs unnecessary reads to cache? 0.8.2-1, 4.18.0-80.el8.x86_64 SMP mod_unload modversions #9557

Closed jimfinlayson closed 3 years ago

jimfinlayson commented 5 years ago

Hi, this is my first post on zfs, but I think I've found an issue I don't understand. I think I've completed the template accurately. I don't understand why zfs is doing reads when it should be doing only writes. If it is user error, I apologize and I'll slink back into my cave :)

System information

Type Version/Name
Distribution Name Redhat
Distribution Version Red Hat Enterprise Linux release 8.0 (Ootpa)
Linux Kernel 4.18.0-80.el8.x86_64
Architecture x86_64
ZFS Version 0.8.2-1
SPL Version 0.8.2-1

Describe the problem you're observing

I'm trying to optimize for high-bandwidth sequential writes using fio, and I'm seeing considerable read activity on the file system that I don't understand, given that fio opens the file with O_CREAT|O_RDWR and calls fadvise to tell the file system it isn't going to use the existing contents of the file. I have the filesystem's metadata moved to a special vdev, and that didn't change the behavior.

Describe how to reproduce the problem

I had an existing 1 TB file that fio created on a previous run; when I run fio against it again, I see the read activity I don't understand.

strace -ff -o strace.out /usr/local/bin/fio --iodepth=4 --ioengine=libaio --direct=1 --norandommap --nrfiles=1 --filesize=1t --group_reporting --randrepeat=1 --random_generator=tausworthe64 --file_service_type=sequential --bs=4m --rw=write --numjobs=1 --name="fs01 sequential01" --directory=/fs01/test

/etc/modprobe.d/zfs.conf

options zfs zfs_vdev_aggregation_limit=16777216
options zfs zfetch_max_distance=65536
options zfs zfs_prefetch_disable=1
options zfs zfetch_max_streams=0
options zfs zfs_dirty_data_max=34359738368
options zfs zfs_delete_blocks=10
options zfs zfs_delay_scale=20000
options zfs zfs_vdev_async_read_max_active=16
options zfs zfs_vdev_sync_read_max_active=16
options zfs zfs_vdev_sync_read_min_active=16
options zfs metaslab_debug_unload=1
options zfs zfs_arc_meta_min=17179869184
options zfs zfs_vdev_async_write_active_min_dirty_percent=80
options zfs zfs_vdev_async_write_max_active=10
options zfs zfs_vdev_async_write_min_active=5
options zfs zfs_txg_timeout=3
options zfs zfs_vdev_scheduler=none
options zfs zfs_arc_max=68719476736
options zfs zfs_arc_min=68719476736
options zfs zfs_max_recordsize=16777216
options zfs zfs_arc_p_dampener_disable=0
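
For reference, one way to confirm these options actually took effect on the loaded module is to read them back from /sys/module/zfs/parameters. A minimal sketch, assuming the 0.8.x sysfs parameter names match the option names above:

# verify the modprobe options were picked up by the loaded zfs module
for p in zfs_arc_max zfs_arc_min zfs_dirty_data_max zfs_txg_timeout zfs_max_recordsize; do
    printf '%s = %s\n' "$p" "$(cat /sys/module/zfs/parameters/$p)"
done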

zfs get all fs01 | grep local

fs01  recordsize            16M                    local
fs01  atime                 off                    local
fs01  logbias               throughput             local
fs01  dedup                 off                    local

zpool get all fs01 | grep local

fs01  ashift                         12                             local
fs01  feature@async_destroy          enabled                        local
fs01  feature@empty_bpobj            enabled                        local
fs01  feature@lz4_compress           active                         local
fs01  feature@multi_vdev_crash_dump  enabled                        local
fs01  feature@spacemap_histogram     active                         local
fs01  feature@enabled_txg            active                         local
fs01  feature@hole_birth             active                         local
fs01  feature@extensible_dataset     active                         local
fs01  feature@embedded_data          active                         local
fs01  feature@bookmarks              enabled                        local
fs01  feature@filesystem_limits      enabled                        local
fs01  feature@large_blocks           active                         local
fs01  feature@large_dnode            enabled                        local
fs01  feature@sha512                 enabled                        local
fs01  feature@skein                  enabled                        local
fs01  feature@edonr                  enabled                        local
fs01  feature@userobj_accounting     active                         local
fs01  feature@encryption             enabled                        local
fs01  feature@project_quota          active                         local
fs01  feature@device_removal         enabled                        local
fs01  feature@obsolete_counts        enabled                        local
fs01  feature@zpool_checkpoint       enabled                        local
fs01  feature@spacemap_v2            active                         local
fs01  feature@allocation_classes     active                         local
fs01  feature@resilver_defer         enabled                        local
fs01  feature@bookmark_v2            enabled                        local

zpool status -v fs01

  pool: fs01
 state: ONLINE
  scan: none requested
config:

        NAME                      STATE     READ WRITE CKSUM
        fs01                      ONLINE       0     0     0
          raidz2-0                ONLINE       0     0     0
            slot000_K005ZA29CVMV  ONLINE       0     0     0
            slot001_K005ZA29DBWN  ONLINE       0     0     0
            slot002_K005ZA29D5Y8  ONLINE       0     0     0
            slot003_K005ZA29DPTK  ONLINE       0     0     0
            slot004_K005ZA29CX7M  ONLINE       0     0     0
            slot005_K005ZA29D65W  ONLINE       0     0     0
            slot006_K005ZA29D6ZJ  ONLINE       0     0     0
            slot007_K005ZA29DPLB  ONLINE       0     0     0
            slot008_K005ZA29DHBY  ONLINE       0     0     0
            slot009_K005ZA29DDAR  ONLINE       0     0     0
            slot010_K005ZA29DD3E  ONLINE       0     0     0
            slot011_K005ZA29DD67  ONLINE       0     0     0
            slot012_K005ZA29DPH6  ONLINE       0     0     0
            slot013_K005ZA29DBZ1  ONLINE       0     0     0
            slot014_K005ZA29D62G  ONLINE       0     0     0
            slot015_K005ZA29DPB0  ONLINE       0     0     0
            slot016_K005ZA29DBY2  ONLINE       0     0     0
            slot017_K005ZA29CFR2  ONLINE       0     0     0
            slot018_K005ZA29CWTC  ONLINE       0     0     0
            slot019_K005ZA29DJM3  ONLINE       0     0     0
            slot020_K005ZA29DH04  ONLINE       0     0     0
            slot021_K005ZA29DKAH  ONLINE       0     0     0
            slot022_K005ZA29DATC  ONLINE       0     0     0
            slot023_K005ZA29CXGE  ONLINE       0     0     0
        special 
          mirror-1                ONLINE       0     0     0
            slot096_K005ZA29DPTD  ONLINE       0     0     0
            slot097_K005ZA29D5PC  ONLINE       0     0     0

errors: No known data errors

strace output

openat(AT_FDCWD, "/fs01/test/fs01 sequential01.0.0", O_RDWR|O_CREAT|O_DIRECT, 0600) = 3
fadvise64(3, 0, 1099511627776, POSIX_FADV_DONTNEED) = 0
fadvise64(3, 0, 1099511627776, POSIX_FADV_SEQUENTIAL) = 0
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\230\224\303\341\230\17\325\34\223r\354Q\0211\366\5R\16\364\360+\35{\1\312\201\365v+kW\t"..., aio_nbytes=4194304, aio_offset=0}]) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\30\3\257/\21Uy\233c\340i\244\312\222\216\30\f\274[d$\200\241\37\201w\205\256\326\3755\25"..., aio_nbytes=4194304, aio_offset=4194304}]) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\0\0\200\0\0\0\0\0z\n\256_\376\3416\26O\301\270\306h\254\235\4)\230/\2700R\310\v"..., aio_nbytes=4194304, aio_offset=8388608}]) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="o*\345C)a\342\fM%E\2232\252P\n\251$\302\277\243\313\v\2\225\304\243e\223\231\316\4"..., aio_nbytes=4194304, aio_offset=12582912}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9ceb0, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\230\224\303\341\230\17\325\34\223r\354Q\0211\366\5R\16\364\360+\35{\1\312\201\365v+kW\t"..., aio_nbytes=4194304, aio_offset=16777216}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9cd30, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\30\3\257/\21Uy\233c\340i\244\312\222\216\30\f\274[d$\200\241\37\201w\205\256\326\3755\25"..., aio_nbytes=4194304, aio_offset=20971520}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9cb70, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\0\0\200\0\0\0\0\0z\n\256_\376\3416\26O\301\270\306h\254\235\4)\230/\2700R\310\v"..., aio_nbytes=4194304, aio_offset=25165824}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9c9b0, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\0\0\300\1\0\0\0\0M%E\2232\252P\n\251$\302\277\243\313\v\2\225\304\243e\223\231\316\4"..., aio_nbytes=4194304, aio_offset=29360128}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9ceb0, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\0\0\0\2\0\0\0\0\223r\354Q\0211\366\5R\16\364\360+\35{\1\312\201\365v+kW\t"..., aio_nbytes=4194304, aio_offset=33554432}]) = 1
io_getevents(0x7f9749e66000, 1, 1, [{data=0, obj=0x1a9cd30, res=4194304, res2=0}], NULL) = 1
io_submit(0x7f9749e66000, 1, [{aio_lio_opcode=IOCB_CMD_PWRITE, aio_fildes=3, aio_buf="\30\3\257/\21Uy\233c\340i\244\312\222\216\30\f\274[d$\200\241\37\201w\205\256\326\3755\25"..., aio_nbytes=4194304, aio_offset=37748736}]) = 1

zpool iostat -q 1

              capacity     operations     bandwidth    syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool        alloc   free   read  write   read  write   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    346    563   252M   397M      0      0      0      0      0      0    112     10      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    660    197   480M  1.28M      0     12      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    489      0   356M      0      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    226    674   165M  1013M      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    997      0   726M      0      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    608      0   443M      0      0     21      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    379  1.50K   276M  2.27G      0     10      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    748      0   544M      0      0     21      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    777      0   566M      0      0      2      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    391  1.33K   285M  2.19G      0      1      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    803      0   584M      0      0     22      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    865      0   630M      0      0      2      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    392  1.45K   285M  2.32G      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    932      0   678M      0      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    801      0   583M      0      0      1      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    293  1.58K   214M  2.41G      0     11      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    769      0   560M      0      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    780      0   568M      0      0      0      0      0      0      0      0      0      0      0      0      0
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
fs01        16.0T   210T    300  1.37K   219M  2.36G      0      3      0      0      0      0      0      0      0      0      0      0
jimfinlayson commented 5 years ago

Same behavior with O_DIRECT off. The settings in the zfs.conf below are all things I've tried in an attempt to get writes to stream. I'm trying to cache 3 seconds of 8 GB/s streaming writes, separate metadata from data, keep metadata and the free-space maps (I forget the exact zfs term) cached, and keep the ARC size from thrashing - if I can negotiate with my customer to keep 64 GB of memory for zfs :)

options zfs zfs_prefetch_disable=1
options zfs zfetch_max_streams=0
options zfs metaslab_debug_unload=1
options zfs zfs_arc_meta_min=17179869184
options zfs zfs_txg_timeout=3
options zfs zfs_arc_min=68719476736
options zfs zfs_arc_p_dampener_disable=0
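
As a rough sanity check on that sizing (8 GB/s over a 3-second txg is about 24 GB of dirty data, which fits under the 32 GiB zfs_dirty_data_max set above), the per-pool txgs kstat should show how much dirty data each txg carries and whether reads are being charged to txg sync. A sketch, assuming the 0.8.x kstat layout:

# watch dirty bytes and read/write activity accounted to each txg while fio runs
watch -n 1 'tail -n 5 /proc/spl/kstat/zfs/fs01/txgs'
# columns of interest: ndirty (dirty bytes in the txg), nread/nwritten (bytes),
# and reads/writes (I/O counts) issued while that txg synced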
thewacokid commented 5 years ago

16 MB recordsize is likely the culprit. I don't think it's ever worked quite right north of 1 MB (I always saw read amplification in testing) and you're also writing 4 MB blocks. That means unless your writes are perfectly aligned and timed properly, you'll end up reading records to merge the 4 MB write into the 16 MB existing record at the end of every transaction group commit.

(I'm assuming here that the initial write works as you'd expect without any reads; I don't think fadvise is honored by ZFS.)

As a further note, those are reads from disk you're seeing there, not from cache.
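
To put rough numbers on that merge cost, here is a back-of-the-envelope sketch assuming a partially dirtied 16 MB record has to be read back in full before it can be rewritten:

# worst case per partially-dirtied record at txg close
record_mb=16
write_mb=4
echo "up to $(( record_mb - write_mb )) MB of old data read per $write_mb MB written"
echo "i.e. up to $(( (record_mb - write_mb) * 100 / write_mb ))% read overhead for unaligned records"

In practice the ratio depends on how often a record is only partially dirty (or already evicted from the ARC) when a txg closes.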

jimfinlayson commented 5 years ago

Why is there read amplification if it is copy-on-write? Even with the 4 MB I/O size, there should be enough cache, and in the worst case, if the txg timer kicked in, shouldn't zfs just do a 4MB+2p write?


thewacokid commented 5 years ago

Once a file has grown beyond the recordsize, all records in that file are logically stored at the maximal size (except the last record in the file, if it is not aligned). So no matter what, if you're writing sub-record sizes, you're going to see some reading to merge the writes into the existing records at the tail of each txg.

The read amplification comes (at least in the testing I did on 0.7.x) from the mismatch between the underlying BIO queues and the very large 16 MB logical records.
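
One way to check whether the record/block-size mismatch is really the cause, as a suggested experiment rather than a verified fix: rerun the job with the I/O size matched to the 16 MB recordsize, or drop the dataset recordsize, and see whether the reads in zpool iostat go away.

# same job, but with bs matched to the 16 MB recordsize
fio --iodepth=4 --ioengine=libaio --direct=1 --bs=16m --rw=write --numjobs=1 \
    --filesize=1t --name="fs01 sequential01" --directory=/fs01/test
# or drop the dataset recordsize to 1M and repeat the original 4 MB-block job
zfs set recordsize=1M fs01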

stale[bot] commented 3 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.