jimfinlayson closed this issue 3 years ago.
Same behavior with O_DIRECT off. The settings in the zfs.conf below are everything I've tried so far to get writes to stream: caching 3 seconds of 8 GB/s streaming writes, separating metadata from data, keeping metadata and the free-space maps (the metaslab space maps, the ZFS term I was forgetting) cached, and keeping the ARC size from thrashing, assuming I can negotiate with my customer to keep 64 GB of memory for ZFS :)
# disable predictive prefetch
options zfs zfs_prefetch_disable=1
# allow zero prefetch streams per file
options zfs zfetch_max_streams=0
# keep metaslab space maps loaded in memory
options zfs metaslab_debug_unload=1
# reserve at least 16 GiB of ARC for metadata
options zfs zfs_arc_meta_min=17179869184
# commit a txg at least every 3 seconds
options zfs zfs_txg_timeout=3
# keep the ARC at or above 64 GiB
options zfs zfs_arc_min=68719476736
# leave the arc_p dampener enabled (0 = dampener active)
options zfs zfs_arc_p_dampener_disable=0
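For what it's worth, a quick sanity check that the module actually picked these up at load time is to read them back from sysfs (the paths below are the standard ZFS-on-Linux parameter locations; exact parameter names can vary between releases):

# confirm the running module sees the tunables from /etc/modprobe.d/zfs.conf
grep . /sys/module/zfs/parameters/zfs_prefetch_disable \
       /sys/module/zfs/parameters/zfs_txg_timeout \
       /sys/module/zfs/parameters/zfs_arc_min \
       /sys/module/zfs/parameters/zfs_arc_meta_min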
The 16 MB recordsize is likely the culprit. I don't think recordsizes north of 1 MB have ever worked quite right (I always saw read amplification in testing), and you're also writing 4 MB blocks. That means unless your writes are perfectly aligned and timed properly, you'll end up reading records back in to merge each 4 MB write into the existing 16 MB record at the end of every transaction group commit.
(I'm assuming here that the initial write works as you'd expect without any reads; I don't think fadvise is honored by ZFS.)
Further note: those are reads going to disk you're seeing there, not reads served from cache.
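One way to test that theory, assuming the pool/dataset naming used in the commands later in this issue (fs01 mounted under /fs01; adjust to your setup), is to compare a run whose block size matches the 16 MB recordsize against the 4 MB run:

# confirm the dataset recordsize
zfs get recordsize fs01

# same sequential write, but with bs equal to the recordsize; if the reads
# disappear, the sub-record read-modify-merge at txg commit is the cause
fio --ioengine=libaio --iodepth=4 --direct=1 --rw=write --bs=16m \
    --filesize=1t --nrfiles=1 --numjobs=1 \
    --name="fs01 sequential16m" --directory=/fs01/test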
Why is there read amplification if it is copy-on-write? Even with the 4 MB I/O size there should be enough cache, and in the worst case, if the txg timer kicks in, shouldn't zfs just do a 4MB+2p write?
Once a file has grown beyond the recordsize, every record in that file is logically stored at the full record size (except the last record in the file, if it is not aligned). So no matter what, if you're writing in sizes smaller than the record, you're going to see some reading to merge the partial writes into the existing records at the tail of each txg.
The read amplification itself comes (at least in the testing I did on 0.7.x) from the mismatch between the underlying BIO queues and the very large 16 MB logical records.
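If dropping the recordsize back to something known to behave well is an option, a minimal sketch of that follows (recordsize only applies to blocks written after the change, so the test file has to be recreated; the filename below is just what fio derives from the job name, so verify it first):

# fall back to a 1 MB recordsize on the dataset
zfs set recordsize=1M fs01

# existing blocks keep their old 16 MB records, so start from a fresh file
rm "/fs01/test/fs01 sequential01.0.0"   # fio-generated name, assumed; check with ls /fs01/test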
Hi, this is my first post on zfs, but I think I've found an issue I don't understand. I think I've completed the template accurately. I don't understand why zfs is doing reads when it should only be doing writes. If it is user error, I apologize and I'll slink back into my cave :)
System information
Describe the problem you're observing
I'm trying to optimize for high-bandwidth sequential writes using fio, and I'm seeing considerable read activity on the file system that I don't understand, even though fio opens the file with O_CREAT|O_RDWR and calls fadvise to tell the file system it isn't going to use the existing contents of the file. I have the filesystem's metadata moved to a special vdev, and that didn't change the behavior.
Describe how to reproduce the problem
I had an existing 1 TB file that fio created; when I run fio against it again, I see the read behavior I don't understand.
strace -ff -o strace.out /usr/local/bin/fio --iodepth=4 --ioengine=libaio --direct=1 --norandommap --nrfiles=1 --filesize=1t --group_reporting --randrepeat=1 --random_generator=tausworthe64 --file_service_type=sequential --bs=4m --rw=write --numjobs=1 --name="fs01 sequential01" --directory=/fs01/test
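To watch the reads show up during the run, a second terminal with per-vdev stats at a 1-second interval also shows whether the special vdev is the one being read (pool name as above):

zpool iostat -v fs01 1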
/etc/modprobe.d/zfs.conf
zfs get all fs01 | grep local
zpool get all fs01 | grep local
zpool status -v fs01
strace output
zpool iostat -q 1