openzfsonwindows / openzfs


Performance regression during rclone copy with preallocation enabled #318

Closed: EchterAgo closed this issue 9 months ago

EchterAgo commented 10 months ago

This seems to be a regression that started with 1ecc0cab9b7c2ecfb0ef448111f12afe54b7ba33. See #283 for some more context.

I just logged zpool iostat 5 during an 8-threaded rclone copy to a newly created pool, and it looks like there is some write amplification going on when blksz is not set to 4096.

iostat_512.txt iostat_4096.txt

Note that in the 512-byte case iostat reports ~80-100MB/s of IO with a high write operation count, but rclone itself only reports 5-10MB/s. I also noticed that it takes a long time to even get the rclone process to abort the copy.

In the 4096 case iostat reports ~160-240MB/s and a much lower write operation count, which is about what I would expect and have seen before 1ecc0cab9b7c2ecfb0ef448111f12afe54b7ba33. rclone itself also reports a very similar speed range as iostat during the copy. I can also abort rclone within seconds instead of minutes.

Specifying --local-no-preallocate on the rclone command line also makes it go fast again.

EchterAgo commented 10 months ago

What is weird is that so far this only seems to happen with rclone; nothing else seems to slow down. I tried a fio sequential write benchmark with 8 jobs and a 32KB block size (same as rclone) and got the expected ~200MB/s.

lundman commented 10 months ago

Most likely it uses sync writes to make sure the data has landed on disk. That would be the equivalent of sync=always on the dataset for the other tools.

EchterAgo commented 10 months ago

Wow, when I try rclone with 512 in blksz and preallocation disabled, it instantly hangs the virtual machine, and for some reason really hard: I can't even reset the machine in VMware, and the kernel debugger will not break. After a few minutes it recovers. I'm not seeing any IO errors on the hardware though, and I also tried restarting the machine :\

I'm not sure if this also happens with 4096 in blksz, but this used to work.

EchterAgo commented 10 months ago

So the system recovers after a while and the debugger breaks, but it did not reveal anything interesting. However, when disabling preallocation in rclone, the copy seems to proceed at a normal speed.

EchterAgo commented 10 months ago

Hmmm, this only seems to happen when using VMware; I don't see the same when using libvirt/QEMU. Actually, it works as expected when I don't use VMware Shared Folders as the copy source. It does still crash like in #319 when aborting the copy, though.

Can this performance regression be related to preallocation/sparse files? On 1ecc0cab9b7c2ecfb0ef448111f12afe54b7ba33 that is the only difference I see between a test run that is fast and one that is slow. When slow, it starts out fast but then drops over the course of a minute to < 10MB/s, and aborting takes minutes. When fast, it starts fast and stays fast, and aborting takes seconds.

When I set blksz to 4096 it does not matter if the test scripts use preallocation/sparse files or not.

I guess I should find out whether specifying only --local-no-preallocate or --local-no-sparse on the rclone command line makes a difference.

EchterAgo commented 10 months ago

It is definitely the --local-no-preallocate flag that makes the difference, not --local-no-sparse.

EchterAgo commented 10 months ago

I also see the same behavior on my second host that uses QEMU instead of VMWare.

lundman commented 10 months ago

OK so, you feel confident that setting z_blksz to ashift gives you a considerable performance boost. This is when we only set z_blksz to 4096 if it is 0; does it help if we remove the if part and always set it? Any corruption issues? Otherwise, let me add code to set it nicely.
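
In zfs_znode_alloc() that experiment would look roughly like this (just a sketch of the idea; zfs_blksz() is the helper already used there):

/* current: only derive the block size when the stored value is 0 */
if (zp->z_blksz == 0)
        zp->z_blksz = zfs_blksz(zp);

/* proposed experiment: drop the guard and always derive it */
zp->z_blksz = zfs_blksz(zp);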

lundman commented 10 months ago

Hmm, I wonder if it is just zfs_log_write - if we change blocksize there to ashift, with prealloc, will it go faster? That would be a much smaller change to upstream.

EchterAgo commented 10 months ago

OK so, you feel confident that setting z_blksz to ashift gives you a considerable performance boost.

That does definitely improve things. I'm unsure if it is the right thing to do; I just noticed performance drop sharply after 1ecc0cab9b7c2ecfb0ef448111f12afe54b7ba33. What is still weird is that I have been unable to measure this performance drop with anything other than rclone. I tried to recreate the rclone IO pattern using fio, but I'm not sure it actually supports preallocation on Windows. All my tests with fio so far showed expected performance levels. I'll have to check if it actually does preallocation.

This is when we only set z_blksz to 4096 if it is 0; does it help if we remove the if part and always set it? Any corruption issues? Otherwise, let me add code to set it nicely.

I'll try that.

lundman commented 10 months ago

Or try the blocksize in zfs_log_write() - then we don't change much on upstream.

EchterAgo commented 10 months ago

Setting blocksize only in zfs_log_write does make the copy go fast. I'll need to let it run for a few hours to confirm the memory issue is still fine, but I expect it will be.

diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
        zp->z_zfsvfs = zfsvfs;
        mutex_exit(&zfsvfs->z_znodes_lock);

-       if (zp->z_blksz == 0)
-               zp->z_blksz = zfs_blksz(zp);
-
        return (zp);
 }

diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index 613fcdb23..ad0dfec0d 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -634,6 +634,9 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
        uint64_t gen = 0;
        ssize_t size = resid;

+       if (blocksize == 0)
+               blocksize = 4096;
+
        if (zil_replaying(zilog, tx) || zp->z_unlinked ||
            zfs_xattr_owner_unlinked(zp)) {
                if (callback != NULL)
lundman commented 10 months ago

excellent, I'll do a commit to fish out the value properly when I wake

lundman commented 10 months ago

I moved it back into zfs_znode.c - there's too much change to zfs_log.c otherwise.

f0e86b4

EchterAgo commented 10 months ago

I think this is fixed.

lundman commented 10 months ago

I had to change the way we do this: running the same commit on macOS caused a panic in the removal tests. It is likely to be a bug in that code, see: https://github.com/openzfs/zfs/issues/12078

but I don't want to wait for that to be fixed, so let's handle the performance change differently, in just zfs_log_write(). This passes zfs-tests as before.

EchterAgo commented 9 months ago

I also just did some more tests; I doubt the performance issue actually comes from zfs_log_write. It is a separate issue that occurs when setting z_blksz to 512 in zfs_znode_alloc. When I just set the blocksize to 512 in zfs_log_write and remove the setting of zp->z_blksz in zfs_znode_alloc, then I also get the expected performance.

Good (as described above), going @ ~220MB/s:

PS C:\Windows\system32> C:\tools\zfs\zpool.exe iostat -r 5

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0    128      0      0      0    749      0      0      0      0      0      0      0
8K             12      0     28      0      0      0      0    189      0      0      0      0      0      0
16K             0      0      0      0      0      0      0     56      0      0      0      0      0      0
32K             0      0      0      0      0      0      0      9      0      0      0      0      0      0
64K            16      0     32      0      0      0      3      0      0      0      0      0      0      0
128K            0      0     16      0      0      0      8      4      0      0      0      0      0      0
256K            0      0      0      0      0      0     12     15      0      0      0      0      0      0
512K            0      0      0      0      0      0  1.10K     46      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0     17      0      0      0    105      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0     20      0      0      0      0      0      0
16K             0      0      0      0      0      0      0     10      0      0      0      0      0      0
32K             0      0      0      0      0      0      0      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      1      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      3      0      0      0      0      0      0      0
256K            0      0      0      0      0      0     28      2      0      0      0      0      0      0
512K            0      0      0      0      0      0    312      9      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0     22      0      0      0    160      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0     34      0      0      0      0      0      0
16K             0      0      0      0      0      0      1     11      0      0      0      0      0      0
32K             0      0      0      0      0      0      7      0      0      0      0      0      0      0
64K             0      0      0      0      0      0      8      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      8      3      0      0      0      0      0      0
256K            0      0      0      0      0      0     57      2      0      0      0      0      0      0
512K            0      0      0      0      0      0    365     25      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0     17      0      0      0    126      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      0     27      0      0      0      0      0      0
16K             0      0      0      0      0      0      0     10      0      0      0      0      0      0
32K             0      0      0      0      0      0      2      2      0      0      0      0      0      0
64K             0      0      0      0      0      0      3      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      2      0      0      0      0      0      0      0
256K            0      0      0      0      0      0      7      1      0      0      0      0      0      0
512K            0      0      0      0      0      0    417      2      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0     20      0      0      0    144      0      0      0      0      0      0      0
8K              0      0      0      0      0      0      3     26      0      0      0      0      0      0
16K             0      0      0      0      0      0      0      7      0      0      0      0      0      0
32K             0      0      0      0      0      0      1      2      0      0      0      0      0      0
64K             0      0      0      0      0      0      1      0      0      0      0      0      0      0
128K            0      0      0      0      0      0      3      0      0      0      0      0      0      0
256K            0      0      0      0      0      0     55      3      0      0      0      0      0      0
512K            0      0      0      0      0      0    329     37      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Now with only

if (zp->z_blksz == 0)
    zp->z_blksz = zfs_blksz(zp);

in zfs_znode_alloc, with setting blocksize removed in zfs_log_write.

Bad (starts at > 200MB/s, slows down over multiple seconds and ends up at a few hundred kilobytes per second):

PS C:\Windows\system32> C:\tools\zfs\zpool.exe iostat -r 5

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0    128      0      0      0  35.1K      0      0      0      0      0      0      0
8K             12      0     28      0      0      0  1.11K  16.8K      0      0      0      0      0      0
16K             0      0      0      0      0      0  1.07K  9.45K      0      0      0      0      0      0
32K             0      0      0      0      0      0      0  6.04K      0      0      0      0      0      0
64K            16      0     32      0      0      0      0  9.71K      0      0      0      0      0      0
128K            0      0     16      0      0      0      0  7.86K      0      0      0      0      0      0
256K            0      0      0      0      0      0      0    259      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      4      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0      2      0      0      0  8.03K      0      0      0      0      0      0      0
8K              0      0      0      0      0      0     29  1.95K      0      0      0      0      0      0
16K             0      0      0      0      0      0     29    324      0      0      0      0      0      0
32K             0      0      0      0      0      0      0     21      0      0      0      0      0      0
64K             0      0      0      0      0      0      0     14      0      0      0      0      0      0
128K            0      0      0      0      0      0      0      8      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      1      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0      0      0      0      0    314      0      0      0      0      0      0      0
8K              0      0      0      0      0      0     44    138      0      0      0      0      0      0
16K             0      0      0      0      0      0     34    128      0      0      0      0      0      0
32K             0      0      0      0      0      0      0    144      0      0      0      0      0      0
64K             0      0      0      0      0      0      0    289      0      0      0      0      0      0
128K            0      0      0      0      0      0      0    281      0      0      0      0      0      0
256K            0      0      0      0      0      0      0     13      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0      5      0      0      0    191      0      0      0      0      0      0      0
8K              0      0      0      0      0      0     22    109      0      0      0      0      0      0
16K             0      0      0      0      0      0     19     79      0      0      0      0      0      0
32K             0      0      0      0      0      0      0     62      0      0      0      0      0      0
64K             0      0      0      0      0      0      0    137      0      0      0      0      0      0
128K            0      0      0      0      0      0      0    444      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      7      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0      2      0      0      0    203      0      0      0      0      0      0      0
8K              0      0      0      0      0      0     33    123      0      0      0      0      0      0
16K             0      0      0      0      0      0     28     96      0      0      0      0      0      0
32K             0      0      0      0      0      0      0    114      0      0      0      0      0      0
64K             0      0      0      0      0      0      0    236      0      0      0      0      0      0
128K            0      0      0      0      0      0      0    340      0      0      0      0      0      0
256K            0      0      0      0      0      0      0     11      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------

Data          sync_read    sync_write    async_read    async_write      scrub         trim         rebuild
req_size      ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg    ind    agg
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
512             0      0      0      0      0      0      0      0      0      0      0      0      0      0
1K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2K              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4K              0      0      2      0      0      0    216      0      0      0      0      0      0      0
8K              0      0      0      0      0      0     30    115      0      0      0      0      0      0
16K             0      0      0      0      0      0     15     83      0      0      0      0      0      0
32K             0      0      0      0      0      0      0     90      0      0      0      0      0      0
64K             0      0      0      0      0      0      0    157      0      0      0      0      0      0
128K            0      0      0      0      0      0      0    461      0      0      0      0      0      0
256K            0      0      0      0      0      0      0      5      0      0      0      0      0      0
512K            0      0      0      0      0      0      0      0      0      0      0      0      0      0
1M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
2M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
4M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
8M              0      0      0      0      0      0      0      0      0      0      0      0      0      0
16M             0      0      0      0      0      0      0      0      0      0      0      0      0      0
------------------------------------------------------------------------------------------------------------
PS C:\Windows\system32>
EchterAgo commented 9 months ago

To make it clearer what I mean by good/bad, here are the code diffs against the latest zfs-Windows-2.2.0-release branch (eb7f88a64fe782d2b793d53b56f2bd9bacc9d694):

Good:

diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
        zp->z_zfsvfs = zfsvfs;
        mutex_exit(&zfsvfs->z_znodes_lock);

-       if (zp->z_blksz == 0)
-               zp->z_blksz = zfs_blksz(zp);
-
        return (zp);
 }

diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..8210e3a94 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -646,8 +646,8 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
        }

 #ifdef _WIN32
-       if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-               blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
+       if (blocksize == 0)
+               blocksize = 512;
 #endif

        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)

Bad:

diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..70b16e373 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -645,11 +645,6 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
                return;
        }

-#ifdef _WIN32
-       if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-               blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
-#endif
-
        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
                write_state = WR_INDIRECT;
        else if (!spa_has_slogs(zilog->zl_spa) &&

I'm pretty sure this is not about zfs_log_write.

EchterAgo commented 9 months ago

Also, the zfs_log_write infinite loop only happened after having already copied 2.2TB, whereas this performance issue happens pretty much directly after starting the copy, and only when preallocation is enabled.

lundman commented 9 months ago

OK OK, sorry, so you are saying that performance-wise you want the change in zfs_znode.c? I thought zfs_log_write was the slow issue, and it shouldn't be 0 if the new code is there, since it should set it.

EchterAgo commented 9 months ago

Yes, it seems that performance-wise it only makes a difference if it is set in zfs_znode_alloc.

I also just noticed something else interesting regarding this, give me a few minutes to investigate.

EchterAgo commented 9 months ago

Ok, never mind, I thought performance recovers, but now after 30 minutes it is still slow. It starts out fast, then perf slowly falls:

Outdated, see https://github.com/openzfsonwindows/openzfs/issues/318#issuecomment-1806859441

lundman commented 9 months ago

ok, so? go back to znode_alloc code, and just make sure blocksize is not 0?

EchterAgo commented 9 months ago

So, setting z_blksz in zfs_znode_alloc to 512 or zfs_blksz when it is zero causes the performance issue. 4096 works fast, and leaving it at 0 also works fast. This is independent of the infinite loop in zfs_log_write.

This also means that the latest zfs-Windows-2.2.0-release branch has performance issues again.

EchterAgo commented 9 months ago

ok, so? go back to znode_alloc code, and just make sure blocksize is not 0?

I think it should be enough to set blocksize to a reasonable value in zfs_log_write if it is 0. No need to set z_blksz in zfs_znode_alloc.

lundman commented 9 months ago

so you are saying it's enough to not change anything in znode? and just change zfs_log?

EchterAgo commented 9 months ago

so you are saying it's enough to not change anything in znode? and just change zfs_log?

Yes

EchterAgo commented 9 months ago
diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
        zp->z_zfsvfs = zfsvfs;
        mutex_exit(&zfsvfs->z_znodes_lock);

-       if (zp->z_blksz == 0)
-               zp->z_blksz = zfs_blksz(zp);
-
        return (zp);
 }

diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..8210e3a94 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -646,8 +646,8 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
        }

 #ifdef _WIN32
-       if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-               blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
+       if (blocksize == 0)
+               blocksize = 512;
 #endif

        if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)

basically this

lundman commented 9 months ago

But, this

-       if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-               blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;

should work: if blocksize is 0, it should set it to something greater than that?

EchterAgo commented 9 months ago

That works too, but by setting it fixed you don't need spa_min_alloc, if that is a concern.

lundman commented 9 months ago

yeah, but I thought we wanted to set it to ashift, so that it is faster.

EchterAgo commented 9 months ago

My tests showed that it does not matter what we set in zfs_log_write, 512 works just as fast.

The performance difference comes only from setting z_blksz in zfs_znode_alloc, unless (as in my case) it is set to 4096.

lundman commented 9 months ago

ok so then, check for blocksize==0, and set it to, say, SPA_MINBLOCKSIZE or whatever it's called.

then in znode, we want to set it to ashift?

EchterAgo commented 9 months ago

ok so then, check for blocksize==0, and set it to, say, SPA_MINBLOCKSIZE or whatever it's called.

then in znode, we want to set it to ashift?

If we have the blocksize==0 check in zfs_log_write we don't need it in zfs_znode_alloc, so I'd just remove it there. Or we need to investigate why the performance drops when setting it wrong in zfs_znode_alloc.
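
For reference, the zfs_log_write guard being discussed would look roughly like this (a sketch only; SPA_MINBLOCKSIZE is upstream's 512-byte minimum block size constant):

#ifdef _WIN32
	/*
	 * Sketch of the discussed check: never let a zero blocksize
	 * reach the record-splitting loop. SPA_MINBLOCKSIZE (512) is
	 * used here; per the tests above, the exact value does not
	 * matter for performance as long as it is non-zero.
	 */
	if (blocksize == 0)
		blocksize = SPA_MINBLOCKSIZE;
#endif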

EchterAgo commented 9 months ago

It also seems clear that zfs_log_write was not written with z_blksz == 0 in mind, so maybe we should investigate why it goes to 0 in the first place. Maybe there are more assumptions like this in the code.

lundman commented 9 months ago

ok, if it's left as 0, then zfs_write() tunes it up, usually by powers of 2. So we just need a 0 test in zfs_log. Understood.

What is unusual is the path it takes in zfs_log_write(). Linux etc. does not generally enter WR_INDIRECT before zfs_write(), but Windows does if it preallocates. So they most likely never noticed.
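
For context, the record loop in upstream zfs_log_write() looks roughly like this (a simplified paraphrase, not the exact 2.2.0 code; the itx setup and surrounding bookkeeping are omitted):

	uint32_t blocksize = zp->z_blksz;	/* 0 here for a freshly preallocated file */

	while (resid) {
		itx_wr_state_t wr_state = write_state;
		ssize_t len = resid;

		if (wr_state == WR_COPIED &&
		    resid > zil_max_copied_data(zilog))
			wr_state = WR_NEED_COPY;
		else if (wr_state == WR_INDIRECT)
			/* a WR_INDIRECT record may only cover one file block */
			len = MIN(blocksize - P2PHASE(off, blocksize), resid);

		/* ... allocate and fill an itx for [off, off + len) ... */

		off += len;
		resid -= len;
	}

If that paraphrase is right, it would explain both symptoms: with blocksize = 512, every 32KiB write is split into 64 log records, which lines up with the inflated operation counts in the iostat output above, and with blocksize = 0 the computed len collapses to 0, so resid never shrinks, which would match the infinite loop mentioned earlier.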

EchterAgo commented 9 months ago

Oh, that makes sense. I don't really have a good grasp of the code yet.

Did you check the iostat -r I showed earlier? In the good case it gets many more ind writes, while in the bad case it gets a lot of agg writes. Both cases run rclone with preallocation though.

With preallocation disabled I never saw any performance difference.

Edit: swapped good/bad

EchterAgo commented 9 months ago

I managed to add preallocation support to the Windows fio port, and it exhibits the same issues as rclone. The patch is not in shape to upstream yet, but build artifacts should be available here shortly: https://github.com/EchterAgo/fio/actions/runs/6835259116

I used the MSYS2 64-bit version.

You'll need a configuration like this:

[global]
ioengine=windowsaio
direct=1
iodepth=8
time_based
runtime=600
directory=H\:\testdir
filename_format=testfile_$jobname_$jobnum_$filenum.bin
bs=32k
size=16G
fallocate=native

[sequential_write]
rw=write
numjobs=8

change the directory parameter (note the escaped :) and then run fio name_of_config_file.fio.

If you change it to fallocate=none it runs fast.

EchterAgo commented 9 months ago

https://github.com/EchterAgo/fio/suites/18107906055/artifacts/1043791681

EchterAgo commented 9 months ago

I noticed that the write speed drops instantly; the gradual decline of the displayed speed in rclone seems to just be its averaging.

EchterAgo commented 9 months ago

Bad results:

Jobs: 8 (f=8): [W(8)][100.0%][w=5433KiB/s][w=169 IOPS][eta 00m:00s]
sequential_write: (groupid=0, jobs=1): err= 0: pid=5588: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=709KiB/s (726kB/s)(416MiB/600291msec); 0 zone resets
    slat (usec): min=53, max=4437, avg=871.41, stdev=570.06
    clat (msec): min=59, max=31377, avg=359.96, stdev=1063.66
     lat (msec): min=59, max=31379, avg=360.84, stdev=1063.68
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  127], 20.00th=[  161],
     | 30.00th=[  264], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  380], 80.00th=[  401], 90.00th=[  422], 95.00th=[  435],
     | 99.00th=[  489], 99.50th=[ 3339], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2629, per=14.62%, avg=830.76, stdev=383.90, samples=1010
   iops        : min=    1, max=   82, avg=25.68, stdev=11.97, samples=1010
  lat (msec)   : 100=3.62%, 250=24.24%, 500=71.39%, 750=0.19%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13307,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8028: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=710KiB/s (727kB/s)(416MiB/600294msec); 0 zone resets
    slat (usec): min=40, max=5751, avg=874.60, stdev=580.65
    clat (msec): min=61, max=31441, avg=359.67, stdev=1073.03
     lat (msec): min=61, max=31442, avg=360.54, stdev=1073.05
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  126], 20.00th=[  161],
     | 30.00th=[  264], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  380], 80.00th=[  401], 90.00th=[  422], 95.00th=[  439],
     | 99.00th=[  493], 99.50th=[ 2802], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2757, per=14.63%, avg=831.46, stdev=385.48, samples=1010
   iops        : min=    1, max=   86, avg=25.70, stdev=12.03, samples=1010
  lat (msec)   : 100=3.67%, 250=24.22%, 500=71.21%, 750=0.34%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=99.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13318,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=6584: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=710KiB/s (727kB/s)(416MiB/600300msec); 0 zone resets
    slat (usec): min=56, max=4394, avg=873.37, stdev=586.18
    clat (msec): min=54, max=31414, avg=359.89, stdev=1064.09
     lat (msec): min=56, max=31415, avg=360.76, stdev=1064.11
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  126], 20.00th=[  161],
     | 30.00th=[  262], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  384], 80.00th=[  401], 90.00th=[  422], 95.00th=[  435],
     | 99.00th=[  493], 99.50th=[ 3339], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2565, per=14.62%, avg=830.89, stdev=385.37, samples=1010
   iops        : min=    1, max=   80, avg=25.68, stdev=12.02, samples=1010
  lat (msec)   : 100=3.70%, 250=24.22%, 500=71.22%, 750=0.29%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13310,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=6324: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=710KiB/s (727kB/s)(416MiB/600297msec); 0 zone resets
    slat (usec): min=54, max=4188, avg=873.87, stdev=583.90
    clat (msec): min=61, max=31394, avg=359.67, stdev=1065.81
     lat (msec): min=61, max=31395, avg=360.54, stdev=1065.84
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  107], 10.00th=[  126], 20.00th=[  161],
     | 30.00th=[  264], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  384], 80.00th=[  401], 90.00th=[  422], 95.00th=[  435],
     | 99.00th=[  485], 99.50th=[ 3339], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2693, per=14.63%, avg=831.46, stdev=385.79, samples=1010
   iops        : min=    1, max=   84, avg=25.70, stdev=12.03, samples=1010
  lat (msec)   : 100=3.66%, 250=24.22%, 500=71.29%, 750=0.26%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.2%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13318,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=5712: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=711KiB/s (728kB/s)(417MiB/600298msec); 0 zone resets
    slat (usec): min=48, max=4943, avg=875.41, stdev=575.74
    clat (msec): min=61, max=31374, avg=359.18, stdev=1062.56
     lat (msec): min=62, max=31376, avg=360.06, stdev=1062.59
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  127], 20.00th=[  161],
     | 30.00th=[  262], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  380], 80.00th=[  397], 90.00th=[  422], 95.00th=[  435],
     | 99.00th=[  489], 99.50th=[ 3339], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2629, per=14.65%, avg=832.57, stdev=385.06, samples=1010
   iops        : min=    1, max=   82, avg=25.73, stdev=12.01, samples=1010
  lat (msec)   : 100=3.72%, 250=24.24%, 500=71.22%, 750=0.25%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13336,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8520: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=709KiB/s (726kB/s)(416MiB/600284msec); 0 zone resets
    slat (usec): min=46, max=5246, avg=875.78, stdev=590.37
    clat (msec): min=53, max=31429, avg=360.06, stdev=1065.23
     lat (msec): min=53, max=31431, avg=360.94, stdev=1065.25
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  107], 10.00th=[  127], 20.00th=[  161],
     | 30.00th=[  262], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  380], 80.00th=[  401], 90.00th=[  422], 95.00th=[  435],
     | 99.00th=[  485], 99.50th=[ 3339], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2693, per=14.62%, avg=830.52, stdev=384.44, samples=1010
   iops        : min=    1, max=   84, avg=25.67, stdev=11.99, samples=1010
  lat (msec)   : 100=3.72%, 250=24.18%, 500=71.35%, 750=0.18%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13303,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=6192: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=709KiB/s (726kB/s)(416MiB/600290msec); 0 zone resets
    slat (usec): min=42, max=5372, avg=883.05, stdev=589.35
    clat (msec): min=57, max=31449, avg=359.95, stdev=1073.91
     lat (msec): min=58, max=31449, avg=360.84, stdev=1073.94
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  126], 20.00th=[  159],
     | 30.00th=[  264], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  384], 80.00th=[  397], 90.00th=[  422], 95.00th=[  439],
     | 99.00th=[  493], 99.50th=[ 2836], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=   62, max= 2629, per=14.62%, avg=830.76, stdev=386.67, samples=1010
   iops        : min=    1, max=   82, avg=25.68, stdev=12.06, samples=1010
  lat (msec)   : 100=3.76%, 250=24.03%, 500=71.40%, 750=0.25%, >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13307,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=7048: Sat Nov 11 23:31:24 2023
  write: IOPS=22, BW=710KiB/s (727kB/s)(416MiB/600299msec); 0 zone resets
    slat (usec): min=51, max=4253, avg=875.25, stdev=582.45
    clat (msec): min=53, max=31397, avg=359.51, stdev=1071.81
     lat (msec): min=53, max=31397, avg=360.38, stdev=1071.83
    clat percentiles (msec):
     |  1.00th=[   80],  5.00th=[  108], 10.00th=[  126], 20.00th=[  161],
     | 30.00th=[  264], 40.00th=[  300], 50.00th=[  351], 60.00th=[  368],
     | 70.00th=[  380], 80.00th=[  397], 90.00th=[  418], 95.00th=[  435],
     | 99.00th=[  489], 99.50th=[ 2836], 99.90th=[17113], 99.95th=[17113],
     | 99.99th=[17113]
   bw (  KiB/s): min=  123, max= 2629, per=14.65%, avg=832.68, stdev=385.08, samples=1009
   iops        : min=    3, max=   82, avg=25.73, stdev=12.01, samples=1009
  lat (msec)   : 100=3.70%, 250=24.11%, 500=71.34%, 750=0.23%, 1000=0.06%
  lat (msec)   : >=2000=0.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=0.1%, 4=0.3%, 8=99.7%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,13324,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=5678KiB/s (5815kB/s), 709KiB/s-711KiB/s (726kB/s-728kB/s), io=3329MiB (3491MB), run=600284-600300msec
EchterAgo commented 9 months ago

Hmmm, very weird, fio is also slow in the case where rclone is fast???? Disabling preallocation makes fio fast again.

Now I'm wondering if rclone actually changes its IO pattern based on some kind of block size detection?

EchterAgo commented 9 months ago

I'm reopening this because I think we still need to figure this out further.

EchterAgo commented 9 months ago

rclone's write pattern is the same whether I set z_blksz or not: always sequential 32KiB writes from multiple threads (8 in this case), with a preallocation size that is rounded up to the next block size. The reported file block size (IRP_MJ_QUERY_VOLUME_INFORMATION on the file) always seems to be 512, which is something we should probably fix too:

image
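
For reference, the 512 that applications see presumably comes from the FileFsSizeInformation class of IRP_MJ_QUERY_VOLUME_INFORMATION; the standard structure (as declared in ntifs.h) is below, though exactly where the Windows port fills it in is an assumption on my part:

typedef struct _FILE_FS_SIZE_INFORMATION {
    LARGE_INTEGER TotalAllocationUnits;
    LARGE_INTEGER AvailableAllocationUnits;
    ULONG SectorsPerAllocationUnit;
    ULONG BytesPerSector;
} FILE_FS_SIZE_INFORMATION, *PFILE_FS_SIZE_INFORMATION;

The block size a tool like rclone observes is SectorsPerAllocationUnit * BytesPerSector, so reporting 1 sector of 512 bytes per allocation unit would produce exactly the 512 shown above.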

EchterAgo commented 9 months ago

Yea, so the IRP_MJ_WRITEs just take a lot longer: in the slow case a bit over 4ms for 32KiB, in the fast case 0.12ms for 32KiB.

EchterAgo commented 9 months ago

Ok, so the difference with fio compared to rclone is that my current fio configuration uses direct IO (non-cached). Disabling direct IO causes fio to behave the same as rclone, including crashing / hanging when aborting.

This is the config:

[global]
ioengine=windowsaio
direct=0
iodepth=8
time_based
runtime=600
directory=D\:\testdir
filename_format=testfile_$jobname_$jobnum_$filenum.bin
bs=32k
size=16G
fallocate=native

[sequential_write]
rw=write
numjobs=8

I wonder why direct IO is always so much slower.

EchterAgo commented 9 months ago

I also noticed that rclone seems to be using a slightly different write path when slow?

Fast:

image

Slow:

image

Notice the difference in path. I'm really not sure why it does this. Call stacks of both look the same:

image

EchterAgo commented 9 months ago

So, to summarize what I found out:

EchterAgo commented 9 months ago

fio results:

Bad zfs-Windows-2.2.0-release (eb7f88a64fe782d2b793d53b56f2bd9bacc9d694):

sequential_write: (groupid=0, jobs=1): err= 0: pid=10028: Sun Nov 12 04:39:08 2023
  write: IOPS=25, BW=826KiB/s (846kB/s)(485MiB/600328msec); 0 zone resets
    slat (usec): min=35, max=7279, avg=929.57, stdev=649.23
    clat (nsec): min=1300, max=11682M, avg=307850031.60, stdev=678554956.15
     lat (usec): min=136, max=11683k, avg=308779.60, stdev=678557.72
    clat percentiles (usec):
     |  1.00th=[     553],  5.00th=[    1352], 10.00th=[    2008],
     | 20.00th=[    3032], 30.00th=[    4080], 40.00th=[    5407],
     | 50.00th=[    8356], 60.00th=[   80217], 70.00th=[  202376],
     | 80.00th=[  425722], 90.00th=[ 1035994], 95.00th=[ 1568670],
     | 99.00th=[ 3372221], 99.50th=[ 3942646], 99.90th=[ 5737808],
     | 99.95th=[11609834], 99.99th=[11744052]
   bw (  KiB/s): min=   62, max=28352, per=18.30%, avg=1588.57, stdev=1685.94, samples=617
   iops        : min=    1, max=  886, avg=49.31, stdev=52.68, samples=617
  lat (usec)   : 2=0.01%, 4=0.01%, 100=0.01%, 250=0.21%, 500=0.64%
  lat (usec)   : 750=0.74%, 1000=1.28%
  lat (msec)   : 2=7.08%, 4=19.13%, 10=22.84%, 20=1.66%, 50=2.70%
  lat (msec)   : 100=5.59%, 250=10.60%, 500=9.90%, 750=4.49%, 1000=2.13%
  lat (msec)   : 2000=8.46%, >=2000=2.55%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=2.8%, 4=46.4%, 8=50.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=97.4%, 8=2.6%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,15505,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=10012: Sun Nov 12 04:39:08 2023
  write: IOPS=37, BW=1197KiB/s (1225kB/s)(702MiB/600546msec); 0 zone resets
    slat (usec): min=34, max=5554, avg=781.06, stdev=564.89
    clat (nsec): min=1100, max=11798M, avg=211887205.32, stdev=610879410.10
     lat (usec): min=97, max=11799k, avg=212668.26, stdev=610935.71
    clat percentiles (usec):
     |  1.00th=[     233],  5.00th=[     586], 10.00th=[     955],
     | 20.00th=[    1631], 30.00th=[    2245], 40.00th=[    2900],
     | 50.00th=[    3589], 60.00th=[    4686], 70.00th=[    8225],
     | 80.00th=[  175113], 90.00th=[  708838], 95.00th=[ 1216349],
     | 99.00th=[ 3003122], 99.50th=[ 3942646], 99.90th=[ 5804917],
     | 99.95th=[ 6006244], 99.99th=[11744052]
   bw (  KiB/s): min=   62, max=74161, per=28.14%, avg=2441.82, stdev=7804.85, samples=581
   iops        : min=    1, max= 2317, avg=75.96, stdev=243.90, samples=581
  lat (usec)   : 2=0.15%, 4=0.10%, 10=0.03%, 20=0.01%, 100=0.01%
  lat (usec)   : 250=0.80%, 500=2.76%, 750=3.39%, 1000=3.46%
  lat (msec)   : 2=15.07%, 4=28.73%, 10=16.86%, 20=0.82%, 50=1.53%
  lat (msec)   : 100=2.97%, 250=5.86%, 500=4.74%, 750=2.88%, 1000=1.55%
  lat (msec)   : 2000=6.52%, >=2000=1.74%
  cpu          : usr=0.00%, sys=0.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.3%, 2=13.4%, 4=50.8%, 8=35.5%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.9%, 8=6.1%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,22455,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=10004: Sun Nov 12 04:39:08 2023
  write: IOPS=39, BW=1255KiB/s (1285kB/s)(736MiB/600158msec); 0 zone resets
    slat (usec): min=38, max=6544, avg=770.36, stdev=551.01
    clat (nsec): min=1400, max=12411M, avg=201795987.20, stdev=591894692.07
     lat (usec): min=130, max=12411k, avg=202566.35, stdev=591953.49
    clat percentiles (usec):
     |  1.00th=[     235],  5.00th=[     594], 10.00th=[     955],
     | 20.00th=[    1614], 30.00th=[    2245], 40.00th=[    2835],
     | 50.00th=[    3523], 60.00th=[    4424], 70.00th=[    6915],
     | 80.00th=[  139461], 90.00th=[  692061], 95.00th=[ 1249903],
     | 99.00th=[ 2768241], 99.50th=[ 3774874], 99.90th=[ 5402264],
     | 99.95th=[ 5737808], 99.99th=[12415140]
   bw (  KiB/s): min=   62, max=71360, per=30.88%, avg=2679.66, stdev=8360.34, samples=555
   iops        : min=    1, max= 2230, avg=83.38, stdev=261.26, samples=555
  lat (usec)   : 2=0.21%, 4=0.11%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 250=0.75%, 500=2.74%, 750=3.32%, 1000=3.49%
  lat (msec)   : 2=15.48%, 4=30.02%, 10=17.04%, 20=0.97%, 50=1.47%
  lat (msec)   : 100=2.71%, 250=5.42%, 500=4.36%, 750=2.37%, 1000=1.10%
  lat (msec)   : 2000=6.88%, >=2000=1.54%
  cpu          : usr=0.00%, sys=0.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.4%, 2=14.7%, 4=51.6%, 8=33.3%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.3%, 8=6.7%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,23544,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=9948: Sun Nov 12 04:39:08 2023
  write: IOPS=39, BW=1269KiB/s (1300kB/s)(744MiB/600257msec); 0 zone resets
    slat (usec): min=37, max=5571, avg=747.25, stdev=537.40
    clat (nsec): min=1500, max=9393.5M, avg=199599392.34, stdev=566584013.84
     lat (usec): min=121, max=9393.9k, avg=200346.64, stdev=566640.96
    clat percentiles (usec):
     |  1.00th=[    235],  5.00th=[    545], 10.00th=[    881],
     | 20.00th=[   1516], 30.00th=[   2114], 40.00th=[   2704],
     | 50.00th=[   3359], 60.00th=[   4228], 70.00th=[   6587],
     | 80.00th=[ 133694], 90.00th=[ 717226], 95.00th=[1266680],
     | 99.00th=[2701132], 99.50th=[3472884], 99.90th=[5469373],
     | 99.95th=[6408897], 99.99th=[9328133]
   bw (  KiB/s): min=   62, max=72640, per=31.42%, avg=2726.08, stdev=8767.12, samples=552
   iops        : min=    1, max= 2270, avg=84.86, stdev=273.99, samples=552
  lat (usec)   : 2=0.13%, 4=0.11%, 10=0.02%, 100=0.01%, 250=0.82%
  lat (usec)   : 500=3.22%, 750=3.60%, 1000=3.93%
  lat (msec)   : 2=16.09%, 4=30.01%, 10=15.76%, 20=0.71%, 50=1.45%
  lat (msec)   : 100=2.53%, 250=5.85%, 500=3.80%, 750=2.07%, 1000=1.48%
  lat (msec)   : 2000=6.81%, >=2000=1.60%
  cpu          : usr=0.00%, sys=0.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.3%, 2=15.1%, 4=51.4%, 8=33.3%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.3%, 8=6.7%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,23813,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=9976: Sun Nov 12 04:39:08 2023
  write: IOPS=40, BW=1290KiB/s (1321kB/s)(757MiB/600813msec); 0 zone resets
    slat (usec): min=41, max=5376, avg=765.91, stdev=551.19
    clat (nsec): min=1200, max=13163M, avg=196334710.01, stdev=593207893.35
     lat (usec): min=116, max=13166k, avg=197100.62, stdev=593258.20
    clat percentiles (usec):
     |  1.00th=[     235],  5.00th=[     570], 10.00th=[     898],
     | 20.00th=[    1516], 30.00th=[    2114], 40.00th=[    2737],
     | 50.00th=[    3392], 60.00th=[    4293], 70.00th=[    6521],
     | 80.00th=[  132645], 90.00th=[  616563], 95.00th=[ 1249903],
     | 99.00th=[ 2231370], 99.50th=[ 3607102], 99.90th=[ 6006244],
     | 99.95th=[ 6274679], 99.99th=[13220447]
   bw (  KiB/s): min=   62, max=71510, per=30.78%, avg=2670.03, stdev=8457.94, samples=573
   iops        : min=    1, max= 2234, avg=83.08, stdev=264.32, samples=573
  lat (usec)   : 2=0.15%, 4=0.08%, 10=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.85%, 500=2.91%, 750=3.81%, 1000=3.77%
  lat (msec)   : 2=16.33%, 4=29.23%, 10=16.96%, 20=0.95%, 50=1.13%
  lat (msec)   : 100=2.69%, 250=5.14%, 500=4.48%, 750=2.51%, 1000=1.05%
  lat (msec)   : 2000=6.30%, >=2000=1.66%
  cpu          : usr=0.00%, sys=0.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.2%, 2=14.6%, 4=50.8%, 8=34.4%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=93.6%, 8=6.4%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,24224,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=9936: Sun Nov 12 04:39:08 2023
  write: IOPS=30, BW=971KiB/s (995kB/s)(569MiB/600258msec); 0 zone resets
    slat (usec): min=43, max=5535, avg=831.52, stdev=603.84
    clat (nsec): min=1300, max=10806M, avg=261601082.24, stdev=649005894.65
     lat (usec): min=157, max=10806k, avg=262432.61, stdev=649042.97
    clat percentiles (usec):
     |  1.00th=[     281],  5.00th=[     725], 10.00th=[    1188],
     | 20.00th=[    1991], 30.00th=[    2704], 40.00th=[    3523],
     | 50.00th=[    4686], 60.00th=[    7177], 70.00th=[  100140],
     | 80.00th=[  316670], 90.00th=[ 1035994], 95.00th=[ 1451230],
     | 99.00th=[ 2902459], 99.50th=[ 4009755], 99.90th=[ 6610224],
     | 99.95th=[ 6811550], 99.99th=[10804528]
   bw (  KiB/s): min=   62, max=70912, per=22.59%, avg=1960.99, stdev=5916.94, samples=588
   iops        : min=    1, max= 2216, avg=60.94, stdev=184.91, samples=588
  lat (usec)   : 2=0.10%, 4=0.04%, 20=0.01%, 50=0.01%, 250=0.67%
  lat (usec)   : 500=2.10%, 750=2.33%, 1000=2.71%
  lat (msec)   : 2=12.07%, 4=24.67%, 10=18.86%, 20=1.10%, 50=1.72%
  lat (msec)   : 100=3.62%, 250=7.78%, 500=6.88%, 750=2.85%, 1000=1.65%
  lat (msec)   : 2000=9.12%, >=2000=1.71%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.2%, 2=9.4%, 4=49.2%, 8=41.2%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.3%, 8=4.7%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,18219,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=10008: Sun Nov 12 04:39:08 2023
  write: IOPS=30, BW=960KiB/s (983kB/s)(563MiB/600257msec); 0 zone resets
    slat (usec): min=42, max=6136, avg=846.13, stdev=607.05
    clat (nsec): min=1100, max=10650M, avg=264571256.47, stdev=650836973.18
     lat (usec): min=107, max=10650k, avg=265417.39, stdev=650862.27
    clat percentiles (usec):
     |  1.00th=[     293],  5.00th=[     717], 10.00th=[    1188],
     | 20.00th=[    1975], 30.00th=[    2737], 40.00th=[    3523],
     | 50.00th=[    4686], 60.00th=[    6980], 70.00th=[   99091],
     | 80.00th=[  341836], 90.00th=[ 1019216], 95.00th=[ 1417675],
     | 99.00th=[ 3003122], 99.50th=[ 4278191], 99.90th=[ 5804917],
     | 99.95th=[ 6073353], 99.99th=[10670310]
   bw (  KiB/s): min=   62, max=72583, per=22.22%, avg=1928.89, stdev=5898.11, samples=591
   iops        : min=    1, max= 2268, avg=59.93, stdev=184.33, samples=591
  lat (usec)   : 2=0.10%, 4=0.03%, 250=0.64%, 500=2.05%, 750=2.50%
  lat (usec)   : 1000=2.70%
  lat (msec)   : 2=12.25%, 4=24.59%, 10=19.23%, 20=0.99%, 50=1.34%
  lat (msec)   : 100=3.64%, 250=6.90%, 500=7.01%, 750=3.79%, 1000=1.85%
  lat (msec)   : 2000=8.42%, >=2000=1.98%
  cpu          : usr=0.00%, sys=0.50%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=9.7%, 4=49.3%, 8=40.8%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=95.2%, 8=4.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,18013,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=10016: Sun Nov 12 04:39:08 2023
  write: IOPS=28, BW=912KiB/s (934kB/s)(535MiB/600259msec); 0 zone resets
    slat (usec): min=40, max=5853, avg=874.99, stdev=627.20
    clat (nsec): min=1100, max=12888M, avg=278624238.24, stdev=672448636.08
     lat (usec): min=245, max=12889k, avg=279499.23, stdev=672471.05
    clat percentiles (usec):
     |  1.00th=[     334],  5.00th=[     873], 10.00th=[    1434],
     | 20.00th=[    2311], 30.00th=[    3130], 40.00th=[    4146],
     | 50.00th=[    5800], 60.00th=[   14615], 70.00th=[  154141],
     | 80.00th=[  375391], 90.00th=[  985662], 95.00th=[ 1551893],
     | 99.00th=[ 3238003], 99.50th=[ 3741320], 99.90th=[ 5872026],
     | 99.95th=[ 6006244], 99.99th=[12952011]
   bw (  KiB/s): min=   62, max=71936, per=20.61%, avg=1788.46, stdev=4683.82, samples=606
   iops        : min=    1, max= 2248, avg=55.53, stdev=146.39, samples=606
  lat (usec)   : 2=0.09%, 4=0.02%, 10=0.01%, 20=0.01%, 250=0.41%
  lat (usec)   : 500=1.62%, 750=1.85%, 1000=2.00%
  lat (msec)   : 2=10.31%, 4=22.62%, 10=19.90%, 20=1.41%, 50=1.88%
  lat (msec)   : 100=4.10%, 250=9.11%, 500=8.76%, 750=4.31%, 1000=1.65%
  lat (msec)   : 2000=7.38%, >=2000=2.56%
  cpu          : usr=0.00%, sys=0.33%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.1%, 2=6.8%, 4=47.2%, 8=45.9%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=96.2%, 8=3.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,17116,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=8676KiB/s (8884kB/s), 826KiB/s-1290KiB/s (846kB/s-1321kB/s), io=5090MiB (5338MB), run=600158-600813msec
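
For reference, both this run and the good run further down were generated with a fio sequential-write workload; a minimal job file along these lines should reproduce it. This is an assumed reconstruction, not the original file: the job name, 8 jobs, queue depth 8, ~32 KiB writes (from the bandwidth/IOPS ratio) and the 600 s time-based runtime are taken from the output above, while the ioengine, file size and target directory are assumptions.

; sketch of the assumed fio job file
; numjobs, iodepth, bs and runtime are inferred from the results;
; ioengine, size and directory are assumptions
[sequential_write]
ioengine=windowsaio
rw=write
bs=32k
numjobs=8
iodepth=8
time_based=1
runtime=600
size=4g
directory=Z\:\fiotest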

Good (with the following patch applied):

diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
        zp->z_zfsvfs = zfsvfs;
        mutex_exit(&zfsvfs->z_znodes_lock);

-       if (zp->z_blksz == 0)
-               zp->z_blksz = zfs_blksz(zp);
-
        return (zp);
 }
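
For clarity, the two removed lines are the whole change: when a znode was allocated with z_blksz still zero, they filled it in from zfs_blksz(). A commented view of the removed code follows; the description of what zfs_blksz() computes is an assumption based on its name, not taken from its source.

	/*
	 * Removed by the diff above: if the freshly allocated znode has no
	 * block size recorded yet (z_blksz == 0), derive one via zfs_blksz()
	 * (presumably recomputing the object's block size).
	 * Dropping this initialization produces the fast results below.
	 */
	if (zp->z_blksz == 0)
		zp->z_blksz = zfs_blksz(zp);
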
sequential_write: (groupid=0, jobs=1): err= 0: pid=6380: Sun Nov 12 04:53:38 2023
  write: IOPS=1139, BW=35.6MiB/s (37.3MB/s)(20.9GiB/600015msec); 0 zone resets
    slat (usec): min=27, max=1618.5k, avg=456.55, stdev=4595.92
    clat (nsec): min=700, max=1625.9M, avg=5272008.86, stdev=24928883.42
     lat (usec): min=59, max=1626.1k, avg=5728.56, stdev=25345.08
    clat percentiles (usec):
     |  1.00th=[    99],  5.00th=[   249], 10.00th=[   388], 20.00th=[   635],
     | 30.00th=[   906], 40.00th=[  1221], 50.00th=[  1582], 60.00th=[  1991],
     | 70.00th=[  2507], 80.00th=[  3195], 90.00th=[  4555], 95.00th=[  9110],
     | 99.00th=[128451], 99.50th=[175113], 99.90th=[316670], 99.95th=[358613],
     | 99.99th=[526386]
   bw (  KiB/s): min=  128, max=114020, per=12.56%, avg=36829.87, stdev=27553.48, samples=1172
   iops        : min=    4, max= 3563, avg=1150.53, stdev=861.00, samples=1172
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.29%, 4=0.13%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.57%, 250=4.05%, 500=9.41%, 750=9.88%, 1000=8.74%
  lat (msec)   : 2=27.18%, 4=26.94%, 10=8.06%, 20=1.65%, 50=1.23%
  lat (msec)   : 100=0.32%, 250=1.31%, 500=0.19%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=18.50%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.5%, 4=56.8%, 8=19.3%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.2%, 8=10.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,683421,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8360: Sun Nov 12 04:53:38 2023
  write: IOPS=1148, BW=35.9MiB/s (37.6MB/s)(21.0GiB/600033msec); 0 zone resets
    slat (usec): min=29, max=1618.6k, avg=454.85, stdev=4799.65
    clat (nsec): min=600, max=1620.8M, avg=5212781.94, stdev=24071738.20
     lat (usec): min=53, max=1621.2k, avg=5667.64, stdev=24542.01
    clat percentiles (usec):
     |  1.00th=[    98],  5.00th=[   241], 10.00th=[   379], 20.00th=[   619],
     | 30.00th=[   881], 40.00th=[  1205], 50.00th=[  1565], 60.00th=[  1975],
     | 70.00th=[  2474], 80.00th=[  3195], 90.00th=[  4555], 95.00th=[  8979],
     | 99.00th=[128451], 99.50th=[173016], 99.90th=[308282], 99.95th=[333448],
     | 99.99th=[633340]
   bw (  KiB/s): min=  128, max=127313, per=12.66%, avg=37122.23, stdev=27642.35, samples=1173
   iops        : min=    4, max= 3978, avg=1159.65, stdev=863.77, samples=1173
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.28%, 4=0.12%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.62%, 250=4.26%, 500=9.66%, 750=10.17%, 1000=8.78%
  lat (msec)   : 2=26.76%, 4=26.64%, 10=7.99%, 20=1.65%, 50=1.20%
  lat (msec)   : 100=0.34%, 250=1.33%, 500=0.17%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=18.83%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.7%, 4=57.0%, 8=18.9%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.1%, 8=10.9%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,689243,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8356: Sun Nov 12 04:53:38 2023
  write: IOPS=1135, BW=35.5MiB/s (37.2MB/s)(20.8GiB/600004msec); 0 zone resets
    slat (usec): min=28, max=1618.6k, avg=471.20, stdev=5349.31
    clat (nsec): min=900, max=1620.6M, avg=5242952.86, stdev=24132216.73
     lat (usec): min=66, max=1620.9k, avg=5714.15, stdev=24712.24
    clat percentiles (usec):
     |  1.00th=[   102],  5.00th=[   255], 10.00th=[   400], 20.00th=[   660],
     | 30.00th=[   938], 40.00th=[  1254], 50.00th=[  1614], 60.00th=[  2024],
     | 70.00th=[  2540], 80.00th=[  3261], 90.00th=[  4686], 95.00th=[  9372],
     | 99.00th=[125305], 99.50th=[170918], 99.90th=[308282], 99.95th=[333448],
     | 99.99th=[557843]
   bw (  KiB/s): min=  500, max=135759, per=12.51%, avg=36703.27, stdev=26848.21, samples=1172
   iops        : min=   15, max= 4242, avg=1146.55, stdev=838.97, samples=1172
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.29%, 4=0.16%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.48%, 250=3.87%, 500=8.93%, 750=9.66%, 1000=8.77%
  lat (msec)   : 2=27.43%, 4=27.16%, 10=8.42%, 20=1.67%, 50=1.33%
  lat (msec)   : 100=0.36%, 250=1.27%, 500=0.17%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=19.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.2%, 4=56.6%, 8=19.6%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.4%, 8=10.6%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,681247,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=6252: Sun Nov 12 04:53:38 2023
  write: IOPS=1163, BW=36.4MiB/s (38.1MB/s)(21.3GiB/600005msec); 0 zone resets
    slat (usec): min=28, max=1619.7k, avg=448.13, stdev=4857.41
    clat (nsec): min=900, max=1621.9M, avg=5150571.87, stdev=23767175.05
     lat (usec): min=52, max=1622.1k, avg=5598.70, stdev=24254.03
    clat percentiles (usec):
     |  1.00th=[    93],  5.00th=[   233], 10.00th=[   367], 20.00th=[   611],
     | 30.00th=[   873], 40.00th=[  1188], 50.00th=[  1549], 60.00th=[  1958],
     | 70.00th=[  2474], 80.00th=[  3163], 90.00th=[  4555], 95.00th=[  9372],
     | 99.00th=[125305], 99.50th=[168821], 99.90th=[299893], 99.95th=[333448],
     | 99.99th=[633340]
   bw (  KiB/s): min=  192, max=238492, per=12.83%, avg=37615.90, stdev=29260.82, samples=1173
   iops        : min=    6, max= 7452, avg=1175.09, stdev=914.35, samples=1173
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.25%, 4=0.22%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.69%, 250=4.40%, 500=9.82%, 750=10.14%, 1000=8.75%
  lat (msec)   : 2=26.79%, 4=26.33%, 10=7.79%, 20=1.72%, 50=1.30%
  lat (msec)   : 100=0.31%, 250=1.29%, 500=0.17%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=18.67%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.5%, 4=56.8%, 8=19.2%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.2%, 8=10.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,698194,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8388: Sun Nov 12 04:53:38 2023
  write: IOPS=1153, BW=36.1MiB/s (37.8MB/s)(21.1GiB/600003msec); 0 zone resets
    slat (usec): min=28, max=1618.3k, avg=452.19, stdev=4883.17
    clat (nsec): min=600, max=1623.7M, avg=5221157.53, stdev=24483382.89
     lat (usec): min=66, max=1624.1k, avg=5673.35, stdev=24962.35
    clat percentiles (usec):
     |  1.00th=[    96],  5.00th=[   239], 10.00th=[   375], 20.00th=[   619],
     | 30.00th=[   881], 40.00th=[  1188], 50.00th=[  1549], 60.00th=[  1958],
     | 70.00th=[  2474], 80.00th=[  3163], 90.00th=[  4490], 95.00th=[  8979],
     | 99.00th=[128451], 99.50th=[173016], 99.90th=[308282], 99.95th=[333448],
     | 99.99th=[641729]
   bw (  KiB/s): min=  498, max=247953, per=12.71%, avg=37264.06, stdev=28461.24, samples=1173
   iops        : min=   15, max= 7748, avg=1164.10, stdev=889.38, samples=1173
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.26%, 4=0.15%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.66%, 250=4.30%, 500=9.74%, 750=10.10%, 1000=8.72%
  lat (msec)   : 2=26.95%, 4=26.59%, 10=7.83%, 20=1.59%, 50=1.24%
  lat (msec)   : 100=0.33%, 250=1.31%, 500=0.18%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=19.17%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.7%, 4=56.9%, 8=19.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.2%, 8=10.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,692245,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=8320: Sun Nov 12 04:53:38 2023
  write: IOPS=1150, BW=35.9MiB/s (37.7MB/s)(21.1GiB/600003msec); 0 zone resets
    slat (usec): min=28, max=1618.6k, avg=445.47, stdev=4232.55
    clat (nsec): min=800, max=1630.6M, avg=5279443.64, stdev=24975772.09
     lat (usec): min=70, max=1630.7k, avg=5724.92, stdev=25328.10
    clat percentiles (usec):
     |  1.00th=[    99],  5.00th=[   241], 10.00th=[   379], 20.00th=[   627],
     | 30.00th=[   889], 40.00th=[  1205], 50.00th=[  1549], 60.00th=[  1958],
     | 70.00th=[  2474], 80.00th=[  3163], 90.00th=[  4555], 95.00th=[  9241],
     | 99.00th=[126354], 99.50th=[168821], 99.90th=[316670], 99.95th=[350225],
     | 99.99th=[666895]
   bw (  KiB/s): min=  499, max=162171, per=12.65%, avg=37107.40, stdev=28312.63, samples=1174
   iops        : min=   15, max= 5067, avg=1159.20, stdev=884.72, samples=1174
  lat (nsec)   : 1000=0.01%
  lat (usec)   : 2=0.28%, 4=0.11%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.60%, 250=4.28%, 500=9.63%, 750=10.15%, 1000=8.76%
  lat (msec)   : 2=27.04%, 4=26.52%, 10=7.81%, 20=1.59%, 50=1.35%
  lat (msec)   : 100=0.35%, 250=1.28%, 500=0.19%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.67%, sys=19.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.6%, 4=56.9%, 8=19.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.2%, 8=10.8%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,690008,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=9700: Sun Nov 12 04:53:38 2023
  write: IOPS=1136, BW=35.5MiB/s (37.2MB/s)(20.8GiB/600011msec); 0 zone resets
    slat (usec): min=28, max=1619.6k, avg=465.10, stdev=5020.55
    clat (nsec): min=600, max=1620.7M, avg=5283678.99, stdev=24997035.41
     lat (usec): min=57, max=1623.4k, avg=5748.77, stdev=25489.80
    clat percentiles (usec):
     |  1.00th=[   103],  5.00th=[   255], 10.00th=[   400], 20.00th=[   660],
     | 30.00th=[   930], 40.00th=[  1254], 50.00th=[  1598], 60.00th=[  2008],
     | 70.00th=[  2540], 80.00th=[  3228], 90.00th=[  4621], 95.00th=[  9503],
     | 99.00th=[127402], 99.50th=[173016], 99.90th=[308282], 99.95th=[337642],
     | 99.99th=[641729]
   bw (  KiB/s): min=  256, max=104773, per=12.54%, avg=36765.79, stdev=27041.13, samples=1172
   iops        : min=    8, max= 3274, avg=1148.52, stdev=844.99, samples=1172
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.29%, 4=0.13%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.49%, 250=3.90%, 500=8.97%, 750=9.78%, 1000=8.74%
  lat (msec)   : 2=27.33%, 4=27.17%, 10=8.33%, 20=1.74%, 50=1.28%
  lat (msec)   : 100=0.37%, 250=1.23%, 500=0.19%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.50%, sys=19.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.4%, 4=56.7%, 8=19.5%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.3%, 8=10.7%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,682043,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8
sequential_write: (groupid=0, jobs=1): err= 0: pid=10012: Sun Nov 12 04:53:38 2023
  write: IOPS=1139, BW=35.6MiB/s (37.3MB/s)(20.9GiB/600136msec); 0 zone resets
    slat (usec): min=29, max=1618.5k, avg=459.64, stdev=4991.50
    clat (nsec): min=500, max=1620.3M, avg=5263147.47, stdev=24610201.30
     lat (usec): min=60, max=1620.5k, avg=5722.78, stdev=25105.81
    clat percentiles (usec):
     |  1.00th=[   102],  5.00th=[   255], 10.00th=[   396], 20.00th=[   652],
     | 30.00th=[   922], 40.00th=[  1237], 50.00th=[  1598], 60.00th=[  2008],
     | 70.00th=[  2507], 80.00th=[  3195], 90.00th=[  4555], 95.00th=[  9110],
     | 99.00th=[128451], 99.50th=[175113], 99.90th=[312476], 99.95th=[341836],
     | 99.99th=[534774]
   bw (  KiB/s): min=  192, max=119346, per=12.57%, avg=36866.39, stdev=27250.18, samples=1172
   iops        : min=    6, max= 3729, avg=1151.67, stdev=851.52, samples=1172
  lat (nsec)   : 750=0.01%, 1000=0.01%
  lat (usec)   : 2=0.29%, 4=0.14%, 10=0.01%, 20=0.01%, 50=0.01%
  lat (usec)   : 100=0.51%, 250=3.91%, 500=9.09%, 750=9.80%, 1000=8.76%
  lat (msec)   : 2=27.36%, 4=27.49%, 10=7.88%, 20=1.67%, 50=1.23%
  lat (msec)   : 100=0.34%, 250=1.31%, 500=0.17%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2000=0.01%
  cpu          : usr=0.67%, sys=19.00%, ctx=0, majf=0, minf=0
  IO depths    : 1=0.5%, 2=23.3%, 4=56.6%, 8=19.6%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=89.3%, 8=10.7%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,684054,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=8

Run status group 0 (all jobs):
  WRITE: bw=286MiB/s (300MB/s), 35.5MiB/s-36.4MiB/s (37.2MB/s-38.1MB/s), io=168GiB (180GB), run=600003-600136msec
lundman commented 9 months ago

OK, if I have understood things, I cleaned up the code in both places. If we are happy with it, we can close this.

Sorry for the delay, I wanted to push out the macOS version first.