openzfsonwindows / ZFSin

OpenZFS on Windows port
https://openzfsonwindows.org

All writers block on same hash table mutex hurting performance #238

Open imtiazdc opened 4 years ago

imtiazdc commented 4 years ago

threads.txt

I see a performance issue where all writers try to acquire the same mutex in dbuf_hash_table while writing to a zvol (RAW disk). I am using Iometer to generate the workload. During live debugging, I found 8 threads executing dbuf_find (as part of zvol_write_win), of which 7 are waiting for the mutex held by the 8th. The output of !stacks 2 zfsin is attached.

    dmu_buf_impl_t *
    dbuf_find(objset_t *os, uint64_t obj, uint8_t level, uint64_t blkid)
    {
        dbuf_hash_table_t *h = &dbuf_hash_table;
        uint64_t hv = dbuf_hash(os, obj, level, blkid);
        uint64_t idx;
        dmu_buf_impl_t *db;
        idx = hv & h->hash_table_mask;
        mutex_enter(DBUF_HASH_MUTEX(h, idx));    // 7 threads wait here for the "mutex" held by 8th thread

Here's the problem: all 8 threads call dbuf_find with the same input parameters (shown below), 0xffffc78f`a4734180 0 0 0, which causes them to map to the same mutex within the hash table.
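
For reference, here is roughly how the bucket mutex gets chosen (paraphrased from the upstream dbuf headers of that era; exact names and constants may differ in ZFSin):

```
/* Paraphrased from upstream dbuf.h -- constants may differ in ZFSin. */
#define DBUF_MUTEXES            256
#define DBUF_HASH_MUTEX(h, idx) (&(h)->hash_mutexes[(idx) & (DBUF_MUTEXES - 1)])

/* In dbuf_find(): identical (os, obj, level, blkid) -> identical hash ->  */
/* identical bucket index -> the identical kmutex_t for all eight writers. */
uint64_t hv  = dbuf_hash(os, obj, level, blkid);
uint64_t idx = hv & h->hash_table_mask;
mutex_enter(DBUF_HASH_MUTEX(h, idx));
```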

Is this expected and unavoidable for zvol writes? Are there ways to improve it?

Below is my setup:

VM configuration: 4 vCPU, 8 GB RAM
zvol settings: 8 GB (thick provisioned); everything default (dedup=off, compression=off, ...)
Iometer settings: 4 workers; data pattern = full random; number of outstanding IOs = 64; access spec = 4 KiB, 0% read, 100% random

Thanks, Imtiaz

imtiazdc commented 4 years ago

As can be seen from the stack traces below, the writer threads are waiting for the mutex in open context (not syncing context).

Stack trace of writer thread that has acquired the hash table mutex

```
 4.001524  ffffc78fb6226040 0000001 RUNNING    nt!DbgBreakPointWithStatus
           nt!KeAccumulateTicks+0x39c
           nt!KeClockInterruptNotify+0xb8
           hal!HalpTimerClockIpiRoutine+0x15
           nt!KiCallInterruptServiceRoutine+0x106
           nt!KiInterruptSubDispatchNoLockNoEtw+0xea
           nt!KiInterruptDispatchNoLockNoEtw+0x37
           ZFSin!spl_mutex_enter+0x14b
           ZFSin!dbuf_find+0x105
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16
```

Stack traces of 7 writer threads that are waiting for the mutex acquired by the thread above

```
 4.000ee4  ffffc78fb60c1800 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.001628  ffffc78fb5c07800 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.001688  ffffc780003e4040 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.001820  ffffc78000f55040 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.001840  ffffc7800230e040 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.00186c  ffffc78002302040 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16

 4.0015c8  ffffc7801ade5040 0000000 Blocked    nt!KiSwapContext+0x76
           nt!KiSwapThread+0x17d
           nt!KiCommitThreadWait+0x14f
           nt!KeWaitForSingleObject+0x377
           ZFSin!spl_mutex_enter+0x12b
           ZFSin!dbuf_find+0x7c
           ZFSin!__dbuf_hold_impl+0x23f
           ZFSin!dbuf_hold_impl+0x87
           ZFSin!dbuf_hold_level+0x50
           ZFSin!dbuf_hold+0x29
           ZFSin!dnode_hold_impl+0x40c
           ZFSin!dnode_hold+0x44
           ZFSin!dmu_tx_hold_object_impl+0x44
           ZFSin!dmu_tx_hold_write+0x176
           ZFSin!zvol_write_win+0x1f4
           ZFSin!wzvol_WkRtn+0x267
           ZFSin!wzvol_GeneralWkRtn+0x6d
           nt!IopProcessWorkItem+0x12a
           nt!ExpWorkerThread+0xe9
           nt!PspSystemThreadStartup+0x41
           nt!KiStartSystemThread+0x16
```

lundman commented 4 years ago

This thread:

   4.001524  ffffc78fb6226040 0000001 RUNNING    nt!DbgBreakPointWithStatus
                                        nt!KeAccumulateTicks+0x39c
                                        nt!KeClockInterruptNotify+0xb8
                                        hal!HalpTimerClockIpiRoutine+0x15
                                        nt!KiCallInterruptServiceRoutine+0x106
                                        nt!KiInterruptSubDispatchNoLockNoEtw+0xea
                                        nt!KiInterruptDispatchNoLockNoEtw+0x37
                                        ZFSin!spl_mutex_enter+0x14b

Is it still alive? Does it eventually finish the task and release the mutex? It's just that the stack looks exactly like what you see with a DPC_WATCHDOG_VIOLATION, where it has actually crashed.

imtiazdc commented 4 years ago

Is it still alive? Yes. Does it eventually finish the task and release the mutex? Yes.

There is no crash, just that write performance is significantly degraded. In perfmon, we see writes happening at around 10 MBps on the zvol for a couple of seconds, then dropping to 0 for a few seconds, then another burst of writes, then back to 0, and so on.

I suspect the way the locks are acquired is hurting throughput.

lundman commented 4 years ago

Not particularly familiar with this code, but all 8 threads go down from zvol_write_win():

                dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);

Here ZVOL_OBJ == 1. That call enters err = dnode_hold(os, object, FTAG, &dn); where only the object is passed along, not the offset or length.

We convert it:

        mdn = DMU_META_DNODE(os);
        blk = dbuf_whichblock(mdn, 0, object * sizeof (dnode_phys_t));

which is computed to the same value each time.

Even though there are 8 threads writing to different locations, they are all blocked by one mutex due to:

        uint64_t hv = dbuf_hash(os, obj, level, blkid);

Since none of the arguments change, the result is computed to the same value every time.
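
To make that concrete, a simplified sketch of the constant path (not verbatim source):

```
/* Simplified sketch, not verbatim source.  ZVOL_OBJ is always 1 for a
 * zvol, so every writer computes the same meta-dnode block, no matter
 * what offset or length it is writing. */
dnode_t  *mdn = DMU_META_DNODE(os);            /* one meta-dnode per objset */
uint64_t  blk = dbuf_whichblock(mdn, 0,
    ZVOL_OBJ * sizeof (dnode_phys_t));         /* constant: object == 1 */
/* dbuf_hold(mdn, blk, FTAG) then reaches dbuf_find() with the same
 * (os, object, level, blkid) tuple on every write, so dbuf_hash()
 * always picks the same bucket and the same hash mutex. */
```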

@ahrens Did I miss anything? Are there any mitigations for multiple writing threads?

ahrens commented 4 years ago

I would recommend using dmu_tx_hold_write_by_dnode() instead of dmu_tx_hold_write(), with the dnode_t* saved in a zvol-specific struct when the zvol is opened. We should make the same change in OpenZFS as well.
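
A rough sketch of how that could look in the zvol path (the field and helper names such as zv_dn and zvol_dnode_setup() are illustrative only, not the actual patch):

```
/*
 * Rough sketch only -- not the actual ZFSin change.  Helper and field
 * names (zv_dn, zv_objset, zvol_dnode_setup/teardown) are illustrative.
 */

/* Hold the dnode once, when the zvol is brought online. */
static int
zvol_dnode_setup(zvol_state_t *zv)
{
	/* Keep a long-lived hold on object 1 (ZVOL_OBJ) in the zvol's objset. */
	return (dnode_hold(zv->zv_objset, ZVOL_OBJ, zv, &zv->zv_dn));
}

/* In the write path, instead of dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes): */
static void
zvol_tx_hold(zvol_state_t *zv, dmu_tx_t *tx, uint64_t off, int bytes)
{
	/* Skips dnode_hold() and therefore the shared meta-dnode dbuf lookup. */
	dmu_tx_hold_write_by_dnode(tx, zv->zv_dn, off, bytes);
	/* The data copy can likewise use dmu_write_by_dnode()/dmu_read_by_dnode(). */
}

/* Drop the hold when the zvol is torn down. */
static void
zvol_dnode_teardown(zvol_state_t *zv)
{
	dnode_rele(zv->zv_dn, zv);
	zv->zv_dn = NULL;
}
```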

imtiazdc commented 4 years ago

Unfortunately, the structs are different in Linux and Windows. Specifically, dnode_t* is missing from Windows' zvol_state_t.
Windows: https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/include/sys/zvol.h#L59
Linux: https://github.com/openzfs/zfs/blob/master/include/sys/zvol_impl.h#L41

Thoughts??
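
If it is only the missing field, one hedged option would be to mirror the Linux layout in the Windows header (sketch only; the field name is borrowed from zvol_impl.h):

```
/* Hypothetical addition to ZFSin's zvol_state (zvol.h); the zv_dn name
 * mirrors Linux's zvol_impl.h and is illustrative only. */
typedef struct zvol_state {
	/* ... existing ZFSin fields ... */
	dnode_t		*zv_dn;		/* long-lived hold on ZVOL_OBJ */
	/* ... */
} zvol_state_t;
```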

lundman commented 4 years ago

Awesome, cheers @ahrens .

@imtiazdc Let me work on porting the PR over, adapting zvol_state to our needs and working out how to save the dnode when the zvol is opened.

lundman commented 4 years ago

OK it would look something like this (2 commits): https://github.com/openzfsonwindows/ZFSin/tree/bydnode/ZFSin

15e6376e87330348c9a8f013e42bbedc949c3a71 65faac1ab1468a43fea91a1933a4a95c1fc7f094

Took the chance to clean things up again: we don't actually need the dmu_(read|write)_win() style functions; that is something the OS X port needs in order to pass the C++ class around.

I've not had a chance to test it much.

imtiazdc commented 4 years ago

image

Thanks @ahrens @lundman for the quick turnaround!

Above is a snapshot with these changes. On the left is write performance on the zvol (8 GB, thick provisioned, default settings, no dedup, etc.). On the right is write performance on the zpool (I used the Fixed disk type when serving the VHD from Hyper-V to this VM).

I still see writes dropping to 0 on the zvol (on the zpool the writes stay continuous). However, the drops to 0 appear much less frequently than without this latest change. Overall, the change looks good.

imtiazdc commented 4 years ago

image

With dedup ON, we can see significantly lower writes on the zpool!

imtiazdc commented 4 years ago

@lundman I just picked up all your changes from the bydnode branch. Here are some observations:

  1. The writes have gone back to being choppy. They looked better with the straight port from OpenZFS (without your sync variable change).
  2. I saw 598 zfsin threads in the debugger when I tried to investigate the write choppiness. Is that a cause for concern? Would you know what is causing it? Last time, I remember there were around 390 kernel threads belonging to zfsin.
  3. Of the 598, 333 threads are in a blocked state with the stack trace below:
    4.0014d8  ffffd709728c3040 0009f1a Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0x17d
                                        nt!KiCommitThreadWait+0x14f
                                        nt!KeWaitForMultipleObjects+0x1fe
                                        ZFSin!spl_cv_wait+0xf3
                                        ZFSin!taskq_thread_wait+0xb2
                                        ZFSin!taskq_thread+0x39e
                                        nt!PspSystemThreadStartup+0x41
                                        nt!KiStartSystemThread+0x16

    and 250 threads are in a blocked state with the stack trace below:

    4.001648  ffffd7097258b040 000018b Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0x17d
                                        nt!KiCommitThreadWait+0x14f
                                        nt!KeWaitForMultipleObjects+0x1fe
                                        ZFSin!spl_cv_wait+0xf3
                                        ZFSin!txg_wait_synced+0x13b
                                        ZFSin!dmu_tx_wait+0x2b6
                                        ZFSin!dmu_tx_assign+0x17b
                                        ZFSin!zvol_write+0x235
                                        ZFSin!wzvol_WkRtn+0x419
                                        ZFSin!wzvol_GeneralWkRtn+0x6d
                                        nt!IopProcessWorkItem+0x12a
                                        nt!ExpWorkerThread+0xe9
                                        nt!PspSystemThreadStartup+0x41
                                        nt!KiStartSystemThread+0x16
  4. There is one spa_sync thread, and it is also blocked:
    4.00160c  ffffd7097290f080 0000191 Blocked    nt!KiSwapContext+0x76
                                        nt!KiSwapThread+0x17d
                                        nt!KiCommitThreadWait+0x14f
                                        nt!KeWaitForMultipleObjects+0x1fe
                                        ZFSin!spl_cv_wait+0xf3
                                        ZFSin!zio_wait+0x1b5
                                        ZFSin!dsl_pool_sync+0x2e7
                                        ZFSin!spa_sync+0xb23
                                        ZFSin!txg_sync_thread+0x3f9
                                        nt!PspSystemThreadStartup+0x41
                                        nt!KiStartSystemThread+0x16

Does that give any clue as to what might have gone wrong?

Stacks of all threads running zfsin code if that helps: threads.txt

imtiazdc commented 4 years ago

Update: going by the discussion on the write amplification issue, I specified ashift and volblocksize explicitly, and write amplification does look under control. However, the write choppiness remains.

    zpool create -o ashift=12 tank PHYSICALDRIVE2
    zfs create -s -V 8g -o volblocksize=4k tank/zvol

I will break into the debugger after a few hours of Iometer writes and see what is happening with the thread count and the locking.

lundman commented 4 years ago

OK, let's see. The sync commit change was that "sync" was always true, no matter what. With sync writes on, the zvol will wait for the data to reach the disk before continuing. Safe, but slow. With sync=standard the writes get bunched together and flushed with spa_sync(txg). In theory, this should get you better/faster writes.

But more importantly, it should be user selectable now, i.e. zfs set sync=always pool/volume should bring back the "smooth" writes as before.
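
For example (assuming the Windows zfs.exe exposes the same property syntax as upstream):

```
zfs get sync tank/zvol            # "standard" unless changed
zfs set sync=always tank/zvol     # every write waits for stable storage ("smooth" but slower)
zfs set sync=standard tank/zvol   # batch writes into txgs and flush with spa_sync() (default)
```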


Now, I've often wondered if there is some issue in the mutex/condvar code: could it be missing signals/broadcasts? Sometimes a thread looks like it was supposed to have been signaled but sits there doing nothing until the next signal comes in. Is this something you have noticed?


ZFSin!taskq_thread_wait+0xb2

ZFS spawns a bunch of taskq threads, that idle waiting for something to do: https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/module/zfs/spa.c#L157

and when they are given a task via taskq_dispatch*(), they run it, then return to idle, ready to do more work.

The taskq "number of threads" is based on a bunch of things, number of cores, writers, each pool has a number, each dataset, etc. But it is worth noting those values are all "Solaris" defaults. I have not yet looked at what tweaks Windows might want. We might be over-creating threads, but Windows does seem pretty decent at threads.


One txg will fill with data until it hits the per-txg limit, then it waits for the txg to quiesce and for spa_sync to complete, and then it starts again. The txg limits have knobs you can tune and tweak. https://www.delphix.com/blog/delphix-engineering/zfs-fundamentals-transaction-groups

I'm no expert with those tunables, so experiment and report what you find.

There is also a write throttle, which I believe was added so there is always a little room left and other datasets are not starved out.


Are we making progress at least?

imtiazdc commented 4 years ago

Compression seems to be happening although I didn't turn it on (I am using default settings). Is this expected?

    # Child-SP          RetAddr           Call Site
    00 fffff803`e632cd88 fffff803`e463b49c nt!DbgBreakPointWithStatus
    01 fffff803`e632cd90 fffff803`e4638b29 nt!KeAccumulateTicks+0x39c
    02 fffff803`e632cdf0 fffff803`e4e28366 nt!KeClockInterruptNotify+0x469
    03 fffff803`e632cf40 fffff803`e46b35f6 hal!HalpTimerClockInterrupt+0x56
    04 fffff803`e632cf70 fffff803`e4766d8a nt!KiCallInterruptServiceRoutine+0x106
    05 fffff803`e632cfb0 fffff803`e4767277 nt!KiInterruptSubDispatchNoLockNoEtw+0xea
    06 ffff9e80`68097270 fffff80c`398408b3 nt!KiInterruptDispatchNoLockNoEtw+0x37
    07 ffff9e80`68097400 fffff80c`39841890 ZFSin!LZ4_compressCtx+0x1c3 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 519]
    08 ffff9e80`680974e0 fffff80c`39841624 ZFSin!real_LZ4_compress+0xf0 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 858]
    09 ffff9e80`68097540 fffff80c`39967f39 ZFSin!lz4_compress_zfs+0xa4 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 61]
    0a ffff9e80`68097590 fffff80c`3994f79e ZFSin!zio_compress_data+0x1a9 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio_compress.c @ 126]
    0b ffff9e80`68097620 fffff80c`39958812 ZFSin!zio_write_compress+0x99e [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 1618]
    0c ffff9e80`68097a40 fffff80c`396e16cf ZFSin!__zio_execute+0x342 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 1997]
    0d ffff9e80`68097ad0 fffff803`e471226d ZFSin!taskq_thread+0x48f [C:\Users\imohammad\Downloads\ZFSin\ZFSin\spl\module\spl\spl-taskq.c @ 1610]
    0e ffff9e80`68097b90 fffff803`e476c896 nt!PspSystemThreadStartup+0x41
    0f ffff9e80`68097be0 00000000`00000000 nt!KiStartSystemThread+0x16

imtiazdc commented 4 years ago

With sync=standard the writes get bunched together and flushed with spa_sync(txg). In theory, this should get you better/faster writes.

What is the downside of using "standard" setting for sync? Can there be data loss?

lundman commented 4 years ago

Yes, up to one txg. The pool will always be consistent (pointing either to the txg-1 data or to the txg data, since the uberblock is updated last), but the last ~5s of writes may not be there.

imtiazdc commented 4 years ago

OK. Thanks for clarifying. Is 1 txg always 5s worth of data or can it be flushed before 5s if it hits a size limit?

lundman commented 4 years ago

Take a look at:

https://github.com/openzfsonwindows/ZFSin/blob/6304af84c0092c1453a4ddc5779c10a968b336ce/ZFSin/zfs/module/zfs/dsl_pool.c#L111

https://github.com/openzfsonwindows/ZFSin/blob/0766fda9a8875595c492e64f64273862462672e2/ZFSin/zfs/module/zfs/dmu_tx.c#L755
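
Roughly, those links point at the dirty-data and txg tunables (names and defaults paraphrased from upstream dsl_pool.c / dmu_tx.c / txg.c of that era; verify against the ZFSin tree before tuning):

```
/* Paraphrased from upstream sources of that era -- verify exact names,
 * types and defaults in the ZFSin tree before tuning anything. */
extern unsigned long	zfs_dirty_data_max;		/* absolute cap on dirty (unsynced) data per pool */
extern int		zfs_dirty_data_max_percent;	/* default cap: 10% of physical memory            */
extern int		zfs_delay_min_dirty_percent;	/* start delaying writers at ~60% of the cap      */
extern unsigned long	zfs_delay_scale;		/* how hard dmu_tx_delay() throttles writers      */
extern int		zfs_txg_timeout;		/* force a txg sync every ~5 seconds regardless    */

/* A txg closes when zfs_txg_timeout expires OR enough dirty data has
 * accumulated, whichever comes first -- so under heavy write load a
 * txg can be flushed well before the 5-second timer fires.            */
```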

imtiazdc commented 4 years ago

@lundman Could you also comment on the "compression" seen in the stacktrace?

I will put together all my findings based on the above comments and let you know. We are making good progress.

lundman commented 4 years ago

It would be interesting to know if zfs get compression says anything at all. I know that if lz4 is enabled as a feature, zfs will use it for metadata, but not for the zvol data itself.

imtiazdc commented 4 years ago

It would be interesting to know if zfs get compression says anything at all?

Right now, I have the debugger attached and target paused for some investigation. What can I inspect through the debugger to find out if compression is on/off?

lundman commented 4 years ago

there's no rush - if it's doing lz4 when it shouldn't, we'll come across it again :)

imtiazdc commented 4 years ago

@lundman Glad to report that I am not seeing writes choking on physical (512b sector size) hard drives attached to a VM. I have been running the Iometer workload for more than 10 hours.

Here's my zpool/zvol config:

zpool create -o ashift=12 tank PHYSICALDRIVE1
zfs create -s -V 7e -o volblocksize=4k -o sync=always tank/zvol

Looking forward to merging your change into our fork once it is merged upstream.