Open imtiazdc opened 4 years ago
As can be seen from stack traces below, the writer threads are waiting for the mutex in open context (and not syncing context).
```
4.001524 ffffc78fb6226040 0000001 RUNNING
nt!DbgBreakPointWithStatus
nt!KeAccumulateTicks+0x39c
nt!KeClockInterruptNotify+0xb8
hal!HalpTimerClockIpiRoutine+0x15
nt!KiCallInterruptServiceRoutine+0x106
nt!KiInterruptSubDispatchNoLockNoEtw+0xea
nt!KiInterruptDispatchNoLockNoEtw+0x37
ZFSin!spl_mutex_enter+0x14b
ZFSin!dbuf_find+0x105
ZFSin!__dbuf_hold_impl+0x23f
ZFSin!dbuf_hold_impl+0x87
ZFSin!dbuf_hold_level+0x50
ZFSin!dbuf_hold+0x29
ZFSin!dnode_hold_impl+0x40c
ZFSin!dnode_hold+0x44
ZFSin!dmu_tx_hold_object_impl+0x44
ZFSin!dmu_tx_hold_write+0x176
ZFSin!zvol_write_win+0x1f4
ZFSin!wzvol_WkRtn+0x267
ZFSin!wzvol_GeneralWkRtn+0x6d
nt!IopProcessWorkItem+0x12a
nt!ExpWorkerThread+0xe9
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16
```
```
4.000ee4 ffffc78fb60c1800 0000000 Blocked
nt!KiSwapContext+0x76
nt!KiSwapThread+0x17d
nt!KiCommitThreadWait+0x14f
nt!KeWaitForSingleObject+0x377
ZFSin!spl_mutex_enter+0x12b
ZFSin!dbuf_find+0x7c
ZFSin!__dbuf_hold_impl+0x23f
ZFSin!dbuf_hold_impl+0x87
ZFSin!dbuf_hold_level+0x50
ZFSin!dbuf_hold+0x29
ZFSin!dnode_hold_impl+0x40c
ZFSin!dnode_hold+0x44
ZFSin!dmu_tx_hold_object_impl+0x44
ZFSin!dmu_tx_hold_write+0x176
ZFSin!zvol_write_win+0x1f4
ZFSin!wzvol_WkRtn+0x267
ZFSin!wzvol_GeneralWkRtn+0x6d
nt!IopProcessWorkItem+0x12a
nt!ExpWorkerThread+0xe9
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16

4.001628 ffffc78fb5c07800 0000000 Blocked (identical stack)
4.001688 ffffc780003e4040 0000000 Blocked (identical stack)
4.001820 ffffc78000f55040 0000000 Blocked (identical stack)
4.001840 ffffc7800230e040 0000000 Blocked (identical stack)
4.00186c ffffc78002302040 0000000 Blocked (identical stack)
4.0015c8 ffffc7801ade5040 0000000 Blocked (identical stack)
```
This thread:
```
4.001524 ffffc78fb6226040 0000001 RUNNING
nt!DbgBreakPointWithStatus
nt!KeAccumulateTicks+0x39c
nt!KeClockInterruptNotify+0xb8
hal!HalpTimerClockIpiRoutine+0x15
nt!KiCallInterruptServiceRoutine+0x106
nt!KiInterruptSubDispatchNoLockNoEtw+0xea
nt!KiInterruptDispatchNoLockNoEtw+0x37
ZFSin!spl_mutex_enter+0x14b
```
Is it still alive? Does it eventually finish the task and release the mutex? It's just that the stack looks exactly like the one you get with a DPC_WATCHDOG_VIOLATION when it has actually crashed.
Is it still alive? Yes. Does it eventually finish the task and release the mutex? Yes.
There is no crash; it's just that the write performance is significantly degraded. In perfmon, we see writes happening at around 10 MBps on the zvol for a couple of seconds, then dropping to 0 for a few seconds, then another burst of writes, then back to 0 again, and so on.
I suspect the way locks are acquired is hurting the throughput.
Not particularly familiar with this code, but all 8 threads go down from `zvol_write_win()`:

```
dmu_tx_hold_write(tx, ZVOL_OBJ, off, bytes);
```

where `ZVOL_OBJ == 1`. This enters `err = dnode_hold(os, object, FTAG, &dn);`, where only the object is passed along, not the offset or length (`bytes`).
We convert it:

```
mdn = DMU_META_DNODE(os);
blk = dbuf_whichblock(mdn, 0, object * sizeof (dnode_phys_t));
```

which is computed to the same value each time.
Even though there are 8 threads writing to different locations, they are all blocked by one mutex due to:

```
uint64_t hv = dbuf_hash(os, obj, level, blkid);
```

Since none of the arguments change between calls, the result is also computed to the same value each time.
@ahrens Did I miss anything? Are there any mitigations for multiple writing threads?
I would recommend using `dmu_tx_hold_write_by_dnode()` instead of `dmu_tx_hold_write()`, where the `dnode_t*` is saved in a zvol-specific struct when the zvol is opened. We should make the same change in openzfs as well.
Unfortunately, the structs are different in Linux and Windows. Specifically, `dnode_t*` is missing from Windows' `zvol_state_t`.
Windows: https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/include/sys/zvol.h#L59
Linux: https://github.com/openzfs/zfs/blob/master/include/sys/zvol_impl.h#L41
Thoughts?
Awesome, cheers @ahrens .
@imtiazdc Let me work on porting the PR over, change zvol_state to our needs and work out how to open the zvol by saving the dnode.
OK it would look something like this (2 commits): https://github.com/openzfsonwindows/ZFSin/tree/bydnode/ZFSin
15e6376e87330348c9a8f013e42bbedc949c3a71 65faac1ab1468a43fea91a1933a4a95c1fc7f094
Took the chance to clean things up again; we don't actually need `dmu_(read|write)_win()`-style functions. That is something macOS needs in order to pass the C++ class around.
I've not had a chance to test it much.
Thanks @ahrens @lundman for the quick turnaround!
Above is a snapshot with these changes. To the left is write performance on zvol (8GB, thick provisioned, default settings, no dedup, etc). To the right is write performance on zpool (used Fixed Disk Type when serving the VHD from Hyper-V to this VM).
I still see writes dropping to 0 on the zvol (on the zpool the writes are still continuous). However, those 0-write dips appear much less frequently compared to how it was without this latest change. Overall, the change looks good.
With dedup ON, we see significantly lower writes on the zpool!
@lundman I just picked all your changes from bydnode branch. Here are some observations:
```
4.0014d8 ffffd709728c3040 0009f1a Blocked
nt!KiSwapContext+0x76
nt!KiSwapThread+0x17d
nt!KiCommitThreadWait+0x14f
nt!KeWaitForMultipleObjects+0x1fe
ZFSin!spl_cv_wait+0xf3
ZFSin!taskq_thread_wait+0xb2
ZFSin!taskq_thread+0x39e
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16
```
and 250 threads are in the blocked state with the stack traces below:
```
4.001648 ffffd7097258b040 000018b Blocked
nt!KiSwapContext+0x76
nt!KiSwapThread+0x17d
nt!KiCommitThreadWait+0x14f
nt!KeWaitForMultipleObjects+0x1fe
ZFSin!spl_cv_wait+0xf3
ZFSin!txg_wait_synced+0x13b
ZFSin!dmu_tx_wait+0x2b6
ZFSin!dmu_tx_assign+0x17b
ZFSin!zvol_write+0x235
ZFSin!wzvol_WkRtn+0x419
ZFSin!wzvol_GeneralWkRtn+0x6d
nt!IopProcessWorkItem+0x12a
nt!ExpWorkerThread+0xe9
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16

4.00160c ffffd7097290f080 0000191 Blocked
nt!KiSwapContext+0x76
nt!KiSwapThread+0x17d
nt!KiCommitThreadWait+0x14f
nt!KeWaitForMultipleObjects+0x1fe
ZFSin!spl_cv_wait+0xf3
ZFSin!zio_wait+0x1b5
ZFSin!dsl_pool_sync+0x2e7
ZFSin!spa_sync+0xb23
ZFSin!txg_sync_thread+0x3f9
nt!PspSystemThreadStartup+0x41
nt!KiStartSystemThread+0x16
```
Does that give any clue as to what might have gone wrong?
Stacks of all threads running zfsin code if that helps: threads.txt
Update: going by the discussion on the write amplification issue, I specified the ashift and volblocksize explicitly, and the write amplification does look under control. However, the write choppiness remains.

```
zpool create -o ashift=12 tank PHYSICALDRIVE2
zfs create -s -V 8g -o volblocksize=4k tank/zvol
```
I will break into the debugger after a few hours of Iometer writes and see what is happening with the thread count and locking.
OK, let's see. The point of the `sync` commit was that `sync` used to always be `true`, no matter what. With sync writes on, the zvol will wait for the data to reach the disk before continuing. Safe, but slow. With `sync=standard`, the writes get bunched together and flushed with `spa_sync(txg)`. In theory, this should get you better/faster writes.
But more importantly, it should be user-selectable now, i.e. `zfs set sync=always pool/volume` should bring back the "smooth writes" as before.
Now, I've often wondered if there is some issue in the mutex/condvar code: could it be missing signals/broadcasts? Sometimes a thread appears to have been signaled but sits there doing nothing until the next signal comes in. Is this something you have noticed?
```
ZFSin!taskq_thread_wait+0xb2
```

ZFS spawns a bunch of taskq threads that idle waiting for something to do: https://github.com/openzfsonwindows/ZFSin/blob/master/ZFSin/zfs/module/zfs/spa.c#L157. When a thread is given a task via `taskq_dispatch*()`, it runs it, then returns to idle, ready to do more work.
The taskq "number of threads" is based on a bunch of things, number of cores, writers, each pool has a number, each dataset, etc. But it is worth noting those values are all "Solaris" defaults. I have not yet looked at what tweaks Windows might want. We might be over-creating threads, but Windows does seem pretty decent at threads.
One txg will fill with data until it hits the per-txg limit, then it will wait for it to quiesce and for spa_sync to complete, then it starts again. The txg limits have many knobs you can tune and tweak: https://www.delphix.com/blog/delphix-engineering/zfs-fundamentals-transaction-groups
I'm no expert with those tunables, so experiment and report what you find.
There is also a write-throttle, which I believe was added so there is always a little room for other datasets to not be starved out.
Are we making progress at least?
Compression seems to be happening although I didn't turn it on (I am using default settings). Is this expected?
```
 # Child-SP          RetAddr           Call Site
00 fffff803`e632cd88 fffff803`e463b49c nt!DbgBreakPointWithStatus
01 fffff803`e632cd90 fffff803`e4638b29 nt!KeAccumulateTicks+0x39c
02 fffff803`e632cdf0 fffff803`e4e28366 nt!KeClockInterruptNotify+0x469
03 fffff803`e632cf40 fffff803`e46b35f6 hal!HalpTimerClockInterrupt+0x56
04 fffff803`e632cf70 fffff803`e4766d8a nt!KiCallInterruptServiceRoutine+0x106
05 fffff803`e632cfb0 fffff803`e4767277 nt!KiInterruptSubDispatchNoLockNoEtw+0xea
06 ffff9e80`68097270 fffff80c`398408b3 nt!KiInterruptDispatchNoLockNoEtw+0x37
07 ffff9e80`68097400 fffff80c`39841890 ZFSin!LZ4_compressCtx+0x1c3 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 519]
08 ffff9e80`680974e0 fffff80c`39841624 ZFSin!real_LZ4_compress+0xf0 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 858]
09 ffff9e80`68097540 fffff80c`39967f39 ZFSin!lz4_compress_zfs+0xa4 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\lz4.c @ 61]
0a ffff9e80`68097590 fffff80c`3994f79e ZFSin!zio_compress_data+0x1a9 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio_compress.c @ 126]
0b ffff9e80`68097620 fffff80c`39958812 ZFSin!zio_write_compress+0x99e [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 1618]
0c ffff9e80`68097a40 fffff80c`396e16cf ZFSin!__zio_execute+0x342 [C:\Users\imohammad\Downloads\ZFSin\ZFSin\zfs\module\zfs\zio.c @ 1997]
0d ffff9e80`68097ad0 fffff803`e471226d ZFSin!taskq_thread+0x48f [C:\Users\imohammad\Downloads\ZFSin\ZFSin\spl\module\spl\spl-taskq.c @ 1610]
0e ffff9e80`68097b90 fffff803`e476c896 nt!PspSystemThreadStartup+0x41
0f ffff9e80`68097be0 00000000`00000000 nt!KiStartSystemThread+0x16
```
With sync=standard the writes get bunched together and flushed with spa_sync(txg). In theory, this should get you better/faster writes.
What is the downside of using "standard" setting for sync? Can there be data loss?
Yes, up to one txg. The pool will always be consistent (either pointing to txg-1 data, or to txg data, since uber block is updated last). But the last 5s of writes may not be there.
OK. Thanks for clarifying. Is 1 txg always 5s worth of data or can it be flushed before 5s if it hits a size limit?
@lundman Could you also comment on the "compression" seen in the stacktrace?
I will put together all my findings based on above comments and let you know. We are making good progress.
It would be interesting to know if `zfs get compression` says anything at all. I know that if `lz4` is enabled as a feature, zfs will use it for metadata, but not for the zvol data itself.
It would be interesting to know if zfs get compression says anything at all?
Right now, I have the debugger attached and target paused for some investigation. What can I inspect through the debugger to find out if compression is on/off?
There's no rush; if it's doing lz4 when it shouldn't, we'll come across it again :)
@lundman Glad to inform that I am not seeing writes choking on physical (512b sector size) hard drives attached to a VM. Have been running Iometer workload for more than 10 hours.
Here's my zpool/zvol config:

```
zpool create -o ashift=12 tank PHYSICALDRIVE1
zfs create -s -V 7e -o volblocksize=4k -o sync=always tank/zvol
```
Looking forward to merging your change into our fork once it is merged upstream.
threads.txt
I see a performance issue with all writers trying to acquire the same mutex in `dbuf_hash_table` while writing to a zvol (RAW disk). I am using Iometer to generate the workload. During live debugging, I found that there are 8 threads executing `dbuf_find` (as part of `zvol_write_win`), of which 7 are waiting for the mutex held by the 8th thread. The output of `!stacks 2 zfsin` is attached.
Here's the problem: all 8 threads pass the same input parameters (shown below) to `dbuf_find` (0xffffc78f`a4734180 0 0 0), which causes them to map to the same mutex within the hash table.
Is this expected and mandatory in zvol writes? Are there ways to improve?
Below is my setup:
VM configuration: 4 vCPU, 8 GB RAM
zvol settings: 8 GB (thick provisioned); everything is default (dedup=off, compression=off, ...)
Iometer settings: 4 workers; data pattern = full random; number of outstanding IOs = 64; access spec = 4 KiB, 0% read, 100% random
Thanks, Imtiaz