Closed: omarkilani closed this issue 1 year ago.
I rebooted the VM, ran a scrub, and everything was fine on the pool. I can write to it normally and whatnot. Then I ran `pg_dump` again and `z_wr_iss_h` hung again.
I've changed the VM from an AMD Milan CPU to an Intel Ice Lake CPU:
model name : Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz
And weirdly... `pg_dump` looks to be working normally there.
It has the exact same boot volume, kernel, zfs module, pool, etc.
The only difference is the CPU.
Pretty weird.
Well... the `pg_dump` completed successfully on the Ice Lake processor. Dumped 5TB of data (2.8TB lz4-compressed) from ZFS into about 600GB of gzip'ed data on XFS.
I don't even know where to begin with "AMD EPYC processor causes zfs to hang". Sigh.
I think I know (since I don't know of any other virtualization solutions that let you flip that bit so easily), but precisely what virtualization tool and version are you using? And do you have any idea how far into the `pg_dump` it gets before hanging? (I presume the version of postgres is whatever's latest stable in RHEL 8.4?)
Hey @rincebrain,
These are VMs running at Oracle Cloud (where I was doing the arm64 testing before).
It's kvm based:
[ 0.000000] efi: EFI v2.70 by EDK II
[ 0.000000] efi: SMBIOS=0xbf742000 ACPI=0xbf97e000 ACPI 2.0=0xbf97e014 MEMATTR=0xbe0fc018 MOKvar=0xbf73c000
[ 0.000000] secureboot: Secure boot disabled
[ 0.000000] SMBIOS 2.8 present.
[ 0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr b51401001, primary cpu clock
[ 0.000000] kvm-clock: using sched offset of 9596802563 cycles
[ 0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.000000] kvm-guest: KVM setup pv remote TLB flush
[ 0.000000] kvm-guest: setup PV sched yield
[ 0.000000] kvm-guest: stealtime: cpu 0, msr 17cfe2c080
[ 0.000000] kvm-guest: PV spinlocks enabled
[ 0.001311] kvm-guest: setup PV IPIs
[ 0.001000] kvm-clock: cpu 1, msr b51401041, secondary cpu clock
[ 0.020047] kvm-guest: stealtime: cpu 1, msr 17cfe6c080
[ 0.001000] kvm-clock: cpu 2, msr b51401081, secondary cpu clock
[ 0.020385] kvm-guest: stealtime: cpu 2, msr 17cfeac080
[ 0.001000] kvm-clock: cpu 3, msr b514010c1, secondary cpu clock
[ 0.021063] kvm-guest: stealtime: cpu 3, msr 17cfeec080
[ 0.001000] kvm-clock: cpu 4, msr b51401101, secondary cpu clock
[ 0.021341] kvm-guest: stealtime: cpu 4, msr 17cff2c080
[ 0.001000] kvm-clock: cpu 5, msr b51401141, secondary cpu clock
[ 0.021341] kvm-guest: stealtime: cpu 5, msr 17cff6c080
[ 0.001000] kvm-clock: cpu 6, msr b51401181, secondary cpu clock
[ 0.022299] kvm-guest: stealtime: cpu 6, msr 17cffac080
[ 0.001000] kvm-clock: cpu 7, msr b514011c1, secondary cpu clock
[ 0.022501] kvm-guest: stealtime: cpu 7, msr 17cffec080
[ 0.001000] kvm-clock: cpu 8, msr b51401201, secondary cpu clock
[ 0.023334] kvm-guest: stealtime: cpu 8, msr 17d002c080
[ 0.001000] kvm-clock: cpu 9, msr b51401241, secondary cpu clock
[ 0.023367] kvm-guest: stealtime: cpu 9, msr 17d006c080
[ 0.001000] kvm-clock: cpu 10, msr b51401281, secondary cpu clock
[ 0.024193] kvm-guest: stealtime: cpu 10, msr 17d00ac080
[ 0.001000] kvm-clock: cpu 11, msr b514012c1, secondary cpu clock
[ 0.024374] kvm-guest: stealtime: cpu 11, msr 17d00ec080
[ 0.161088] clocksource: Switched to clocksource kvm-clock
[ 0.928366] systemd[1]: Detected virtualization kvm.
[ 11.028118] systemd[1]: Detected virtualization kvm.
This is a rebuilt SRPM from the PGDG project:
Since it's x86_64 I just used the Postgres 13 RPMs we use in production. We have some custom Postgres extensions that we add to the RPMs, but they're not used in the `pg_dump` process.
The `pg_dump` got to 330GB the first time, and 390GB the second time.
Okay, so the other weird thing is that a `pg_restore` onto the ZFS file system from the XFS file system works fine on the EPYC VM. That's 2.8TB lz4-compressed / 5TB logicalused.
The `pg_dump` is a read workload and I had N = number of CPUs dump jobs running, which is fairly intensive (the `pg_restore` is also N = number of CPUs and that seems okay).
So maybe the database went to write something out while that was going on and `z_wr_iss_h` / `txg_quiesce` hung for whatever reason. Some weird locking thing on EPYC? 🤷♂️
Yeah, I was assuming similarly that it was an issue with heavy reads and then some writes while they go, and planning to try experimenting with 400+ GiB pgsql databases (though I cannot, as it turns out, map 64 logical cores to 64 distinct real cores, locally), but haven't gotten a chance to yet.
I'm slightly surprised more people haven't screamed if there's a problem common enough for you to hit, even if EPYC-specific, unless you got really lucky. Oh well, more poking around will answer that.
Milan is fairly new and has had some major internal changes so I’m going to try it on the previous gen EPYC (Rome) at both OCI and GCP.
It might be possible for you to generate enough data with `pgbench -i -s NNNNNN` where the scale factor is huge — set `fsync = off` in `postgresql.conf` beforehand to make that go a little faster.
I think a value of 100 generates something like 25GB of real data.
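For reference, a sketch of that data-generation step (the database name `bench` and the scale factor here are my own illustrative choices, and it only actually runs if a reachable cluster exists):

```shell
# Hypothetical sketch of the pgbench data-generation step; "bench" and
# the scale factor are illustrative. Set fsync = off in postgresql.conf
# beforehand (scratch machines only) to speed up the initial load.
cmd="pgbench -i -q -s 2000 bench"
if command -v pgbench >/dev/null 2>&1 && pg_isready -q >/dev/null 2>&1; then
    $cmd
else
    echo "no reachable cluster; would run: $cmd"
fi
```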
Okay... Something Weird (tm) is going on with these Milan machines.
The kernel is spending significant time in `native_write_msr`, `native_apic_msr_eoi_write` and `native_read_msr`.
Maybe something is going on with locking... for some reason. Or the hypervisor isn't happy on Milan for whatever reason.
Maybe interrupts / X2APIC.
Can you try scaling down the {pg_dump threads, that + actual VM cores} and see if performance actually improves with one or both?
Yup. I'm just going to finish this test on OCI Milan VM / OCI Rome VM / GCP n2d Rome VM first since I happened to have the same data in both places.
FWIW, I'm using 12 CPU VMs for Rome/Milan and Ice Lake at OCI and running with -j 12.
Just going by my possibly hazy memory... the output of `perf top` while running on the Milan vs the Ice Lake looks very different. I'll have to double check that.
(I'm just using this ticket to keep all the data in one place -- feel free to ignore my rants etc.)
On GCP, I'm using an `n2d-highmem-16` shape since it's the closest to the OCI E4.Flex shape with 12 CPUs and 96GB of RAM.
I'm trying to set up the same environment for ZFS by attaching 16 NVMe disks to it and using them all in a pool. On OCI I have 12 members in the pool x 500GB; on GCP I have 16 members x 375GB, as you can only attach them in multiples of 8.
I have to copy the data off block storage (since that's on XFS) onto the zpool. That's taking a while because those NVMe disks are extremely slow for writes.
I'm running the `rhel-8-v20210609` image with the same `4.18.0-305.3.1.el8_4.x86_64` kernel.
All RPMs used are the same across both OCI / GCP (zfs, Postgres, etc).
Once the data is copied over I'll run `pg_dump` with the same parameters as I used before, but with `-j 16` instead.
I'm not sure using fewer cores will change much, as it looks like it's only using 7 cores at that point.
[root@instance-20210606-1831 ~]# du -sh /dump/pg3/
372G /dump/pg3/
[root@instance-20210606-1831 ~]# ps auxww | grep COPY
postgres 1612463 42.4 0.2 34719172 243888 ? Ds 02:23 50:51 postgres: postgres x [local] COPY
postgres 1612466 48.9 0.2 34719288 255328 ? Ds 02:23 58:40 postgres: postgres x [local] COPY
postgres 1612469 17.1 0.2 34719176 243828 ? Ds 02:23 20:30 postgres: postgres x [local] COPY
postgres 1612472 30.6 0.6 34723496 650632 ? Ds 02:23 36:40 postgres: postgres x [local] COPY
postgres 1612473 32.2 0.2 34719176 243816 ? Ds 02:23 38:37 postgres: postgres x [local] COPY
postgres 1612476 14.0 19.8 34777296 19582612 ? Ds 02:23 16:48 postgres: postgres x [local] COPY
postgres 1612478 18.1 17.6 34782452 17420144 ? Rs 02:23 21:41 postgres: postgres x [local] COPY
These are my `pg_dump` args:
/usr/pgsql-13/bin/pg_dump -Fd -f /dump/pg3 -Z 1 -v -j 12 --no-sync x
`/dump` is an XFS file system mounted with `-o noatime`.
The source database is Postgres 11.12, though. It shouldn't matter, but I'm including it for the sake of presenting all the data points.
The `pg_dump` was successful on GCP, but... it was also successful at OCI for the first time.
The two differences at OCI vs the other 3 times I ran it:
1. I'd `pg_restore`'ed the dump into a new Postgres 13.3 database on the same ZFS filesystem, so I had written to roughly 90% of capacity.
2. I increased the storage performance slider at OCI from 10 to 20. This is like an artificial performance limiter, so you pay more for the larger number/bandwidth/IOPS. The slider maps to GCP performance levels roughly like:
OCI 0 = GCP Standard
OCI 10 = GCP Balanced
OCI 20 = GCP SSD
I have a snapshot of the pool in the exact state where it was hanging, so I'll restore back to that after I adjust the slider down to 10 again.
Rather annoyingly for repro purposes, Oracle upped the performance of the various slider settings... today:
This could explain why it magically started working. I'm going to try a couple of other things but... I don't think ZFS should lock up in general. lol.
Sigh...
I would tend to agree, re: lockups. (Lest you think we're somehow on different pages here.)
I restored the snapshot so I can do this with what used to break it before. I'm running that `pg_dump` now on the AMD Milan VM at OCI.
Every 5 seconds or so the postgres walwriter process writes out something like 8KB (though strace's `write` shows 16KB sometimes, for whatever reason) to the WAL.
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200) = 0
close(4) = 0
write(3, "\230\320\5\0\1\0\0\0\0@\377\222e\210\0\0E\32\0\0\0\0\0\0\0\0 XXXXX"..., 16384) = 16384
fdatasync(3) = 0
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200) = 0
close(4) = 0
write(3, "\230\320\5\0\1\0\0\0\0\200\377\222e\210\0\0\371\30\0\0\0\0\0 XXXXX"..., 8192) = 8192
fdatasync(3) = 0
epoll_create1(EPOLL_CLOEXEC) = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200) = 0
close(4) = 0
write(3, "\230\320\5\0\1\0\0\0\0\240\377\222e\210\0\0%\30\0\0\0\0\0\0 XXXXXX"..., 8192) = 8192
fdatasync(3) = 0
And fd/3 is:
lrwx------ 1 postgres postgres 64 Jun 10 23:17 /proc/496961/fd/3 -> /tank/db/x/pg_wal/000000010000886500000093
So I think that's the kind of thing that got it stuck before. It's not high write traffic at all, but it does call `fdatasync` on the fd.
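That I/O pattern boils down to small writes each followed by `fdatasync` on the same fd. A minimal sketch of what the strace shows (not postgres source):

```python
import os
import tempfile

# Minimal sketch of the walwriter pattern from the strace above:
# small sequential writes, each followed by fdatasync on the same fd.
# On ZFS, each fdatasync forces a ZIL commit.
path = os.path.join(tempfile.mkdtemp(), "wal-segment")
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
for _ in range(4):
    os.write(fd, b"\x00" * 8192)  # one ~8KB WAL flush
    os.fdatasync(fd)              # the call that hangs in the traces below
os.close(fd)
print(os.path.getsize(path))      # 32768
```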
Okay, I can no longer reproduce this on the exact same VM setup with the exact same data -- the upped storage performance limits seem to have played a role in that.
I'm going to retry again with the slowest VPU thing they have. Unfortunately that one is literal spinning disks and it's taking a while for it to move the data across. But should be a good test.
I managed to repro it with a `fio` test case, woo. 🎉
I got to 416GB of dumped data but I triggered the exact same lockup.
Here's the fio script; it simulates the postgres walwriter workload of 8KB `write`s and a `fdatasync` on the fd with 16MB files:
[global]
ioengine=sync
bs=8K
iodepth=1
fdatasync=1
group_reporting
time_based
runtime=6000
numjobs=12
name=raw-write
rw=write
[job1]
filename=/tank/db/foo
size=16MB
You run this alongside the `pg_dump` and sooner or later you trigger...
[28016.449085] INFO: task z_wr_iss_h:9261 blocked for more than 120 seconds.
[28016.455939] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.458492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.461122] z_wr_iss_h D 0 9261 2 0x80004080
[28016.462863] Call Trace:
[28016.463943] __schedule+0x2c4/0x700
[28016.465321] ? wbt_exit+0x30/0x30
[28016.466676] ? __wbt_done+0x30/0x30
[28016.469090] schedule+0x38/0xa0
[28016.471471] io_schedule+0x12/0x40
[28016.473133] rq_qos_wait+0xfa/0x170
[28016.474556] ? karma_partition+0x1e0/0x1e0
[28016.476261] ? wbt_exit+0x30/0x30
[28016.477845] wbt_wait+0x99/0xe0
[28016.479390] __rq_qos_throttle+0x23/0x30
[28016.480970] blk_mq_make_request+0x12d/0x5b0
[28016.482775] generic_make_request+0x2db/0x340
[28016.485011] ? abd_to_buf+0xe/0x20 [zfs]
[28016.487466] ? abd_bio_map_off+0x1da/0x210 [zfs]
[28016.490301] submit_bio+0x3c/0x160
[28016.492236] __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[28016.494420] vdev_disk_io_start+0x8d/0x370 [zfs]
[28016.496037] zio_vdev_io_start+0xf8/0x340 [zfs]
[28016.497436] zio_nowait+0xb4/0x1a0 [zfs]
[28016.498961] vdev_mirror_io_start+0xa2/0x250 [zfs]
[28016.500774] zio_vdev_io_start+0x2c2/0x340 [zfs]
[28016.502390] zio_execute+0x81/0x120 [zfs]
[28016.503882] taskq_thread+0x2f0/0x530 [spl]
[28016.505363] ? wake_up_q+0x80/0x80
[28016.506733] ? zio_gang_tree_free+0x60/0x60 [zfs]
[28016.508289] ? taskq_thread_spawn+0x50/0x50 [spl]
[28016.510009] kthread+0x116/0x130
[28016.511202] ? kthread_flush_work_fn+0x10/0x10
[28016.512842] ret_from_fork+0x22/0x40
[28016.514734] INFO: task txg_quiesce:10004 blocked for more than 120 seconds.
[28016.518027] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.521945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.524217] txg_quiesce D 0 10004 2 0x80004080
[28016.526047] Call Trace:
[28016.527039] __schedule+0x2c4/0x700
[28016.528116] ? kfree+0x40b/0x450
[28016.529107] schedule+0x38/0xa0
[28016.530051] cv_wait_common+0xfb/0x130 [spl]
[28016.531215] ? finish_wait+0x80/0x80
[28016.532617] txg_quiesce+0x1d3/0x260 [zfs]
[28016.533706] txg_quiesce_thread+0xf9/0x180 [zfs]
[28016.534833] ? txg_quiesce+0x260/0x260 [zfs]
[28016.535863] ? __thread_exit+0x20/0x20 [spl]
[28016.536907] thread_generic_wrapper+0x6f/0x80 [spl]
[28016.538036] kthread+0x116/0x130
[28016.538872] ? kthread_flush_work_fn+0x10/0x10
[28016.540056] ret_from_fork+0x22/0x40
[28016.541027] INFO: task fio:3958565 blocked for more than 120 seconds.
[28016.542512] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.544334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.546814] fio D 0 3958565 3957974 0x00004080
[28016.549112] Call Trace:
[28016.549885] __schedule+0x2c4/0x700
[28016.550870] ? mutex_lock+0xe/0x30
[28016.551859] schedule+0x38/0xa0
[28016.552679] cv_wait_common+0xfb/0x130 [spl]
[28016.553775] ? finish_wait+0x80/0x80
[28016.554824] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.556051] zil_commit_impl+0x16f/0x240 [zfs]
[28016.557413] zfs_fsync+0x70/0xf0 [zfs]
[28016.558768] zpl_fsync+0x67/0xa0 [zfs]
[28016.559959] ? __audit_syscall_exit+0x249/0x2a0
[28016.561112] do_fsync+0x38/0x70
[28016.562181] __x64_sys_fdatasync+0x13/0x20
[28016.563461] do_syscall_64+0x5b/0x1a0
[28016.564687] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.566206] RIP: 0033:0x7f0f4fd8c56f
[28016.567299] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.568686] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.570452] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.572179] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.573876] RBP: 00007f0ee5e53000 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.575551] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.577233] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.579153] INFO: task fio:3958567 blocked for more than 120 seconds.
[28016.581019] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.582960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.584825] fio D 0 3958567 3957974 0x00004080
[28016.586365] Call Trace:
[28016.587266] __schedule+0x2c4/0x700
[28016.588307] ? mutex_lock+0xe/0x30
[28016.589357] schedule+0x38/0xa0
[28016.590228] cv_wait_common+0xfb/0x130 [spl]
[28016.591434] ? finish_wait+0x80/0x80
[28016.592500] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.593797] zil_commit_impl+0x16f/0x240 [zfs]
[28016.595159] zfs_fsync+0x70/0xf0 [zfs]
[28016.596411] zpl_fsync+0x67/0xa0 [zfs]
[28016.597468] ? __audit_syscall_exit+0x249/0x2a0
[28016.598595] do_fsync+0x38/0x70
[28016.599560] __x64_sys_fdatasync+0x13/0x20
[28016.600681] do_syscall_64+0x5b/0x1a0
[28016.601673] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.603006] RIP: 0033:0x7f0f4fd8c56f
[28016.604062] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.606699] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.608554] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.610267] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.611892] RBP: 00007f0ee5e97a18 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.613764] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.615669] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.617555] INFO: task fio:3958569 blocked for more than 120 seconds.
[28016.619367] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.621845] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.624002] fio D 0 3958569 3957974 0x00004080
[28016.625825] Call Trace:
[28016.627124] __schedule+0x2c4/0x700
[28016.628140] ? mutex_lock+0xe/0x30
[28016.629123] schedule+0x38/0xa0
[28016.630136] cv_wait_common+0xfb/0x130 [spl]
[28016.631385] ? finish_wait+0x80/0x80
[28016.632597] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.634062] zil_commit_impl+0x16f/0x240 [zfs]
[28016.635601] zfs_fsync+0x70/0xf0 [zfs]
[28016.636980] zpl_fsync+0x67/0xa0 [zfs]
[28016.638231] ? __audit_syscall_exit+0x249/0x2a0
[28016.639679] do_fsync+0x38/0x70
[28016.640881] __x64_sys_fdatasync+0x13/0x20
[28016.642236] do_syscall_64+0x5b/0x1a0
[28016.643502] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.644926] RIP: 0033:0x7f0f4fd8c56f
[28016.645866] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.647368] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.648971] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.650889] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.652427] RBP: 00007f0ee5edc430 R08: 00007ffdfdfa51b0 R09: 0000000002bb9334
[28016.653962] R10: 0000000002bb9334 R11: 0000000000000293 R12: 0000000000000004
[28016.655654] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.657371] INFO: task fio:3958570 blocked for more than 120 seconds.
[28016.658869] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.660724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.662515] fio D 0 3958570 3957974 0x00004080
[28016.663915] Call Trace:
[28016.664673] __schedule+0x2c4/0x700
[28016.665714] schedule+0x38/0xa0
[28016.666654] cv_wait_common+0xfb/0x130 [spl]
[28016.667882] ? finish_wait+0x80/0x80
[28016.668986] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.670310] zil_commit_impl+0x16f/0x240 [zfs]
[28016.671606] zfs_fsync+0x70/0xf0 [zfs]
[28016.672763] zpl_fsync+0x67/0xa0 [zfs]
[28016.674226] do_fsync+0x38/0x70
[28016.675469] __x64_sys_fdatasync+0x13/0x20
[28016.676919] do_syscall_64+0x5b/0x1a0
[28016.677976] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.679403] RIP: 0033:0x7f0f4fd8c56f
[28016.680706] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.682379] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.684144] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.685890] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.687750] RBP: 00007f0ee5f20e48 R08: 00007ffdfdfa51b0 R09: 0000000002bb9334
[28016.689356] R10: 0000000002bb9334 R11: 0000000000000293 R12: 0000000000000004
[28016.690996] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.692600] INFO: task fio:3958573 blocked for more than 120 seconds.
[28016.694338] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.696557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.698712] fio D 0 3958573 3957974 0x00004080
[28016.700407] Call Trace:
[28016.701434] __schedule+0x2c4/0x700
[28016.702744] ? mutex_lock+0xe/0x30
[28016.703973] schedule+0x38/0xa0
[28016.705124] cv_wait_common+0xfb/0x130 [spl]
[28016.706783] ? finish_wait+0x80/0x80
[28016.708072] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.710225] zil_commit_impl+0x16f/0x240 [zfs]
[28016.712044] zfs_fsync+0x70/0xf0 [zfs]
[28016.713468] zpl_fsync+0x67/0xa0 [zfs]
[28016.714594] ? __audit_syscall_exit+0x249/0x2a0
[28016.715831] do_fsync+0x38/0x70
[28016.716771] __x64_sys_fdatasync+0x13/0x20
[28016.718147] do_syscall_64+0x5b/0x1a0
[28016.719520] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.721425] RIP: 0033:0x7f0f4fd8c56f
[28016.722923] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.725131] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.727838] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.730170] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.732661] RBP: 00007f0ee5f65860 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.735274] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.737570] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.740189] INFO: task fio:3958574 blocked for more than 120 seconds.
[28016.742785] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.745553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.748131] fio D 0 3958574 3957974 0x00004080
[28016.749949] Call Trace:
[28016.751033] __schedule+0x2c4/0x700
[28016.752479] ? mutex_lock+0xe/0x30
[28016.753888] schedule+0x38/0xa0
[28016.755208] cv_wait_common+0xfb/0x130 [spl]
[28016.756663] ? finish_wait+0x80/0x80
[28016.757837] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.759437] zil_commit_impl+0x16f/0x240 [zfs]
[28016.761078] zfs_fsync+0x70/0xf0 [zfs]
[28016.762392] zpl_fsync+0x67/0xa0 [zfs]
[28016.763768] ? __audit_syscall_exit+0x249/0x2a0
[28016.765521] do_fsync+0x38/0x70
[28016.766773] __x64_sys_fdatasync+0x13/0x20
[28016.768313] do_syscall_64+0x5b/0x1a0
[28016.769717] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.771559] RIP: 0033:0x7f0f4fd8c56f
[28016.772646] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.775088] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.778161] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.780521] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.783039] RBP: 00007f0ee5faa278 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.785692] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.788147] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.790588] INFO: task fio:3958576 blocked for more than 120 seconds.
[28016.792348] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.794298] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.796761] fio D 0 3958576 3957974 0x00004080
[28016.798344] Call Trace:
[28016.799236] __schedule+0x2c4/0x700
[28016.800475] ? mutex_lock+0xe/0x30
[28016.801519] schedule+0x38/0xa0
[28016.802672] cv_wait_common+0xfb/0x130 [spl]
[28016.804259] ? finish_wait+0x80/0x80
[28016.805466] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.807268] zil_commit_impl+0x16f/0x240 [zfs]
[28016.809165] zfs_fsync+0x70/0xf0 [zfs]
[28016.810358] zpl_fsync+0x67/0xa0 [zfs]
[28016.811747] ? __audit_syscall_exit+0x249/0x2a0
[28016.813127] do_fsync+0x38/0x70
[28016.814039] __x64_sys_fdatasync+0x13/0x20
[28016.815382] do_syscall_64+0x5b/0x1a0
[28016.816522] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.817880] RIP: 0033:0x7f0f4fd8c56f
[28016.819011] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.820726] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.823542] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.825779] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.827416] RBP: 00007f0ee5feec90 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.829073] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.830771] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.832418] INFO: task fio:3958577 blocked for more than 120 seconds.
[28016.833914] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.835877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.837702] fio D 0 3958577 3957974 0x00004080
[28016.839188] Call Trace:
[28016.840073] __schedule+0x2c4/0x700
[28016.841097] ? mutex_lock+0xe/0x30
[28016.842127] schedule+0x38/0xa0
[28016.843128] cv_wait_common+0xfb/0x130 [spl]
[28016.844529] ? finish_wait+0x80/0x80
[28016.846295] zil_commit_waiter+0xc0/0x150 [zfs]
[28016.848088] zil_commit_impl+0x16f/0x240 [zfs]
[28016.849427] zfs_fsync+0x70/0xf0 [zfs]
[28016.850512] zpl_fsync+0x67/0xa0 [zfs]
[28016.851554] ? __audit_syscall_exit+0x249/0x2a0
[28016.852813] do_fsync+0x38/0x70
[28016.853824] __x64_sys_fdatasync+0x13/0x20
[28016.854903] do_syscall_64+0x5b/0x1a0
[28016.855874] entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.857163] RIP: 0033:0x7f0f4fd8c56f
[28016.858154] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.859788] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.861503] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.863144] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.864598] RBP: 00007f0ee60336a8 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.866016] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.867698] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
One interesting thing to note here is that while sync'ed writes to the filesystem hang, and e.g. `zpool sync` hangs forever:
[root@instance-20210606-1831 ~]# ps auxwww | grep zpool
root 2976364 0.0 0.0 270084 6252 pts/1 D 05:33 0:00 zpool sync
[root@instance-20210606-1831 ~]# strace -p 2976364
strace: Process 2976364 attached
^C^C^C^C^C^C
...the `pg_dump` job is still making progress:
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 2.65T 3.16T 9.24K 0 152M 0
sdb 230G 266G 728 0 15.2M 0
sdc 247G 249G 735 0 12.6M 0
sdd 246G 250G 1022 0 16.6M 0
sde 247G 249G 926 0 14.1M 0
sdf 184G 312G 640 0 8.23M 0
sdg 252G 244G 669 0 14.3M 0
sdh 190G 306G 691 0 9.28M 0
sdi 258G 238G 830 0 14.0M 0
sdm 189G 307G 722 0 10.9M 0
sdj 244G 252G 887 0 13.5M 0
sdk 241G 255G 933 0 12.2M 0
sdl 190G 306G 671 0 10.7M 0
---------- ----- ----- ----- ----- ----- -----
capacity operations bandwidth
pool alloc free read write read write
---------- ----- ----- ----- ----- ----- -----
tank 2.65T 3.16T 9.25K 0 75.4M 0
sdb 230G 266G 930 0 7.19M 0
sdc 247G 249G 922 0 6.76M 0
sdd 246G 250G 986 0 7.86M 0
sde 247G 249G 978 0 7.27M 0
sdf 184G 312G 699 0 5.76M 0
sdg 252G 244G 782 0 6.63M 0
sdh 190G 306G 739 0 6.39M 0
sdi 258G 238G 932 0 6.87M 0
sdm 189G 307G 537 0 4.45M 0
sdj 244G 252G 771 0 5.81M 0
sdk 241G 255G 811 0 6.91M 0
sdl 190G 306G 374 0 3.50M 0
---------- ----- ----- ----- ----- ----- -----
Something I meant to mention before but apparently did not: since the top of your stack seems to be in the writeback throttle Linux bolted on, you could try turning it off and seeing if you can still reproduce.
(I should warn you that I am not at all familiar with the machinery involved here, and am doing simple association. This might change nothing relevant.)
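The throttle in question (wbt) exposes a per-device latency target through sysfs on these kernels, and writing 0 disables it. A sketch for inspecting it (the disabling write needs root, so it's left commented out):

```shell
# Inspect the block-layer writeback throttle (wbt) latency target for
# each disk; 0 means wbt is disabled for that device.
found=0
for f in /sys/block/*/queue/wbt_lat_usec; do
    [ -e "$f" ] || continue         # glob may not match (e.g. in a container)
    found=$((found + 1))
    echo "$f: $(cat "$f")"
    # echo 0 > "$f"                 # disable wbt for this device (needs root)
done
echo "devices with wbt_lat_usec: $found"
```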
Alright, I'll try that. It looks sort of similar to that issue, but not quite, since that reporter's reads were blocked. FWIW the `pg_dump` completed successfully even though writers were deadlocked.
I upgraded to rc7 just to rule out magic, and now that I know I can use fio it's much quicker, as the walwriter only writes very sporadically (I think that was the issue in getting a repro before).
It should be possible to simulate the `pgbench` workload with fio, too, since that's basically just sequential reads.
I'm not sure if it matters kernel-wise that it's written out to XFS, but I don't think so, since it goes through userspace to get processed first.
Alright, rc7 is still affected.
It took 860 seconds to trigger the deadlock with `fio`, which is much faster than the 5 hours I was waiting before.
[ 860.170253] INFO: task z_wr_iss_h:5802 blocked for more than 120 seconds.
[ 860.182902] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[ 860.184791] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 860.186977] z_wr_iss_h D 0 5802 2 0x80004080
[ 860.188514] Call Trace:
[ 860.189296] __schedule+0x2c4/0x700
[ 860.190294] ? elv_rb_del+0x1f/0x30
[ 860.191303] ? wbt_exit+0x30/0x30
[ 860.192327] ? __wbt_done+0x30/0x30
[ 860.193355] schedule+0x38/0xa0
[ 860.194149] io_schedule+0x12/0x40
[ 860.195090] rq_qos_wait+0xfa/0x170
[ 860.195980] ? karma_partition+0x1e0/0x1e0
[ 860.197106] ? wbt_exit+0x30/0x30
[ 860.197993] wbt_wait+0x99/0xe0
[ 860.198782] __rq_qos_throttle+0x23/0x30
[ 860.199521] blk_mq_make_request+0x12d/0x5b0
[ 860.200646] generic_make_request+0x2db/0x340
[ 860.202048] ? abd_to_buf+0xe/0x20 [zfs]
[ 860.203299] ? abd_bio_map_off+0x1da/0x210 [zfs]
[ 860.204651] submit_bio+0x3c/0x160
[ 860.205683] __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[ 860.207422] vdev_disk_io_start+0x8d/0x370 [zfs]
[ 860.208904] zio_vdev_io_start+0xf8/0x340 [zfs]
[ 860.210400] zio_nowait+0xb4/0x1a0 [zfs]
[ 860.211606] vdev_mirror_io_start+0xa2/0x250 [zfs]
[ 860.213140] zio_vdev_io_start+0x2c2/0x340 [zfs]
[ 860.214646] zio_execute+0x81/0x120 [zfs]
[ 860.215878] taskq_thread+0x2f0/0x530 [spl]
[ 860.217038] ? wake_up_q+0x80/0x80
[ 860.218107] ? zio_gang_tree_free+0x60/0x60 [zfs]
[ 860.219734] ? taskq_thread_spawn+0x50/0x50 [spl]
[ 860.221193] kthread+0x116/0x130
[ 860.222184] ? kthread_flush_work_fn+0x10/0x10
[ 860.223612] ret_from_fork+0x22/0x40
[ 860.224784] INFO: task txg_quiesce:6567 blocked for more than 120 seconds.
[ 860.226765] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[ 860.229344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 860.231724] txg_quiesce D 0 6567 2 0x80004080
[ 860.233728] Call Trace:
[ 860.234783] __schedule+0x2c4/0x700
[ 860.236223] ? kfree+0x40b/0x450
[ 860.237503] schedule+0x38/0xa0
[ 860.238733] cv_wait_common+0xfb/0x130 [spl]
[ 860.240377] ? finish_wait+0x80/0x80
[ 860.242076] txg_quiesce+0x1d3/0x260 [zfs]
[ 860.243547] txg_quiesce_thread+0xf9/0x180 [zfs]
[ 860.244818] ? txg_quiesce+0x260/0x260 [zfs]
[ 860.246047] ? __thread_exit+0x20/0x20 [spl]
[ 860.247275] thread_generic_wrapper+0x6f/0x80 [spl]
[ 860.248579] kthread+0x116/0x130
[ 860.249381] ? kthread_flush_work_fn+0x10/0x10
[ 860.250351] ret_from_fork+0x22/0x40
[ 860.251162] INFO: task fio:14621 blocked for more than 120 seconds.
[ 860.252856] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[ 860.255569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 860.257599] fio D 0 14621 14167 0x00004080
[ 860.259532] Call Trace:
[ 860.260568] __schedule+0x2c4/0x700
[ 860.261739] schedule+0x38/0xa0
[ 860.263046] cv_wait_common+0xfb/0x130 [spl]
[ 860.264652] ? finish_wait+0x80/0x80
[ 860.266004] zil_commit_waiter+0xc0/0x150 [zfs]
[ 860.267767] zil_commit_impl+0x16f/0x240 [zfs]
[ 860.269514] zfs_fsync+0x70/0xf0 [zfs]
[ 860.270989] zpl_fsync+0x67/0xa0 [zfs]
[ 860.272414] ? __audit_syscall_exit+0x249/0x2a0
[ 860.274050] do_fsync+0x38/0x70
[ 860.275349] __x64_sys_fdatasync+0x13/0x20
[ 860.276906] do_syscall_64+0x5b/0x1a0
[ 860.278401] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 860.280101] RIP: 0033:0x7f8e7a72056f
[ 860.281525] Code: Unable to access opcode bytes at RIP 0x7f8e7a720545.
[ 860.283600] RSP: 002b:00007fff3ab2b6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 860.285792] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8e7a72056f
[ 860.287680] RDX: 0000000000000000 RSI: 000055ae9bc7eec0 RDI: 0000000000000005
[ 860.289547] RBP: 00007f8e107e7000 R08: 00007fff3ab631b0 R09: 00000000001195dc
[ 860.291429] R10: 00000000001195dc R11: 0000000000000293 R12: 0000000000000004
[ 860.293318] R13: 000055ae9bc7eec0 R14: 0000000000000000 R15: 000055ae9bc7eee8
So that's kind of both a good thing and a bad thing. lol.
I can reliably repro with fio doing writes and pg_dump doing reads... so now I'm trying to come up with a fio script to simulate the pg_dump...
The thing to note here is that the issue isn't only about how fast or slow the disks are -- it's also about how much bandwidth you have to them and how saturated it is with a specific read/write pattern.
So, e.g., the reason it triggers on the Milan VM but not the Ice Lake VM is because, for whatever reason, the Milan VM has 6Gbps of storage bandwidth and the Ice Lake VM has 24Gbps, for the same CPU count.
That's probably a harder thing to trigger on locally attached disks because ZFS is usually slower than the PCIe/SAS/SATA bus, which is... maybe?... why it hasn't been reported too much.
@rincebrain the interesting thing about #12085 and this is that that one is stuck in wbt_wait while my backtraces are after a wbt_exit.
Could be the same thing of course. Just a slight difference.
My understanding is that in a stack trace, anything prefixed with ? can be ignored. Since this can be easily reproduced, it would be nice to know whether it still reproduces after wbt is turned off. Thanks.
@youzhongyang I’m working my way towards testing with wbt off.
It would be nice to have a way to simulate the reads since no one else can probably get a repro otherwise…
@rincebrain the interesting thing about #12085 and this is that that one is stuck in wbt_wait while my backtraces are after a wbt_exit. Could be the same thing of course. Just a slight difference.
Sure, I wasn't trying to suggest you were having the same issue, solely pointing to that example of what to set.
@rincebrain yup, got it. If I can’t repro with wbt off then…?
Is there a tool which generates data that is A.Bx compressible by ZFS’ lz4?
The data fio generates is too random, and lz4 decompression doesn't show up in perf top when running the fio read workload.
I'm not aware of one, but since AFAIK it's stock lz4 with few modifications that should affect the compression ratios, you could reduce this to the problem of finding data that stock lz4 from 2013 (which is when the initial lz4 commit went into illumos, AFAICT) will compress as much as you like, then verify it does what you want.
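One way to approximate an A.Bx-compressible file is to mix incompressible (random) chunks with trivially compressible (zero-filled) ones in the right proportion. A minimal Python sketch of the idea — make_compressible is a hypothetical helper, and zlib is used here only as a rough stand-in for lz4 (exact ratios will differ, but the trend holds):

```python
import os
import zlib

def make_compressible(total_bytes, ratio, chunk=4096):
    """Build a buffer that compresses to roughly 1/ratio of its size
    by mixing random (incompressible) chunks with zero-filled ones."""
    # A 2.5x target means ~40% of the bytes must stay incompressible.
    random_fraction = 1.0 / ratio
    n_chunks = total_bytes // chunk
    n_random = int(n_chunks * random_fraction)
    out = bytearray()
    for i in range(n_chunks):
        out += os.urandom(chunk) if i < n_random else bytes(chunk)
    return bytes(out)

data = make_compressible(1 << 20, ratio=2.5)
# zlib level 1 as a crude proxy for lz4's ratio on this data.
achieved = len(data) / len(zlib.compress(data, 1))
```

Note that ZFS compresses per record (8K in this pool), so for a realistic on-disk ratio you'd want the random/zero mix interleaved at recordsize granularity rather than front-loaded as above. fio's buffer_compress_percentage option may also be worth a look for generating semi-compressible write data directly.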
edit: As to your other question, if you can't repro with WBT off, then it becomes a question of who's doing something wrong, WBT or ZFS, and who's going to do something about it.
Okay, I have a repro with just fio.
VM specs:
Shape: VM.Standard.E4.Flex
OCPU Count: 6
Network Bandwidth (Gbps): 6
Memory (GB): 96
Local Disk: Block Storage Only
Which turns into 12 of these:
processor : 11
vendor_id : AuthenticAMD
cpu family : 25
model : 1
model name : AMD EPYC 7J13 64-Core Processor
stepping : 1
microcode : 0x1000065
cpu MHz : 2545.214
cache size : 512 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 16
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core cpb ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip vaes vpclmulqdq rdpid arch_capabilities
bugs : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips : 5090.42
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
Block volumes: 12x500GB with:
Target Performance: Lower Cost (VPU:0)
Target IOPS: 1000 IOPS
Target Throughput: 120 MB/s
Kernel / zfs:
[root@instance-20210606-1831 fio]# zpool version
zfs-2.1.0-rc7
zfs-kmod-2.1.0-rc7
[root@instance-20210606-1831 fio]# uname -a
Linux instance-20210606-1831 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Tue Jun 1 13:25:54 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux
I ran this script to set wbt_lat_usec to 9375 on every block device:
#!/bin/sh
for f in /sys/block/*/queue/wbt_lat_usec; do
echo 9375 > $f
done
The zpool looks like this:
[root@instance-20210606-1831 ~]# zpool list -v
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 5.81T 3.16T 2.65T - - 29% 54% 1.00x ONLINE -
sdl 496G 274G 222G - - 27% 55.3% - ONLINE
sdb 496G 285G 211G - - 31% 57.5% - ONLINE
sdc 496G 284G 212G - - 32% 57.3% - ONLINE
sdi 496G 290G 206G - - 32% 58.4% - ONLINE
sdg 496G 230G 266G - - 27% 46.3% - ONLINE
sdh 496G 294G 202G - - 31% 59.3% - ONLINE
sdn 496G 235G 261G - - 27% 47.5% - ONLINE
sdk 496G 298G 198G - - 33% 60.1% - ONLINE
sdj 496G 235G 261G - - 26% 47.4% - ONLINE
sdd 496G 287G 209G - - 30% 57.8% - ONLINE
sde 496G 283G 213G - - 31% 57.1% - ONLINE
sdf 496G 239G 257G - - 28% 48.2% - ONLINE
Created with zpool create -o ashift=12 tank /dev/...
The zfs:
zfs create -o compression=lz4 -o xattr=sa -o atime=off -o logbias=throughput -o recordsize=8k -o primarycache=metadata tank/db
pg_walwriter.fio:
[global]
ioengine=sync
bs=8K
iodepth=1
fdatasync=1
group_reporting
time_based
runtime=60000
numjobs=12
name=raw-write
rw=write
[job1]
filename=/tank/db/pg_walwriter_001
size=16MB
pg_dump_read_big.fio:
[global]
ioengine=sync
bs=8K
iodepth=128
direct=1
group_reporting
time_based
runtime=60000
name=raw-read
rw=read
size=512GB
numjobs=48
io_submit_mode=offload
[job1]
filename=/tank/db/pg_dump_read_big
Run them both and wait a while.
Note that the original deadlock happened with primarycache=all, but I changed it to primarycache=metadata so the pg_dump simulation wouldn't hit the ARC (the real Postgres data set was 2.8Tb compressed, so it didn't fit into the 96GB of RAM, but enough of the 512GB did).
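To reproduce, the two job files above just need to run at the same time. A small wrapper might look like this (fio_commands is a hypothetical helper; the actual Popen launch is commented out so the sketch is safe to run without fio installed):

```python
import subprocess

# Job files from the comment above; both must run concurrently
# to reproduce the hang.
JOBS = ["pg_walwriter.fio", "pg_dump_read_big.fio"]

def fio_commands(jobs):
    """Build one fio invocation per job file."""
    return [["fio", job] for job in jobs]

cmds = fio_commands(JOBS)
# To actually launch them in parallel (requires fio on PATH):
# procs = [subprocess.Popen(c) for c in cmds]
# for p in procs:
#     p.wait()
```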
Here's the deadlock with just fio:
[24085.116306] INFO: task z_wr_iss_h:1007366 blocked for more than 120 seconds.
[24085.121243] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.122825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.124256] z_wr_iss_h D 0 1007366 2 0x80004080
[24085.125376] Call Trace:
[24085.125810] __schedule+0x2c4/0x700
[24085.126540] ? elv_rb_del+0x1f/0x30
[24085.127224] ? wbt_exit+0x30/0x30
[24085.127874] ? __wbt_done+0x30/0x30
[24085.128624] schedule+0x38/0xa0
[24085.129166] io_schedule+0x12/0x40
[24085.129872] rq_qos_wait+0xfa/0x170
[24085.130559] ? karma_partition+0x1e0/0x1e0
[24085.131311] ? wbt_exit+0x30/0x30
[24085.131916] wbt_wait+0x99/0xe0
[24085.132929] __rq_qos_throttle+0x23/0x30
[24085.134020] blk_mq_make_request+0x12d/0x5b0
[24085.135333] generic_make_request+0x2db/0x340
[24085.136646] ? abd_to_buf+0xe/0x20 [zfs]
[24085.137998] ? abd_bio_map_off+0x1da/0x210 [zfs]
[24085.139362] submit_bio+0x3c/0x160
[24085.140472] __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[24085.141983] vdev_disk_io_start+0x8d/0x370 [zfs]
[24085.143240] zio_vdev_io_start+0xf8/0x340 [zfs]
[24085.144423] zio_nowait+0xb4/0x1a0 [zfs]
[24085.145617] vdev_mirror_io_start+0xa2/0x250 [zfs]
[24085.146942] zio_vdev_io_start+0x2c2/0x340 [zfs]
[24085.148370] zio_execute+0x81/0x120 [zfs]
[24085.149879] taskq_thread+0x2f0/0x530 [spl]
[24085.151034] ? wake_up_q+0x80/0x80
[24085.152263] ? zio_gang_tree_free+0x60/0x60 [zfs]
[24085.153474] ? taskq_thread_spawn+0x50/0x50 [spl]
[24085.154816] kthread+0x116/0x130
[24085.156182] ? kthread_flush_work_fn+0x10/0x10
[24085.157547] ret_from_fork+0x22/0x40
[24085.158590] INFO: task txg_quiesce:1008466 blocked for more than 120 seconds.
[24085.160272] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.162169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.164169] txg_quiesce D 0 1008466 2 0x80004080
[24085.165870] Call Trace:
[24085.166868] __schedule+0x2c4/0x700
[24085.168191] ? kfree+0x40b/0x450
[24085.169402] schedule+0x38/0xa0
[24085.170501] cv_wait_common+0xfb/0x130 [spl]
[24085.171681] ? finish_wait+0x80/0x80
[24085.172687] txg_quiesce+0x1d3/0x260 [zfs]
[24085.173762] txg_quiesce_thread+0xf9/0x180 [zfs]
[24085.174886] ? txg_quiesce+0x260/0x260 [zfs]
[24085.176005] ? __thread_exit+0x20/0x20 [spl]
[24085.177174] thread_generic_wrapper+0x6f/0x80 [spl]
[24085.178574] kthread+0x116/0x130
[24085.179562] ? kthread_flush_work_fn+0x10/0x10
[24085.180723] ret_from_fork+0x22/0x40
[24085.181859] INFO: task fio:1009087 blocked for more than 120 seconds.
[24085.183619] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.185671] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.187461] fio D 0 1009087 1009085 0x00004080
[24085.188938] Call Trace:
[24085.189956] __schedule+0x2c4/0x700
[24085.191217] schedule+0x38/0xa0
[24085.193105] cv_wait_common+0xfb/0x130 [spl]
[24085.194570] ? finish_wait+0x80/0x80
[24085.196056] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.197269] zil_commit_impl+0x16f/0x240 [zfs]
[24085.198554] zfs_fsync+0x70/0xf0 [zfs]
[24085.199647] zpl_fsync+0x67/0xa0 [zfs]
[24085.200794] ? __audit_syscall_exit+0x249/0x2a0
[24085.202116] do_fsync+0x38/0x70
[24085.203126] __x64_sys_fdatasync+0x13/0x20
[24085.204247] do_syscall_64+0x5b/0x1a0
[24085.205522] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.206915] RIP: 0033:0x7f69eec8156f
[24085.207900] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.209406] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.211347] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.213173] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.214908] RBP: 00007f6984d48000 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.216507] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.218198] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.219861] INFO: task fio:1009088 blocked for more than 120 seconds.
[24085.221460] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.223170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.225017] fio D 0 1009088 1009085 0x00004080
[24085.226427] Call Trace:
[24085.227404] __schedule+0x2c4/0x700
[24085.228645] ? mutex_lock+0xe/0x30
[24085.229920] schedule+0x38/0xa0
[24085.231061] cv_wait_common+0xfb/0x130 [spl]
[24085.232426] ? finish_wait+0x80/0x80
[24085.233585] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.234935] zil_commit_impl+0x16f/0x240 [zfs]
[24085.235912] zfs_fsync+0x70/0xf0 [zfs]
[24085.236891] zpl_fsync+0x67/0xa0 [zfs]
[24085.237953] ? __audit_syscall_exit+0x249/0x2a0
[24085.239352] do_fsync+0x38/0x70
[24085.240481] __x64_sys_fdatasync+0x13/0x20
[24085.241822] do_syscall_64+0x5b/0x1a0
[24085.242958] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.244345] RIP: 0033:0x7f69eec8156f
[24085.245351] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.246920] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.248751] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f69eec8156f
[24085.250300] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000006
[24085.252100] RBP: 00007f6984d8ca18 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.253846] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.255608] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.257202] INFO: task fio:1009089 blocked for more than 120 seconds.
[24085.258794] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.260737] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.262414] fio D 0 1009089 1009085 0x00004080
[24085.263780] Call Trace:
[24085.264590] __schedule+0x2c4/0x700
[24085.265583] ? __switch_to_asm+0x41/0x70
[24085.266645] schedule+0x38/0xa0
[24085.267519] cv_wait_common+0xfb/0x130 [spl]
[24085.268590] ? finish_wait+0x80/0x80
[24085.269576] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.270758] zil_commit_impl+0x16f/0x240 [zfs]
[24085.271906] zfs_fsync+0x70/0xf0 [zfs]
[24085.272944] zpl_fsync+0x67/0xa0 [zfs]
[24085.273964] ? __audit_syscall_exit+0x249/0x2a0
[24085.275168] do_fsync+0x38/0x70
[24085.276135] __x64_sys_fdatasync+0x13/0x20
[24085.277285] do_syscall_64+0x5b/0x1a0
[24085.278336] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.279732] RIP: 0033:0x7f69eec8156f
[24085.280867] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.282573] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.284350] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.286039] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.287784] RBP: 00007f6984dd1430 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.289661] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.291265] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.292910] INFO: task fio:1009090 blocked for more than 120 seconds.
[24085.294367] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.296205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.298011] fio D 0 1009090 1009085 0x00004080
[24085.299397] Call Trace:
[24085.300185] __schedule+0x2c4/0x700
[24085.301177] schedule+0x38/0xa0
[24085.302137] cv_wait_common+0xfb/0x130 [spl]
[24085.303209] ? finish_wait+0x80/0x80
[24085.304283] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.305520] zil_commit_impl+0x16f/0x240 [zfs]
[24085.306681] zfs_fsync+0x70/0xf0 [zfs]
[24085.307777] zpl_fsync+0x67/0xa0 [zfs]
[24085.308838] ? __audit_syscall_exit+0x249/0x2a0
[24085.309985] do_fsync+0x38/0x70
[24085.310926] __x64_sys_fdatasync+0x13/0x20
[24085.312021] do_syscall_64+0x5b/0x1a0
[24085.313062] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.314338] RIP: 0033:0x7f69eec8156f
[24085.315316] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.316941] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.318960] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.320865] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.322509] RBP: 00007f6984e15e48 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.324074] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.325659] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.327325] INFO: task fio:1009091 blocked for more than 120 seconds.
[24085.329045] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.330867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.332518] fio D 0 1009091 1009085 0x00004080
[24085.333926] Call Trace:
[24085.334765] __schedule+0x2c4/0x700
[24085.335834] schedule+0x38/0xa0
[24085.336778] cv_wait_common+0xfb/0x130 [spl]
[24085.337943] ? finish_wait+0x80/0x80
[24085.339007] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.340199] zil_commit_impl+0x16f/0x240 [zfs]
[24085.341454] zfs_fsync+0x70/0xf0 [zfs]
[24085.342456] zpl_fsync+0x67/0xa0 [zfs]
[24085.343408] ? __audit_syscall_exit+0x249/0x2a0
[24085.344499] do_fsync+0x38/0x70
[24085.345375] __x64_sys_fdatasync+0x13/0x20
[24085.346369] do_syscall_64+0x5b/0x1a0
[24085.347315] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.348502] RIP: 0033:0x7f69eec8156f
[24085.349457] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.350972] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.352593] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.354028] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.355646] RBP: 00007f6984e5a860 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.357349] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.359321] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.360919] INFO: task fio:1009092 blocked for more than 120 seconds.
[24085.362466] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.364545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.366632] fio D 0 1009092 1009085 0x00004080
[24085.368335] Call Trace:
[24085.369415] __schedule+0x2c4/0x700
[24085.370637] schedule+0x38/0xa0
[24085.371752] cv_wait_common+0xfb/0x130 [spl]
[24085.373110] ? finish_wait+0x80/0x80
[24085.374341] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.375823] zil_commit_impl+0x16f/0x240 [zfs]
[24085.377294] zfs_fsync+0x70/0xf0 [zfs]
[24085.378551] zpl_fsync+0x67/0xa0 [zfs]
[24085.379544] ? __audit_syscall_exit+0x249/0x2a0
[24085.380843] do_fsync+0x38/0x70
[24085.381916] __x64_sys_fdatasync+0x13/0x20
[24085.382987] do_syscall_64+0x5b/0x1a0
[24085.384066] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.385376] RIP: 0033:0x7f69eec8156f
[24085.386327] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.387778] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.389504] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.391146] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.392772] RBP: 00007f6984e9f278 R08: 00007fffc3d641b0 R09: 0000000002c81846
[24085.394302] R10: 0000000002c81846 R11: 0000000000000293 R12: 0000000000000004
[24085.395859] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.397391] INFO: task fio:1009093 blocked for more than 120 seconds.
[24085.398844] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.400715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.402517] fio D 0 1009093 1009085 0x00004080
[24085.403948] Call Trace:
[24085.404786] __schedule+0x2c4/0x700
[24085.405797] ? mutex_lock+0xe/0x30
[24085.406814] schedule+0x38/0xa0
[24085.407740] cv_wait_common+0xfb/0x130 [spl]
[24085.408953] ? finish_wait+0x80/0x80
[24085.410072] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.411265] zil_commit_impl+0x16f/0x240 [zfs]
[24085.412488] zfs_fsync+0x70/0xf0 [zfs]
[24085.413673] zpl_fsync+0x67/0xa0 [zfs]
[24085.414626] ? __audit_syscall_exit+0x249/0x2a0
[24085.415786] do_fsync+0x38/0x70
[24085.416731] __x64_sys_fdatasync+0x13/0x20
[24085.417898] do_syscall_64+0x5b/0x1a0
[24085.418929] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.420225] RIP: 0033:0x7f69eec8156f
[24085.421184] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.422748] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.424491] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.426109] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.427732] RBP: 00007f6984ee3c90 R08: 00007fffc3d641b0 R09: 0000000002c81846
[24085.429383] R10: 0000000002c81846 R11: 0000000000000293 R12: 0000000000000004
[24085.430976] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.432658] INFO: task fio:1009094 blocked for more than 120 seconds.
[24085.434257] Tainted: P OE --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.436123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.437790] fio D 0 1009094 1009085 0x00004080
[24085.439133] Call Trace:
[24085.439942] __schedule+0x2c4/0x700
[24085.440949] ? mutex_lock+0xe/0x30
[24085.441998] schedule+0x38/0xa0
[24085.443003] cv_wait_common+0xfb/0x130 [spl]
[24085.444222] ? finish_wait+0x80/0x80
[24085.445235] zil_commit_waiter+0xc0/0x150 [zfs]
[24085.446736] zil_commit_impl+0x16f/0x240 [zfs]
[24085.447909] zfs_fsync+0x70/0xf0 [zfs]
[24085.449051] zpl_fsync+0x67/0xa0 [zfs]
[24085.450146] ? __audit_syscall_exit+0x249/0x2a0
[24085.451380] do_fsync+0x38/0x70
[24085.452272] __x64_sys_fdatasync+0x13/0x20
[24085.453341] do_syscall_64+0x5b/0x1a0
[24085.454420] entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.455657] RIP: 0033:0x7f69eec8156f
[24085.456645] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.458223] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.460420] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.462074] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.463679] RBP: 00007f6984f286a8 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.465485] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.467321] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
This one took roughly 3000 seconds to trigger.
Next I'll disable wbt. :)
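Disabling wbt amounts to writing 0 to each device's wbt_lat_usec, mirroring the earlier shell loop. A Python sketch of that (set_wbt is a hypothetical helper; the demo runs against a fake sysfs tree so it's safe to execute without root, while the real call would use the default /sys/block):

```python
import glob
import os
import tempfile

def set_wbt(lat_usec, sysfs_base="/sys/block"):
    """Write a wbt_lat_usec value (0 disables writeback throttling)
    to every block device queue found under sysfs_base."""
    updated = []
    for path in glob.glob(os.path.join(sysfs_base, "*", "queue", "wbt_lat_usec")):
        with open(path, "w") as f:
            f.write(str(lat_usec))
        updated.append(path)
    return updated

# Dry run against a fake sysfs tree instead of the real /sys/block.
fake = tempfile.mkdtemp()
for dev in ("sdb", "sdc"):
    os.makedirs(os.path.join(fake, dev, "queue"))
    with open(os.path.join(fake, dev, "queue", "wbt_lat_usec"), "w") as f:
        f.write("9375")
changed = set_wbt(0, sysfs_base=fake)
```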
I'm running two tests in parallel at the moment.
So far, test 1 has been running for almost 6 hours with quite a bit of I/O load:
08:29:21 up 5:52, 1 user, load average: 217.34, 217.99, 217.35
And hasn't deadlocked. I'm running the two fio workloads and the original pg_dump at the same time.
Looks promising so far.
Edit: still looks good:
[root@instance-20210606-1831 ~]# uptime
12:47:25 up 10:10, 1 user, load average: 219.59, 218.13, 218.83
[root@instance-20210606-1831 ~]# dmesg
[root@instance-20210606-1831 ~]#
On 5.4.17-2102.201.3.el8uek.x86_64, the deadlock is similarly easy to trigger with pg_dump and fio running with wbt_lat_usec = 9375.
[ 1967.106950] INFO: task z_wr_iss_h:6151 blocked for more than 122 seconds.
[ 1967.111868] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.113351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.115027] z_wr_iss_h D 0 6151 2 0x80004080
[ 1967.116258] Call Trace:
[ 1967.117029] __schedule+0x2ab/0x606
[ 1967.117910] ? wbt_exit+0x30/0x2f
[ 1967.118720] ? __wbt_done+0x40/0x35
[ 1967.119566] schedule+0x33/0x9a
[ 1967.120365] io_schedule+0x16/0x39
[ 1967.121206] rq_qos_wait+0xf7/0x16c
[ 1967.122043] ? karma_partition+0x210/0x204
[ 1967.123034] ? wbt_exit+0x30/0x2f
[ 1967.123832] wbt_wait+0x9f/0xdb
[ 1967.124624] __rq_qos_throttle+0x28/0x36
[ 1967.125535] blk_mq_make_request+0x11f/0x5b1
[ 1967.126514] generic_make_request+0xcf/0x30b
[ 1967.127484] submit_bio+0x46/0x1bf
[ 1967.128488] __vdev_disk_physio.constprop.0+0x354/0x460 [zfs]
[ 1967.129731] vdev_disk_io_start+0x96/0x380 [zfs]
[ 1967.130776] zio_vdev_io_start+0x101/0x360 [zfs]
[ 1967.131814] zio_nowait+0xc3/0x1a0 [zfs]
[ 1967.132749] vdev_mirror_io_start+0xaf/0x280 [zfs]
[ 1967.133814] zio_vdev_io_start+0x2d4/0x360 [zfs]
[ 1967.134823] ? tsd_hash_search.isra.0+0x71/0xa0 [spl]
[ 1967.135929] zio_execute+0x89/0x130 [zfs]
[ 1967.136824] taskq_thread+0x31a/0x5a0 [spl]
[ 1967.137775] ? wake_up_q+0x80/0x77
[ 1967.138619] ? zio_gang_tree_free+0x70/0x70 [zfs]
[ 1967.139655] kthread+0x120/0x136
[ 1967.140450] ? taskq_thread_spawn+0x60/0x60 [spl]
[ 1967.141469] ? kthread_park+0x90/0x82
[ 1967.142484] ret_from_fork+0x2b/0x36
[ 1967.143315] INFO: task txg_quiesce:6936 blocked for more than 122 seconds.
[ 1967.144653] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.146095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.147564] txg_quiesce D 0 6936 2 0x80004080
[ 1967.148801] Call Trace:
[ 1967.149514] __schedule+0x2ab/0x606
[ 1967.150335] schedule+0x33/0x9a
[ 1967.151109] cv_wait_common+0x100/0x130 [spl]
[ 1967.152066] ? finish_wait+0x80/0x78
[ 1967.153014] __cv_wait+0x15/0x20 [spl]
[ 1967.153917] txg_quiesce+0x1b9/0x250 [zfs]
[ 1967.154854] txg_quiesce_thread+0xfd/0x180 [zfs]
[ 1967.155900] ? txg_quiesce+0x250/0x250 [zfs]
[ 1967.157010] thread_generic_wrapper+0x79/0x90 [spl]
[ 1967.158170] kthread+0x120/0x136
[ 1967.159022] ? __thread_exit+0x20/0x20 [spl]
[ 1967.160119] ? kthread_park+0x90/0x82
[ 1967.161048] ret_from_fork+0x2b/0x36
[ 1967.162010] INFO: task postmaster:493613 blocked for more than 122 seconds.
[ 1967.163553] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.165162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.166620] postmaster D 0 493613 8157 0x00004084
[ 1967.167750] Call Trace:
[ 1967.168418] __schedule+0x2ab/0x606
[ 1967.169251] schedule+0x33/0x9a
[ 1967.170025] cv_wait_common+0x100/0x130 [spl]
[ 1967.170994] ? finish_wait+0x80/0x78
[ 1967.171915] __cv_wait+0x15/0x20 [spl]
[ 1967.172927] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.174031] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.175156] zil_commit+0x40/0x60 [zfs]
[ 1967.176177] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.177190] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.178076] vfs_fsync_range+0x48/0x7c
[ 1967.179066] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.180145] do_fsync+0x3d/0x6d
[ 1967.180979] __x64_sys_fdatasync+0x17/0x1b
[ 1967.181988] do_syscall_64+0x60/0x1cb
[ 1967.182926] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.184207] RIP: 0033:0x7f64d6222548
[ 1967.185119] Code: Bad RIP value.
[ 1967.185980] RSP: 002b:00007ffd8a9cd6f8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 1967.187893] RAX: ffffffffffffffda RBX: 0000000000d44874 RCX: 00007f64d6222548
[ 1967.189340] RDX: 0000000000761fff RSI: 0000000000886343 RDI: 0000000000000003
[ 1967.190698] RBP: 0000000000886343 R08: 00007ffd8a9cd7c8 R09: 00007ffd8a9cd7c0
[ 1967.192130] R10: 0000000060c4e823 R11: 0000000000000246 R12: 00007f64d8e11d00
[ 1967.193483] R13: 0000886343762000 R14: 0000000000000000 R15: 00007f5c92a1a000
[ 1967.194854] INFO: task fio:502104 blocked for more than 122 seconds.
[ 1967.196121] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.197599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.199142] fio D 0 502104 1 0x00004084
[ 1967.200268] Call Trace:
[ 1967.200932] __schedule+0x2ab/0x606
[ 1967.201766] schedule+0x33/0x9a
[ 1967.202548] cv_wait_common+0x100/0x130 [spl]
[ 1967.203514] ? finish_wait+0x80/0x78
[ 1967.204360] __cv_wait+0x15/0x20 [spl]
[ 1967.205260] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.206270] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.207254] zil_commit+0x40/0x60 [zfs]
[ 1967.208143] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.209086] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.209944] vfs_fsync_range+0x48/0x7c
[ 1967.210820] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.211820] do_fsync+0x3d/0x6d
[ 1967.212594] __x64_sys_fdatasync+0x17/0x1b
[ 1967.213508] do_syscall_64+0x60/0x1cb
[ 1967.214356] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.215405] RIP: 0033:0x7fd7622f956f
[ 1967.216218] Code: Bad RIP value.
[ 1967.217046] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.218488] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.219846] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.221217] RBP: 00007fd6f83c0000 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.222584] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.224054] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.225489] INFO: task fio:502105 blocked for more than 122 seconds.
[ 1967.226882] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.228443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.230077] fio D 0 502105 1 0x00004084
[ 1967.231238] Call Trace:
[ 1967.231981] __schedule+0x2ab/0x606
[ 1967.232806] schedule+0x33/0x9a
[ 1967.233606] cv_wait_common+0x100/0x130 [spl]
[ 1967.234623] ? finish_wait+0x80/0x78
[ 1967.235493] __cv_wait+0x15/0x20 [spl]
[ 1967.236414] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.237448] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.238502] zil_commit+0x40/0x60 [zfs]
[ 1967.239420] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.240374] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.241246] vfs_fsync_range+0x48/0x7c
[ 1967.242094] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.243060] do_fsync+0x3d/0x6d
[ 1967.243810] __x64_sys_fdatasync+0x17/0x1b
[ 1967.244733] do_syscall_64+0x60/0x1cb
[ 1967.245577] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.246607] RIP: 0033:0x7fd7622f956f
[ 1967.247439] Code: Bad RIP value.
[ 1967.248199] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.249669] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.251120] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.252501] RBP: 00007fd6f8404a18 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.254039] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.255457] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.256846] INFO: task fio:502110 blocked for more than 123 seconds.
[ 1967.258118] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.259535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.261270] fio D 0 502110 1 0x00004084
[ 1967.262509] Call Trace:
[ 1967.263195] __schedule+0x2ab/0x606
[ 1967.264070] schedule+0x33/0x9a
[ 1967.265101] cv_wait_common+0x100/0x130 [spl]
[ 1967.266128] ? finish_wait+0x80/0x78
[ 1967.267099] __cv_wait+0x15/0x20 [spl]
[ 1967.268148] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.269174] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.270278] zil_commit+0x40/0x60 [zfs]
[ 1967.271222] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.272237] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.273298] vfs_fsync_range+0x48/0x7c
[ 1967.274239] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.275284] do_fsync+0x3d/0x6d
[ 1967.276136] __x64_sys_fdatasync+0x17/0x1b
[ 1967.277252] do_syscall_64+0x60/0x1cb
[ 1967.278143] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.279243] RIP: 0033:0x7fd7622f956f
[ 1967.280093] Code: Bad RIP value.
[ 1967.280831] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.282296] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.283712] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.285124] RBP: 00007fd6f8449430 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.286777] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.288481] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.289893] INFO: task fio:502113 blocked for more than 123 seconds.
[ 1967.291141] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.292641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.294245] fio D 0 502113 1 0x00004084
[ 1967.295476] Call Trace:
[ 1967.296190] __schedule+0x2ab/0x606
[ 1967.297050] schedule+0x33/0x9a
[ 1967.297864] cv_wait_common+0x100/0x130 [spl]
[ 1967.298880] ? finish_wait+0x80/0x78
[ 1967.299728] __cv_wait+0x15/0x20 [spl]
[ 1967.300714] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.301828] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.303008] zil_commit+0x40/0x60 [zfs]
[ 1967.304121] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.305095] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.306068] vfs_fsync_range+0x48/0x7c
[ 1967.306991] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.308015] do_fsync+0x3d/0x6d
[ 1967.308812] __x64_sys_fdatasync+0x17/0x1b
[ 1967.309719] do_syscall_64+0x60/0x1cb
[ 1967.310808] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.311904] RIP: 0033:0x7fd7622f956f
[ 1967.312723] Code: Bad RIP value.
[ 1967.313523] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.315026] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.316466] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.317884] RBP: 00007fd6f848de48 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.319332] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.320800] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.322315] INFO: task fio:502120 blocked for more than 123 seconds.
[ 1967.323674] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.325222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.326786] fio D 0 502120 1 0x00004084
[ 1967.328066] Call Trace:
[ 1967.328769] __schedule+0x2ab/0x606
[ 1967.329669] schedule+0x33/0x9a
[ 1967.330559] cv_wait_common+0x100/0x130 [spl]
[ 1967.331565] ? finish_wait+0x80/0x78
[ 1967.332482] __cv_wait+0x15/0x20 [spl]
[ 1967.333596] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.334739] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.335768] zil_commit+0x40/0x60 [zfs]
[ 1967.336780] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.337793] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.338724] vfs_fsync_range+0x48/0x7c
[ 1967.339713] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.341104] do_fsync+0x3d/0x6d
[ 1967.341963] __x64_sys_fdatasync+0x17/0x1b
[ 1967.342922] do_syscall_64+0x60/0x1cb
[ 1967.343783] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.345058] RIP: 0033:0x7fd7622f956f
[ 1967.346054] Code: Bad RIP value.
[ 1967.346827] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.348378] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.349846] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.351267] RBP: 00007fd6f84d2860 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.352656] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.354082] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.355475] INFO: task fio:502124 blocked for more than 123 seconds.
[ 1967.356760] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.358204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.359687] fio D 0 502124 1 0x00004084
[ 1967.360819] Call Trace:
[ 1967.361539] __schedule+0x2ab/0x606
[ 1967.362386] schedule+0x33/0x9a
[ 1967.363206] cv_wait_common+0x100/0x130 [spl]
[ 1967.364178] ? finish_wait+0x80/0x78
[ 1967.365071] __cv_wait+0x15/0x20 [spl]
[ 1967.366019] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.367113] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.368161] zil_commit+0x40/0x60 [zfs]
[ 1967.369148] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.370068] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.371064] vfs_fsync_range+0x48/0x7c
[ 1967.371967] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.372938] do_fsync+0x3d/0x6d
[ 1967.373688] __x64_sys_fdatasync+0x17/0x1b
[ 1967.374626] do_syscall_64+0x60/0x1cb
[ 1967.375494] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.376537] RIP: 0033:0x7fd7622f956f
[ 1967.377372] Code: Bad RIP value.
[ 1967.378147] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.379575] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.380958] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.382364] RBP: 00007fd6f8517278 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.383766] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.385168] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.386571] INFO: task fio:502125 blocked for more than 123 seconds.
[ 1967.388023] Tainted: P OE 5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.389533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.391137] fio D 0 502125 1 0x00004084
[ 1967.392325] Call Trace:
[ 1967.393012] __schedule+0x2ab/0x606
[ 1967.393868] schedule+0x33/0x9a
[ 1967.394647] cv_wait_common+0x100/0x130 [spl]
[ 1967.395625] ? finish_wait+0x80/0x78
[ 1967.396504] __cv_wait+0x15/0x20 [spl]
[ 1967.397443] zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.398476] zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.399563] zil_commit+0x40/0x60 [zfs]
[ 1967.400503] zfs_fsync+0x7a/0x100 [zfs]
[ 1967.401466] zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.402386] vfs_fsync_range+0x48/0x7c
[ 1967.403310] ? __audit_syscall_exit+0x24a/0x29c
[ 1967.404331] do_fsync+0x3d/0x6d
[ 1967.405136] __x64_sys_fdatasync+0x17/0x1b
[ 1967.406168] do_syscall_64+0x60/0x1cb
[ 1967.407037] entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.408122] RIP: 0033:0x7fd7622f956f
[ 1967.408963] Code: Bad RIP value.
[ 1967.409735] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.411228] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.412743] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.414198] RBP: 00007fd6f855bc90 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.415645] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.417179] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
I went through every revision of blk-wbt.c to see when RH merged the code into their kernel. It's as close to 5.4.1 as you can get, but the core locking logic changed between 4.x and 5.0.1. Here's the diff from the current 5.13 rc (5.13.0-rc5) -- no functional difference apart from the debugfs stuff:
--- 4.18.0-305.3.1/blk-wbt.c 2021-06-12 15:53:53.714494969 +0000
+++ 5.13.0-rc5/blk-wbt.c 2021-06-12 16:16:54.000000000 +0000
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
/*
* buffered writeback throttling. loosely based on CoDel. We can't drop
* packets for IO scheduling, so the logic is something like this:
@@ -517,7 +518,7 @@
rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb);
}
-static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
+static inline bool wbt_should_throttle(struct bio *bio)
{
switch (bio_op(bio)) {
case REQ_OP_WRITE:
@@ -527,7 +528,7 @@
if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
(REQ_SYNC | REQ_IDLE))
return false;
- /* fallthrough */
+ fallthrough;
case REQ_OP_DISCARD:
return true;
default:
@@ -544,7 +545,7 @@
if (bio_op(bio) == REQ_OP_READ) {
flags = WBT_READ;
- } else if (wbt_should_throttle(rwb, bio)) {
+ } else if (wbt_should_throttle(bio)) {
if (current_is_kswapd())
flags |= WBT_KSWAPD;
if (bio_op(bio) == REQ_OP_DISCARD)
@@ -591,7 +592,7 @@
rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
}
-void wbt_issue(struct rq_qos *rqos, struct request *rq)
+static void wbt_issue(struct rq_qos *rqos, struct request *rq)
{
struct rq_wb *rwb = RQWB(rqos);
@@ -611,7 +612,7 @@
}
}
-void wbt_requeue(struct rq_qos *rqos, struct request *rq)
+static void wbt_requeue(struct rq_qos *rqos, struct request *rq)
{
struct rq_wb *rwb = RQWB(rqos);
if (!rwb_enabled(rwb))
@@ -706,6 +707,94 @@
}
EXPORT_SYMBOL_GPL(wbt_disable_default);
+#ifdef CONFIG_BLK_DEBUG_FS
+static int wbt_curr_win_nsec_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%llu\n", rwb->cur_win_nsec);
+ return 0;
+}
+
+static int wbt_enabled_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%d\n", rwb->enable_state);
+ return 0;
+}
+
+static int wbt_id_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+
+ seq_printf(m, "%u\n", rqos->id);
+ return 0;
+}
+
+static int wbt_inflight_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+ int i;
+
+ for (i = 0; i < WBT_NUM_RWQ; i++)
+ seq_printf(m, "%d: inflight %d\n", i,
+ atomic_read(&rwb->rq_wait[i].inflight));
+ return 0;
+}
+
+static int wbt_min_lat_nsec_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%lu\n", rwb->min_lat_nsec);
+ return 0;
+}
+
+static int wbt_unknown_cnt_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%u\n", rwb->unknown_cnt);
+ return 0;
+}
+
+static int wbt_normal_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%u\n", rwb->wb_normal);
+ return 0;
+}
+
+static int wbt_background_show(void *data, struct seq_file *m)
+{
+ struct rq_qos *rqos = data;
+ struct rq_wb *rwb = RQWB(rqos);
+
+ seq_printf(m, "%u\n", rwb->wb_background);
+ return 0;
+}
+
+static const struct blk_mq_debugfs_attr wbt_debugfs_attrs[] = {
+ {"curr_win_nsec", 0400, wbt_curr_win_nsec_show},
+ {"enabled", 0400, wbt_enabled_show},
+ {"id", 0400, wbt_id_show},
+ {"inflight", 0400, wbt_inflight_show},
+ {"min_lat_nsec", 0400, wbt_min_lat_nsec_show},
+ {"unknown_cnt", 0400, wbt_unknown_cnt_show},
+ {"wb_normal", 0400, wbt_normal_show},
+ {"wb_background", 0400, wbt_background_show},
+ {},
+};
+#endif
+
static struct rq_qos_ops wbt_rqos_ops = {
.throttle = wbt_wait,
.issue = wbt_issue,
@@ -715,6 +804,9 @@
.cleanup = wbt_cleanup,
.queue_depth_changed = wbt_queue_depth_changed,
.exit = wbt_exit,
+#ifdef CONFIG_BLK_DEBUG_FS
+ .debugfs_attrs = wbt_debugfs_attrs,
+#endif
};
int wbt_init(struct request_queue *q)
@@ -743,7 +835,6 @@
rwb->enable_state = WBT_STATE_ON_DEFAULT;
rwb->wc = 1;
rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
- wbt_update_limits(rwb);
/*
* Assign rwb and add the stats callback.
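For anyone wanting to regenerate a comparison like the one above, it's just a plain unified diff over the two extracted source trees. The tree paths below are hypothetical placeholders:

```shell
# Hypothetical paths: adjust to wherever the two kernel trees are unpacked.
old=linux-4.18.0-305.3.1/block/blk-wbt.c
new=linux-5.13-rc5/block/blk-wbt.c
if [ -f "$old" ] && [ -f "$new" ]; then
    # diff exits with status 1 when the files differ; that's expected here.
    diff -u "$old" "$new"
else
    echo "expected kernel trees not found; adjust old/new above"
fi
```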
The rq_qos_wait function lives in blk-rq-qos.c.
And there are no functional differences between what RH ships and the current 5.13.0 rc:
--- 4.18.0-305.3.1.el8_4/blk-rq-qos.c 2021-06-12 21:04:17.186311731 +0000
+++ 5.13.0-rc5/blk-rq-qos.c 2021-06-12 21:04:33.000000000 +0000
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
#include "blk-rq-qos.h"
/*
@@ -230,9 +232,10 @@
/**
* rq_qos_wait - throttle on a rqw if we need to
- * @private_data - caller provided specific data
- * @acquire_inflight_cb - inc the rqw->inflight counter if we can
- * @cleanup_cb - the callback to cleanup in case we race with a waker
+ * @rqw: rqw to throttle on
+ * @private_data: caller provided specific data
+ * @acquire_inflight_cb: inc the rqw->inflight counter if we can
+ * @cleanup_cb: the callback to cleanup in case we race with a waker
*
* This provides a uniform place for the rq_qos users to do their throttling.
* Since you can end up with a lot of things sleeping at once, this manages the
I'm running one final test with the 4.18.0-305.3.1.el8_4.x86_64 kernel and the original pg_dump workload + fio walwriter simulation, with wbt_lat_usec = 0, and then... 🤷♂️ .
So far the pg_dump has been running with wbt off for around 10 hours; it's close to finished, and no deadlocks have occurred.
I have the latest kernel (5.13.0-rc5) running on a different VM and I've added a bunch of logging output to wbt and rq-qos. I've been able to trigger the deadlock there:
[ 1475.672893] INFO: task txg_quiesce:6681 blocked for more than 122 seconds.
[ 1475.680673] Tainted: P O 5.13.0-rc5 #7
[ 1475.682332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.685370] task:txg_quiesce state:D stack: 0 pid: 6681 ppid: 2 flags:0x00004000
[ 1475.688989] Call Trace:
[ 1475.690466] __schedule+0x3cf/0x8e0
[ 1475.692841] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.696323] schedule+0x40/0xb0
[ 1475.698835] cv_wait_common+0x101/0x140 [spl]
[ 1475.702262] ? finish_wait+0x80/0x80
[ 1475.705119] __cv_wait+0x15/0x20 [spl]
[ 1475.707783] txg_quiesce_thread+0x2ba/0x3b0 [zfs]
[ 1475.711653] ? txg_sync_thread+0x4a0/0x4a0 [zfs]
[ 1475.714658] ? __thread_exit+0x20/0x20 [spl]
[ 1475.716365] thread_generic_wrapper+0x74/0x90 [spl]
[ 1475.718236] kthread+0x126/0x140
[ 1475.719480] ? kthread_park+0x90/0x90
[ 1475.720788] ret_from_fork+0x22/0x30
[ 1475.722118] INFO: task fio:13639 blocked for more than 122 seconds.
[ 1475.724534] Tainted: P O 5.13.0-rc5 #7
[ 1475.726018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.727833] task:fio state:D stack: 0 pid:13639 ppid: 13434 flags:0x00004000
[ 1475.729843] Call Trace:
[ 1475.730691] __schedule+0x3cf/0x8e0
[ 1475.731839] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.733200] schedule+0x40/0xb0
[ 1475.734163] cv_wait_common+0x101/0x140 [spl]
[ 1475.735486] ? finish_wait+0x80/0x80
[ 1475.736560] __cv_wait+0x15/0x20 [spl]
[ 1475.737600] zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.738974] zil_commit+0x3d/0x60 [zfs]
[ 1475.740086] zfs_fsync+0x77/0x100 [zfs]
[ 1475.741240] zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.742378] vfs_fsync_range+0x48/0x80
[ 1475.743440] do_fsync+0x3d/0x70
[ 1475.744425] __x64_sys_fdatasync+0x17/0x20
[ 1475.745565] do_syscall_64+0x3c/0x80
[ 1475.746650] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.748052] RIP: 0033:0x7fad4cdd356f
[ 1475.749104] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.750955] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.752763] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.754491] RBP: 00007face2e9a000 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.756188] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.757927] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.759700] INFO: task fio:13640 blocked for more than 122 seconds.
[ 1475.761301] Tainted: P O 5.13.0-rc5 #7
[ 1475.763130] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.764990] task:fio state:D stack: 0 pid:13640 ppid: 13434 flags:0x00004000
[ 1475.766963] Call Trace:
[ 1475.767713] __schedule+0x3cf/0x8e0
[ 1475.768690] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.769920] schedule+0x40/0xb0
[ 1475.770850] cv_wait_common+0x101/0x140 [spl]
[ 1475.772018] ? finish_wait+0x80/0x80
[ 1475.773084] __cv_wait+0x15/0x20 [spl]
[ 1475.774106] zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.775296] zil_commit+0x3d/0x60 [zfs]
[ 1475.776406] zfs_fsync+0x77/0x100 [zfs]
[ 1475.777528] zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.778630] vfs_fsync_range+0x48/0x80
[ 1475.779607] do_fsync+0x3d/0x70
[ 1475.780550] __x64_sys_fdatasync+0x17/0x20
[ 1475.781573] do_syscall_64+0x3c/0x80
[ 1475.782490] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.783782] RIP: 0033:0x7fad4cdd356f
[ 1475.784834] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.786703] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.788368] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.790020] RBP: 00007face2edea18 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.791750] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.793481] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.812159] INFO: task fio:13645 blocked for more than 123 seconds.
[ 1475.818468] Tainted: P O 5.13.0-rc5 #7
[ 1475.823467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.832167] task:fio state:D stack: 0 pid:13645 ppid: 13434 flags:0x00004000
[ 1475.841001] Call Trace:
[ 1475.842841] __schedule+0x3cf/0x8e0
[ 1475.845170] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.850962] schedule+0x40/0xb0
[ 1475.854814] cv_wait_common+0x101/0x140 [spl]
[ 1475.859826] ? finish_wait+0x80/0x80
[ 1475.863914] __cv_wait+0x15/0x20 [spl]
[ 1475.868357] zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.873168] ? tsd_hash_search.isra.2+0x72/0xa0 [spl]
[ 1475.878508] ? __cond_resched+0x19/0x30
[ 1475.882175] zil_commit+0x3d/0x60 [zfs]
[ 1475.885621] zfs_fsync+0x77/0x100 [zfs]
[ 1475.889968] zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.893861] vfs_fsync_range+0x48/0x80
[ 1475.898164] do_fsync+0x3d/0x70
[ 1475.902002] __x64_sys_fdatasync+0x17/0x20
[ 1475.906572] do_syscall_64+0x3c/0x80
[ 1475.909937] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.915348] RIP: 0033:0x7fad4cdd356f
[ 1475.918692] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.925309] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.931657] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.938185] RBP: 00007face2f23430 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.940354] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.945659] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.953881] INFO: task fio:13650 blocked for more than 123 seconds.
[ 1475.959671] Tainted: P O 5.13.0-rc5 #7
[ 1475.961244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.966900] task:fio state:D stack: 0 pid:13650 ppid: 13434 flags:0x00004000
[ 1475.972529] Call Trace:
[ 1475.974065] __schedule+0x3cf/0x8e0
[ 1475.976477] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.980662] schedule+0x40/0xb0
[ 1475.982984] cv_wait_common+0x101/0x140 [spl]
[ 1475.987037] ? finish_wait+0x80/0x80
[ 1475.990871] __cv_wait+0x15/0x20 [spl]
[ 1475.993905] zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.996840] zil_commit+0x3d/0x60 [zfs]
[ 1475.999259] zfs_fsync+0x77/0x100 [zfs]
[ 1476.001682] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.004030] vfs_fsync_range+0x48/0x80
[ 1476.007726] do_fsync+0x3d/0x70
[ 1476.011257] __x64_sys_fdatasync+0x17/0x20
[ 1476.015551] do_syscall_64+0x3c/0x80
[ 1476.018730] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.022309] RIP: 0033:0x7fad4cdd356f
[ 1476.026080] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.030410] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.035633] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.041674] RBP: 00007face2f67e48 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.045831] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.050323] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.056931] INFO: task fio:13651 blocked for more than 123 seconds.
[ 1476.063274] Tainted: P O 5.13.0-rc5 #7
[ 1476.070005] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.076462] task:fio state:D stack: 0 pid:13651 ppid: 13434 flags:0x00004000
[ 1476.078773] Call Trace:
[ 1476.079938] __schedule+0x3cf/0x8e0
[ 1476.081461] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.082981] schedule+0x40/0xb0
[ 1476.084131] cv_wait_common+0x101/0x140 [spl]
[ 1476.085515] ? finish_wait+0x80/0x80
[ 1476.086753] __cv_wait+0x15/0x20 [spl]
[ 1476.088071] zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.089910] zil_commit+0x3d/0x60 [zfs]
[ 1476.091653] zfs_fsync+0x77/0x100 [zfs]
[ 1476.092929] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.094341] vfs_fsync_range+0x48/0x80
[ 1476.096293] do_fsync+0x3d/0x70
[ 1476.097406] __x64_sys_fdatasync+0x17/0x20
[ 1476.098626] do_syscall_64+0x3c/0x80
[ 1476.100928] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.103333] RIP: 0033:0x7fad4cdd356f
[ 1476.105919] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.110575] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.115372] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.120525] RBP: 00007face2fac860 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.125132] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.130004] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.135627] INFO: task fio:13653 blocked for more than 123 seconds.
[ 1476.140045] Tainted: P O 5.13.0-rc5 #7
[ 1476.143808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.148478] task:fio state:D stack: 0 pid:13653 ppid: 13434 flags:0x00004000
[ 1476.154143] Call Trace:
[ 1476.156033] __schedule+0x3cf/0x8e0
[ 1476.158125] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.160982] schedule+0x40/0xb0
[ 1476.163165] cv_wait_common+0x101/0x140 [spl]
[ 1476.165595] ? finish_wait+0x80/0x80
[ 1476.167565] __cv_wait+0x15/0x20 [spl]
[ 1476.169359] zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.170756] zil_commit+0x3d/0x60 [zfs]
[ 1476.171867] zfs_fsync+0x77/0x100 [zfs]
[ 1476.173308] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.174497] vfs_fsync_range+0x48/0x80
[ 1476.175470] do_fsync+0x3d/0x70
[ 1476.176484] __x64_sys_fdatasync+0x17/0x20
[ 1476.177618] do_syscall_64+0x3c/0x80
[ 1476.178764] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.180025] RIP: 0033:0x7fad4cdd356f
[ 1476.181390] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.183251] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.184992] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.187140] RBP: 00007face2ff1278 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.188884] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.190732] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.192817] INFO: task fio:13654 blocked for more than 123 seconds.
[ 1476.194614] Tainted: P O 5.13.0-rc5 #7
[ 1476.196167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.198192] task:fio state:D stack: 0 pid:13654 ppid: 13434 flags:0x00004000
[ 1476.200497] Call Trace:
[ 1476.201321] __schedule+0x3cf/0x8e0
[ 1476.202299] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.203785] schedule+0x40/0xb0
[ 1476.204847] cv_wait_common+0x101/0x140 [spl]
[ 1476.206131] ? finish_wait+0x80/0x80
[ 1476.207393] __cv_wait+0x15/0x20 [spl]
[ 1476.208566] zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.209991] zil_commit+0x3d/0x60 [zfs]
[ 1476.211291] zfs_fsync+0x77/0x100 [zfs]
[ 1476.212645] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.213857] vfs_fsync_range+0x48/0x80
[ 1476.214939] do_fsync+0x3d/0x70
[ 1476.215984] __x64_sys_fdatasync+0x17/0x20
[ 1476.217298] do_syscall_64+0x3c/0x80
[ 1476.218293] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.219782] RIP: 0033:0x7fad4cdd356f
[ 1476.220794] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.222676] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.224484] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.226342] RBP: 00007face3035c90 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.228632] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.230491] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.232488] INFO: task fio:13656 blocked for more than 123 seconds.
[ 1476.234220] Tainted: P O 5.13.0-rc5 #7
[ 1476.235769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.238163] task:fio state:D stack: 0 pid:13656 ppid: 13434 flags:0x00004000
[ 1476.240524] Call Trace:
[ 1476.241474] __schedule+0x3cf/0x8e0
[ 1476.242773] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.244450] schedule+0x40/0xb0
[ 1476.245587] cv_wait_common+0x101/0x140 [spl]
[ 1476.246806] ? finish_wait+0x80/0x80
[ 1476.248046] __cv_wait+0x15/0x20 [spl]
[ 1476.249366] zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.251020] zil_commit+0x3d/0x60 [zfs]
[ 1476.252354] zfs_fsync+0x77/0x100 [zfs]
[ 1476.253753] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.255040] vfs_fsync_range+0x48/0x80
[ 1476.256043] do_fsync+0x3d/0x70
[ 1476.257211] __x64_sys_fdatasync+0x17/0x20
[ 1476.258575] do_syscall_64+0x3c/0x80
[ 1476.259608] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.260953] RIP: 0033:0x7fad4cdd356f
[ 1476.261969] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.264344] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.266401] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.268398] RBP: 00007face307a6a8 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.270290] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.272072] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.273833] INFO: task fio:13662 blocked for more than 123 seconds.
[ 1476.275345] Tainted: P O 5.13.0-rc5 #7
[ 1476.276861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.278787] task:fio state:D stack: 0 pid:13662 ppid: 13434 flags:0x00004000
[ 1476.280830] Call Trace:
[ 1476.281685] __schedule+0x3cf/0x8e0
[ 1476.282697] ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.284090] schedule+0x40/0xb0
[ 1476.285184] cv_wait_common+0x101/0x140 [spl]
[ 1476.286497] ? finish_wait+0x80/0x80
[ 1476.287544] __cv_wait+0x15/0x20 [spl]
[ 1476.288690] zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.289994] zil_commit+0x3d/0x60 [zfs]
[ 1476.291166] zfs_fsync+0x77/0x100 [zfs]
[ 1476.292621] zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.294031] vfs_fsync_range+0x48/0x80
[ 1476.295083] do_fsync+0x3d/0x70
[ 1476.296058] __x64_sys_fdatasync+0x17/0x20
[ 1476.297157] do_syscall_64+0x3c/0x80
[ 1476.298260] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.299595] RIP: 0033:0x7fad4cdd356f
[ 1476.300724] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.302624] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.304230] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.306194] RBP: 00007face30bf0c0 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.308274] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.309945] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
It's a little different because there's no z_wr_iss_h backtrace, but I can see that it's hung in ps.
I'm going to study my wbt/rq-qos logging output and see if there's anything noteworthy in there. Not really sure what I'm looking for yet, and I might try to repro on xfs.
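Side note: the debugfs attributes added in the 5.13 hunk above (enabled, inflight, curr_win_nsec, min_lat_nsec, ...) make the wbt state inspectable at runtime, which may help correlate the hang with throttle state. Something along these lines, assuming debugfs is mounted; sda here is a placeholder device name:

```shell
# Placeholder device; the rqos/wbt directory only exists with
# CONFIG_BLK_DEBUG_FS and an active wbt instance on the queue.
dir=/sys/kernel/debug/block/sda/rqos/wbt
if [ -d "$dir" ]; then
    for f in enabled inflight curr_win_nsec min_lat_nsec; do
        printf '%s: %s\n' "$f" "$(cat "$dir/$f")"
    done
else
    echo "no wbt debugfs attrs at $dir"
fi
```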
The pg_dump completed successfully on 4.18.0-305.3.1.el8_4.x86_64 with wbt turned off. Apart from one hrtimer notice, dmesg was empty for the ~18 hours it took to complete:
[ 996.501515] hrtimer: interrupt took 9638 ns
[root@instance-20210606-1831 ~]# du -sh /dump/pg3/
577G /dump/pg3/
[root@instance-20210606-1831 ~]# uptime
15:54:11 up 21:52, 1 user, load average: 0.02, 0.03, 0.01
I was able to reproduce the deadlock with wbt enabled on 5.13.0-rc5:
So I've posted about this issue over on linux-block:
I guess we'll see. :)
Just in case anyone stumbles across this issue, it has been fixed and queued for 5.14:
Unrelated to ZFS in the end, though ZFS should consider implementing its own writer throttling system as it has far more information about its activity than it exposes to the Linux block layer.
When I ran my pg_dump while running fio for writes, it took more than 3x longer with wbt off than on (18 hours vs 5 hours).
IMO you should leave this open, as it's not guaranteed that the commit in question gets backported everywhere, unless someone in charge says "nah, we can just tell them to run a non-broken kernel even if their distro doesn't package it".
I believe ZFS does have some degree of scheduling and backoff. I'm sure it could be refined, given the data you're sharing.
Alright.
I’m not clear on how patches get backported to older mainline kernel versions, but this one breaks the module ABI, so it’ll be interesting to see what Red Hat does.
Without the patch any Linux system in existence could deadlock on writes essentially randomly, albeit rarely. So that’s fun.
(I believe) RH can and will backport absolutely anything if they feel strongly enough about it (and if their customer systems start hanging, I imagine they will). I am no longer clear on the timetables between RHEL 8, CentOS 8, and CentOS Stream 8 kernels, but since RHEL only releases their deltas against mainline as massive blobs now, I can't see them not including it if RHEL does. God only knows what OEL will do.
I believe other distros tend to be more limited in what they'll grab - but that patch doesn't look really invasive, to me, so I imagine if someone reports the issue happening on a distro's kernel, they might well pull it.
youzhongyang at https://github.com/openzfs/zfs/issues/12085#issuecomment-844474354 suggested that setting wbt_lat_usec to 0 might mitigate the issue for the time being:
"We hope by turning wbt off (/sys/block/*/queue/wbt_lat_usec set to 0), it will never happen again. Time will tell if it helps."
I'll give it a try as well:
for file in /sys/block/*/queue/wbt_lat_usec; do echo "0" > $file; done
should do it.
If you look 20+ comments up, you'll see that the entire recent discussion, in which they disclose the impact of changing that setting, and point ultimately to the bug involving the WBT, was started with a link to that comment.
So yes, that's been covered.
FWIW, the wbt fix has been merged into Linus's tree:
Hopefully this is something people can advocate to have included in distro kernels.
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
System information
Describe the problem you're observing
Write I/O blocked on ZFS file system.
Describe how to reproduce the problem
I was running a pg_dump of a large-ish (5Tb) database from a ZFS filesystem to an XFS filesystem.
Include any warning/errors/backtraces from the system logs
Full kernel backtrace included here:
https://gist.github.com/omarkilani/e722e4ee2f91914e5b9a243ebb34f2b3
Module is running with default options.
I can read all data from file system even though all writers are blocked.
I can create files on the file system but a 'sync' hangs indefinitely.
No other messages written to the kernel buffer.
The initiating process that caused the hang:
It doesn't respond to strace or any other SystemTap-style tools.
Running on AMD Milan CPUs: