openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

z_wr_iss_h hangs indefinitely in wbt_wait with saturated I/O bandwidth and wbt enabled #12204

Closed omarkilani closed 1 year ago

omarkilani commented 3 years ago

System information

Type Version/Name
Distribution Name RHEL
Distribution Version 8.4
Linux Kernel 4.18.0-305.3.1.el8_4
Architecture x86_64
ZFS Version 2.1.0-rc6
SPL Version 2.1.0-rc6

Describe the problem you're observing

Write I/O blocked on ZFS file system.

Describe how to reproduce the problem

I was running a pg_dump of a largish (5 TB) database from a ZFS filesystem to an XFS filesystem.

Include any warning/errors/backtraces from the system logs

Full kernel backtrace included here:

https://gist.github.com/omarkilani/e722e4ee2f91914e5b9a243ebb34f2b3

Module is running with default options.

I can read all data from the file system even though all writers are blocked.

I can create files on the file system, but a sync hangs indefinitely.

No other messages are written to the kernel ring buffer.
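One way to probe the "sync hangs" symptom without wedging an interactive shell is to bound it with a deadline (30 s here is an arbitrary choice, not anything from the report):

```shell
# Run sync under timeout so a hung ZIL commit can't wedge the shell.
# timeout exits with status 124 when it had to kill the command.
if timeout 30 sync; then
    status=completed
else
    status=hung
fi
echo "sync $status"
```

On a healthy system this prints "sync completed" almost immediately; during the hang described above it would report "sync hung" after 30 seconds.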

The initiating process that caused the hang:

postgres  615812  0.0  0.0 34718200 21804 ?      Ds   02:10   0:06 postgres: walwriter   

It doesn't respond to strace or any other SystemTap-style tools.

Running on AMD Milan CPUs:

processor   : 0
vendor_id   : AuthenticAMD
cpu family  : 25
model       : 1
model name  : AMD EPYC 7J13 64-Core Processor
stepping    : 1
microcode   : 0x1000065
cpu MHz     : 2545.216
cache size  : 512 KB
physical id : 0
siblings    : 12
core id     : 0
cpu cores   : 6
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 16
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core cpb ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip vaes vpclmulqdq rdpid arch_capabilities
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips    : 5090.43
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
omarkilani commented 3 years ago

I rebooted the VM, ran scrub and everything was fine on the pool.

I can write to it normally and whatnot. Then I ran pg_dump again and z_wr_iss_h hung again.

I've changed the VM from an AMD Milan CPU to an Intel Ice Lake CPU:

model name  : Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz

And weirdly... pg_dump looks to be working normally there.

It has the exact same boot volume, kernel, zfs module, pool, etc.

The only difference is the CPU.

Pretty weird.

omarkilani commented 3 years ago

Well... the pg_dump completed successfully on the Ice Lake processor. It dumped 5 TB of data (2.8 TB lz4-compressed) from ZFS into about 600 GB of gzipped data on XFS.

I don't even know where to begin with "AMD EPYC processor causes zfs to hang". Sigh.

rincebrain commented 3 years ago

I think I know (since I don't know of any other virtualization solutions that let you flip that bit so easily), but precisely what virtualization tool and version are you using? And do you have any idea how far into the pg_dump it gets before hanging? (I presume the version of postgres is whatever's latest stable in RHEL 8.4?)

omarkilani commented 3 years ago

Hey @rincebrain,

These are VMs running at Oracle Cloud (where I was doing the arm64 testing before).

It's KVM-based:

[    0.000000] efi: EFI v2.70 by EDK II
[    0.000000] efi:  SMBIOS=0xbf742000  ACPI=0xbf97e000  ACPI 2.0=0xbf97e014  MEMATTR=0xbe0fc018  MOKvar=0xbf73c000 
[    0.000000] secureboot: Secure boot disabled
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
[    0.000000] Hypervisor detected: KVM
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr b51401001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 9596802563 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] kvm-guest: KVM setup pv remote TLB flush
[    0.000000] kvm-guest: setup PV sched yield
[    0.000000] kvm-guest: stealtime: cpu 0, msr 17cfe2c080
[    0.000000] kvm-guest: PV spinlocks enabled
[    0.001311] kvm-guest: setup PV IPIs
[    0.001000] kvm-clock: cpu 1, msr b51401041, secondary cpu clock
[    0.020047] kvm-guest: stealtime: cpu 1, msr 17cfe6c080
[    0.001000] kvm-clock: cpu 2, msr b51401081, secondary cpu clock
[    0.020385] kvm-guest: stealtime: cpu 2, msr 17cfeac080
[    0.001000] kvm-clock: cpu 3, msr b514010c1, secondary cpu clock
[    0.021063] kvm-guest: stealtime: cpu 3, msr 17cfeec080
[    0.001000] kvm-clock: cpu 4, msr b51401101, secondary cpu clock
[    0.021341] kvm-guest: stealtime: cpu 4, msr 17cff2c080
[    0.001000] kvm-clock: cpu 5, msr b51401141, secondary cpu clock
[    0.021341] kvm-guest: stealtime: cpu 5, msr 17cff6c080
[    0.001000] kvm-clock: cpu 6, msr b51401181, secondary cpu clock
[    0.022299] kvm-guest: stealtime: cpu 6, msr 17cffac080
[    0.001000] kvm-clock: cpu 7, msr b514011c1, secondary cpu clock
[    0.022501] kvm-guest: stealtime: cpu 7, msr 17cffec080
[    0.001000] kvm-clock: cpu 8, msr b51401201, secondary cpu clock
[    0.023334] kvm-guest: stealtime: cpu 8, msr 17d002c080
[    0.001000] kvm-clock: cpu 9, msr b51401241, secondary cpu clock
[    0.023367] kvm-guest: stealtime: cpu 9, msr 17d006c080
[    0.001000] kvm-clock: cpu 10, msr b51401281, secondary cpu clock
[    0.024193] kvm-guest: stealtime: cpu 10, msr 17d00ac080
[    0.001000] kvm-clock: cpu 11, msr b514012c1, secondary cpu clock
[    0.024374] kvm-guest: stealtime: cpu 11, msr 17d00ec080
[    0.161088] clocksource: Switched to clocksource kvm-clock
[    0.928366] systemd[1]: Detected virtualization kvm.
[   11.028118] systemd[1]: Detected virtualization kvm.

This is a rebuilt SRPM from the PGDG project:

https://git.postgresql.org/gitweb/?p=pgrpms.git;a=tree;f=rpm/redhat/master/non-common/postgresql-13/EL-8;h=f798d19efd8480666e0bd6fdb852e90b064ce470;hb=HEAD

Since it's x86_64 I just used the Postgres 13 RPMs we use in production. We have some custom Postgres extensions that we add to the RPMs but they're not used in the pg_dump process.

The pg_dump got to 330 GB the first time and 390 GB the second time.

omarkilani commented 3 years ago

Okay, so the other weird thing is that a pg_restore onto the ZFS file system from the XFS file system works fine on the EPYC VM. That's 2.8TB lz4 compressed / 5TB logicalused.

The pg_dump is a read workload, and I had N = number-of-CPUs dump jobs running, which is fairly intensive (the pg_restore also uses N = number of CPUs, and that seems okay).

So maybe the database went to write something out while that was going on and z_wr_iss_h / txg_quiesce hung for whatever reason. Some weird locking thing on EPYC? 🤷‍♂️

rincebrain commented 3 years ago

Yeah, I was assuming similarly that it was an issue with heavy reads plus some writes while the reads are in flight, and I was planning to experiment with 400+ GiB pgsql databases (though I cannot, as it turns out, map 64 logical cores to 64 distinct real cores locally), but I haven't gotten a chance to yet.

I'm slightly surprised more people haven't screamed if there's a problem common enough for you to hit, even if Epyc-specific, unless you got really lucky. Oh well, more poking around will answer that.

omarkilani commented 3 years ago

Milan is fairly new and has had some major internal changes so I’m going to try it on the previous gen EPYC (Rome) at both OCI and GCP.

It might be possible for you to generate enough data with pgbench -i -s NNNNNN, where the scale factor is huge; set fsync = off in postgresql.conf beforehand to make that go a little faster.

I think a value of 100 generates something like 25GB of real data.

omarkilani commented 3 years ago

Okay... Something Weird (tm) is going on with these Milan machines.

The kernel is spending significant time in native_write_msr, native_apic_msr_eoi_write and native_read_msr.

Maybe something is going on with locking... for some reason. Or the hypervisor isn't happy on Milan for whatever reason.

Maybe interrupts / X2APIC.

rincebrain commented 3 years ago

Can you try scaling down (a) the pg_dump thread count, and (b) that plus the actual VM core count, and see if performance actually improves with one or both?

omarkilani commented 3 years ago

Yup. I'm just going to finish this test on OCI Milan VM / OCI Rome VM / GCP n2d Rome VM first since I happened to have the same data in both places.

FWIW, I'm using 12 CPU VMs for Rome/Milan and Ice Lake at OCI and running with -j 12.

Just going by my possibly hazy memory... the output of perf top while running on the Milan vs the Ice Lake looks very different. I'll have to double check that.

(I'm just using this ticket to keep all the data in one place -- feel free to ignore my rants etc.)

omarkilani commented 3 years ago

On GCP, I'm using a n2d-highmem-16 shape since it's the closest to the OCI E4.Flex shape with 12 CPUs and 96GB of RAM.

I'm trying to set up the same environment for ZFS by attaching 16 NVMe disks to it and using them all in a pool. On OCI I have 12 members in the pool x 500 GB; on GCP I have 16 members in the pool x 375 GB, as you can only attach them in multiples of 8.

I have to copy the data off block storage (which is on XFS) onto the zpool. That's taking a while, because those NVMe disks are extremely slow for writes.

I'm running the rhel-8-v20210609 image with the same 4.18.0-305.3.1.el8_4.x86_64 kernel.

All RPMs used are the same across both OCI / GCP (zfs, Postgres, etc).

Once the data is copied over I'll run pg_dump with the same parameters as I used before but I'll use -j 16 instead.

omarkilani commented 3 years ago

I'm not sure using fewer cores will change much, as it looks like it's only using 7 cores at that point.

[root@instance-20210606-1831 ~]# du -sh /dump/pg3/
372G    /dump/pg3/
[root@instance-20210606-1831 ~]# ps auxww | grep COPY
postgres 1612463 42.4  0.2 34719172 243888 ?     Ds   02:23  50:51 postgres: postgres x [local] COPY
postgres 1612466 48.9  0.2 34719288 255328 ?     Ds   02:23  58:40 postgres: postgres x [local] COPY
postgres 1612469 17.1  0.2 34719176 243828 ?     Ds   02:23  20:30 postgres: postgres x [local] COPY
postgres 1612472 30.6  0.6 34723496 650632 ?     Ds   02:23  36:40 postgres: postgres x [local] COPY
postgres 1612473 32.2  0.2 34719176 243816 ?     Ds   02:23  38:37 postgres: postgres x [local] COPY
postgres 1612476 14.0 19.8 34777296 19582612 ?   Ds   02:23  16:48 postgres: postgres x [local] COPY
postgres 1612478 18.1 17.6 34782452 17420144 ?   Rs   02:23  21:41 postgres: postgres x [local] COPY

These are my pg_dump args:

/usr/pgsql-13/bin/pg_dump -Fd -f /dump/pg3 -Z 1 -v -j 12 --no-sync x

/dump is an XFS file system mounted with -o noatime.

The source database is Postgres 11.12, though. It shouldn't matter, but I'm noting it for the sake of presenting all the data points.

omarkilani commented 3 years ago

The pg_dump was successful on GCP, but... it was also successful at OCI for the first time.

The two differences at OCI vs the other 3 times I ran it:

  1. I'd pg_restore'ed the dump into a new Postgres 13.3 database on the same ZFS filesystem, so I had written to roughly 90% of capacity.

  2. I increased the storage performance slider at OCI from 10 to 20. This is an artificial performance limiter: you pay more for higher bandwidth/IOPS. The slider maps to GCP performance levels roughly like:

OCI 0  = GCP Standard
OCI 10 = GCP Balanced
OCI 20 = GCP SSD

I have a snapshot of the pool in the exact state where it was hanging, so I'll restore back to that after I adjust the slider down to 10 again.

omarkilani commented 3 years ago

Rather annoyingly for repro purposes, Oracle upped the performance of the various slider settings... today:

https://blogs.oracle.com/cloud-infrastructure/300000-iops-per-volume-with-ultra-high-performance-oci-block-volumes

This could explain why it magically started working. I'm going to try a couple of other things but... I don't think ZFS should lock up in general. lol.

Sigh...

rincebrain commented 3 years ago

I would tend to agree, re: lockups. (Lest you think we're somehow on different pages here.)

omarkilani commented 3 years ago

I restored the snapshot so I can do this with what used to break it before. I'm running that pg_dump now on the AMD Milan VM at OCI.

Every 5 seconds or so the postgres walwriter process writes out something like 8 KB (though strace's write shows 16 KB sometimes, for whatever reason) to the WAL.

epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200)               = 0
close(4)                                = 0
write(3, "\230\320\5\0\1\0\0\0\0@\377\222e\210\0\0E\32\0\0\0\0\0\0\0\0 XXXXX"..., 16384) = 16384
fdatasync(3)                            = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200)               = 0
close(4)                                = 0
write(3, "\230\320\5\0\1\0\0\0\0\200\377\222e\210\0\0\371\30\0\0\0\0\0 XXXXX"..., 8192) = 8192
fdatasync(3)                            = 0
epoll_create1(EPOLL_CLOEXEC)            = 4
epoll_ctl(4, EPOLL_CTL_ADD, 11, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587608, u64=31587608}}) = 0
epoll_ctl(4, EPOLL_CTL_ADD, 7, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=31587632, u64=31587632}}) = 0
epoll_wait(4, [], 1, 200)               = 0
close(4)                                = 0
write(3, "\230\320\5\0\1\0\0\0\0\240\377\222e\210\0\0%\30\0\0\0\0\0\0 XXXXXX"..., 8192) = 8192
fdatasync(3)                            = 0

And fd/3 is:

lrwx------ 1 postgres postgres 64 Jun 10 23:17 /proc/496961/fd/3 -> /tank/db/x/pg_wal/000000010000886500000093

So I think that's the kind of thing that got it stuck before. It's not high write traffic at all, but it does call fdatasync on the fd.
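As a dependency-free cross-check, the syscall pattern in the strace above (one small write, then an fdatasync) can be mimicked with a dd loop; walsim.tmp is a placeholder path, where on the affected box it would sit on the ZFS dataset, and the walwriter's ~5 s sleep between writes is omitted for brevity:

```shell
# Mimic the walwriter I/O pattern: 8 KiB writes, each followed by
# fdatasync. conv=fdatasync makes dd sync file data before exiting,
# so every iteration is exactly one write + one fdatasync.
for i in $(seq 1 10); do
    dd if=/dev/zero of=walsim.tmp bs=8k count=1 seek="$i" \
       conv=notrunc,fdatasync status=none
done
```

Each iteration writes one 8 KiB block at offset i*8 KiB, so after the loop the file is 11 blocks (90112 bytes) with a hole at the start.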

omarkilani commented 3 years ago

Okay, I can no longer reproduce this on the exact same VM setup with the exact same data -- the upped storage performance limits seem to have played a role in that.

I'm going to retry with the slowest VPU (volume performance units) setting they have. Unfortunately, that one is literally spinning disks, and it's taking a while to move the data across. But it should be a good test.

omarkilani commented 3 years ago

I managed to repro it with a fio test case, woo. 🎉

I got to 416GB of dumped data but I triggered the exact same lockup.

Here's the fio script; it simulates the postgres walwriter workload of 8 KB writes, each followed by an fdatasync on the fd, against 16 MB files:

[global]
ioengine=sync
bs=8K
iodepth=1
fdatasync=1
group_reporting
time_based
runtime=6000
numjobs=12
name=raw-write
rw=write

[job1]
filename=/tank/db/foo
size=16MB

You run this alongside the pg_dump and sooner or later you trigger...

[28016.449085] INFO: task z_wr_iss_h:9261 blocked for more than 120 seconds.
[28016.455939]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.458492] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.461122] z_wr_iss_h      D    0  9261      2 0x80004080
[28016.462863] Call Trace:
[28016.463943]  __schedule+0x2c4/0x700
[28016.465321]  ? wbt_exit+0x30/0x30
[28016.466676]  ? __wbt_done+0x30/0x30
[28016.469090]  schedule+0x38/0xa0
[28016.471471]  io_schedule+0x12/0x40
[28016.473133]  rq_qos_wait+0xfa/0x170
[28016.474556]  ? karma_partition+0x1e0/0x1e0
[28016.476261]  ? wbt_exit+0x30/0x30
[28016.477845]  wbt_wait+0x99/0xe0
[28016.479390]  __rq_qos_throttle+0x23/0x30
[28016.480970]  blk_mq_make_request+0x12d/0x5b0
[28016.482775]  generic_make_request+0x2db/0x340
[28016.485011]  ? abd_to_buf+0xe/0x20 [zfs]
[28016.487466]  ? abd_bio_map_off+0x1da/0x210 [zfs]
[28016.490301]  submit_bio+0x3c/0x160
[28016.492236]  __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[28016.494420]  vdev_disk_io_start+0x8d/0x370 [zfs]
[28016.496037]  zio_vdev_io_start+0xf8/0x340 [zfs]
[28016.497436]  zio_nowait+0xb4/0x1a0 [zfs]
[28016.498961]  vdev_mirror_io_start+0xa2/0x250 [zfs]
[28016.500774]  zio_vdev_io_start+0x2c2/0x340 [zfs]
[28016.502390]  zio_execute+0x81/0x120 [zfs]
[28016.503882]  taskq_thread+0x2f0/0x530 [spl]
[28016.505363]  ? wake_up_q+0x80/0x80
[28016.506733]  ? zio_gang_tree_free+0x60/0x60 [zfs]
[28016.508289]  ? taskq_thread_spawn+0x50/0x50 [spl]
[28016.510009]  kthread+0x116/0x130
[28016.511202]  ? kthread_flush_work_fn+0x10/0x10
[28016.512842]  ret_from_fork+0x22/0x40
[28016.514734] INFO: task txg_quiesce:10004 blocked for more than 120 seconds.
[28016.518027]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.521945] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.524217] txg_quiesce     D    0 10004      2 0x80004080
[28016.526047] Call Trace:
[28016.527039]  __schedule+0x2c4/0x700
[28016.528116]  ? kfree+0x40b/0x450
[28016.529107]  schedule+0x38/0xa0
[28016.530051]  cv_wait_common+0xfb/0x130 [spl]
[28016.531215]  ? finish_wait+0x80/0x80
[28016.532617]  txg_quiesce+0x1d3/0x260 [zfs]
[28016.533706]  txg_quiesce_thread+0xf9/0x180 [zfs]
[28016.534833]  ? txg_quiesce+0x260/0x260 [zfs]
[28016.535863]  ? __thread_exit+0x20/0x20 [spl]
[28016.536907]  thread_generic_wrapper+0x6f/0x80 [spl]
[28016.538036]  kthread+0x116/0x130
[28016.538872]  ? kthread_flush_work_fn+0x10/0x10
[28016.540056]  ret_from_fork+0x22/0x40
[28016.541027] INFO: task fio:3958565 blocked for more than 120 seconds.
[28016.542512]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.544334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.546814] fio             D    0 3958565 3957974 0x00004080
[28016.549112] Call Trace:
[28016.549885]  __schedule+0x2c4/0x700
[28016.550870]  ? mutex_lock+0xe/0x30
[28016.551859]  schedule+0x38/0xa0
[28016.552679]  cv_wait_common+0xfb/0x130 [spl]
[28016.553775]  ? finish_wait+0x80/0x80
[28016.554824]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.556051]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.557413]  zfs_fsync+0x70/0xf0 [zfs]
[28016.558768]  zpl_fsync+0x67/0xa0 [zfs]
[28016.559959]  ? __audit_syscall_exit+0x249/0x2a0
[28016.561112]  do_fsync+0x38/0x70
[28016.562181]  __x64_sys_fdatasync+0x13/0x20
[28016.563461]  do_syscall_64+0x5b/0x1a0
[28016.564687]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.566206] RIP: 0033:0x7f0f4fd8c56f
[28016.567299] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.568686] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.570452] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.572179] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.573876] RBP: 00007f0ee5e53000 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.575551] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.577233] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.579153] INFO: task fio:3958567 blocked for more than 120 seconds.
[28016.581019]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.582960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.584825] fio             D    0 3958567 3957974 0x00004080
[28016.586365] Call Trace:
[28016.587266]  __schedule+0x2c4/0x700
[28016.588307]  ? mutex_lock+0xe/0x30
[28016.589357]  schedule+0x38/0xa0
[28016.590228]  cv_wait_common+0xfb/0x130 [spl]
[28016.591434]  ? finish_wait+0x80/0x80
[28016.592500]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.593797]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.595159]  zfs_fsync+0x70/0xf0 [zfs]
[28016.596411]  zpl_fsync+0x67/0xa0 [zfs]
[28016.597468]  ? __audit_syscall_exit+0x249/0x2a0
[28016.598595]  do_fsync+0x38/0x70
[28016.599560]  __x64_sys_fdatasync+0x13/0x20
[28016.600681]  do_syscall_64+0x5b/0x1a0
[28016.601673]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.603006] RIP: 0033:0x7f0f4fd8c56f
[28016.604062] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.606699] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.608554] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.610267] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.611892] RBP: 00007f0ee5e97a18 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.613764] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.615669] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.617555] INFO: task fio:3958569 blocked for more than 120 seconds.
[28016.619367]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.621845] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.624002] fio             D    0 3958569 3957974 0x00004080
[28016.625825] Call Trace:
[28016.627124]  __schedule+0x2c4/0x700
[28016.628140]  ? mutex_lock+0xe/0x30
[28016.629123]  schedule+0x38/0xa0
[28016.630136]  cv_wait_common+0xfb/0x130 [spl]
[28016.631385]  ? finish_wait+0x80/0x80
[28016.632597]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.634062]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.635601]  zfs_fsync+0x70/0xf0 [zfs]
[28016.636980]  zpl_fsync+0x67/0xa0 [zfs]
[28016.638231]  ? __audit_syscall_exit+0x249/0x2a0
[28016.639679]  do_fsync+0x38/0x70
[28016.640881]  __x64_sys_fdatasync+0x13/0x20
[28016.642236]  do_syscall_64+0x5b/0x1a0
[28016.643502]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.644926] RIP: 0033:0x7f0f4fd8c56f
[28016.645866] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.647368] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.648971] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.650889] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.652427] RBP: 00007f0ee5edc430 R08: 00007ffdfdfa51b0 R09: 0000000002bb9334
[28016.653962] R10: 0000000002bb9334 R11: 0000000000000293 R12: 0000000000000004
[28016.655654] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.657371] INFO: task fio:3958570 blocked for more than 120 seconds.
[28016.658869]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.660724] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.662515] fio             D    0 3958570 3957974 0x00004080
[28016.663915] Call Trace:
[28016.664673]  __schedule+0x2c4/0x700
[28016.665714]  schedule+0x38/0xa0
[28016.666654]  cv_wait_common+0xfb/0x130 [spl]
[28016.667882]  ? finish_wait+0x80/0x80
[28016.668986]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.670310]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.671606]  zfs_fsync+0x70/0xf0 [zfs]
[28016.672763]  zpl_fsync+0x67/0xa0 [zfs]
[28016.674226]  do_fsync+0x38/0x70
[28016.675469]  __x64_sys_fdatasync+0x13/0x20
[28016.676919]  do_syscall_64+0x5b/0x1a0
[28016.677976]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.679403] RIP: 0033:0x7f0f4fd8c56f
[28016.680706] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.682379] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.684144] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.685890] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.687750] RBP: 00007f0ee5f20e48 R08: 00007ffdfdfa51b0 R09: 0000000002bb9334
[28016.689356] R10: 0000000002bb9334 R11: 0000000000000293 R12: 0000000000000004
[28016.690996] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.692600] INFO: task fio:3958573 blocked for more than 120 seconds.
[28016.694338]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.696557] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.698712] fio             D    0 3958573 3957974 0x00004080
[28016.700407] Call Trace:
[28016.701434]  __schedule+0x2c4/0x700
[28016.702744]  ? mutex_lock+0xe/0x30
[28016.703973]  schedule+0x38/0xa0
[28016.705124]  cv_wait_common+0xfb/0x130 [spl]
[28016.706783]  ? finish_wait+0x80/0x80
[28016.708072]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.710225]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.712044]  zfs_fsync+0x70/0xf0 [zfs]
[28016.713468]  zpl_fsync+0x67/0xa0 [zfs]
[28016.714594]  ? __audit_syscall_exit+0x249/0x2a0
[28016.715831]  do_fsync+0x38/0x70
[28016.716771]  __x64_sys_fdatasync+0x13/0x20
[28016.718147]  do_syscall_64+0x5b/0x1a0
[28016.719520]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.721425] RIP: 0033:0x7f0f4fd8c56f
[28016.722923] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.725131] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.727838] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.730170] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.732661] RBP: 00007f0ee5f65860 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.735274] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.737570] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.740189] INFO: task fio:3958574 blocked for more than 120 seconds.
[28016.742785]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.745553] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.748131] fio             D    0 3958574 3957974 0x00004080
[28016.749949] Call Trace:
[28016.751033]  __schedule+0x2c4/0x700
[28016.752479]  ? mutex_lock+0xe/0x30
[28016.753888]  schedule+0x38/0xa0
[28016.755208]  cv_wait_common+0xfb/0x130 [spl]
[28016.756663]  ? finish_wait+0x80/0x80
[28016.757837]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.759437]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.761078]  zfs_fsync+0x70/0xf0 [zfs]
[28016.762392]  zpl_fsync+0x67/0xa0 [zfs]
[28016.763768]  ? __audit_syscall_exit+0x249/0x2a0
[28016.765521]  do_fsync+0x38/0x70
[28016.766773]  __x64_sys_fdatasync+0x13/0x20
[28016.768313]  do_syscall_64+0x5b/0x1a0
[28016.769717]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.771559] RIP: 0033:0x7f0f4fd8c56f
[28016.772646] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.775088] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.778161] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.780521] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.783039] RBP: 00007f0ee5faa278 R08: 00007ffdfdfa51b0 R09: 0000000002bb932c
[28016.785692] R10: 0000000002bb932c R11: 0000000000000293 R12: 0000000000000004
[28016.788147] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.790588] INFO: task fio:3958576 blocked for more than 120 seconds.
[28016.792348]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.794298] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.796761] fio             D    0 3958576 3957974 0x00004080
[28016.798344] Call Trace:
[28016.799236]  __schedule+0x2c4/0x700
[28016.800475]  ? mutex_lock+0xe/0x30
[28016.801519]  schedule+0x38/0xa0
[28016.802672]  cv_wait_common+0xfb/0x130 [spl]
[28016.804259]  ? finish_wait+0x80/0x80
[28016.805466]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.807268]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.809165]  zfs_fsync+0x70/0xf0 [zfs]
[28016.810358]  zpl_fsync+0x67/0xa0 [zfs]
[28016.811747]  ? __audit_syscall_exit+0x249/0x2a0
[28016.813127]  do_fsync+0x38/0x70
[28016.814039]  __x64_sys_fdatasync+0x13/0x20
[28016.815382]  do_syscall_64+0x5b/0x1a0
[28016.816522]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.817880] RIP: 0033:0x7f0f4fd8c56f
[28016.819011] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.820726] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.823542] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.825779] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.827416] RBP: 00007f0ee5feec90 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.829073] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.830771] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
[28016.832418] INFO: task fio:3958577 blocked for more than 120 seconds.
[28016.833914]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[28016.835877] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[28016.837702] fio             D    0 3958577 3957974 0x00004080
[28016.839188] Call Trace:
[28016.840073]  __schedule+0x2c4/0x700
[28016.841097]  ? mutex_lock+0xe/0x30
[28016.842127]  schedule+0x38/0xa0
[28016.843128]  cv_wait_common+0xfb/0x130 [spl]
[28016.844529]  ? finish_wait+0x80/0x80
[28016.846295]  zil_commit_waiter+0xc0/0x150 [zfs]
[28016.848088]  zil_commit_impl+0x16f/0x240 [zfs]
[28016.849427]  zfs_fsync+0x70/0xf0 [zfs]
[28016.850512]  zpl_fsync+0x67/0xa0 [zfs]
[28016.851554]  ? __audit_syscall_exit+0x249/0x2a0
[28016.852813]  do_fsync+0x38/0x70
[28016.853824]  __x64_sys_fdatasync+0x13/0x20
[28016.854903]  do_syscall_64+0x5b/0x1a0
[28016.855874]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[28016.857163] RIP: 0033:0x7f0f4fd8c56f
[28016.858154] Code: Unable to access opcode bytes at RIP 0x7f0f4fd8c545.
[28016.859788] RSP: 002b:00007ffdfdf87bf0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[28016.861503] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f0f4fd8c56f
[28016.863144] RDX: 0000000000000000 RSI: 000055ac19cebec0 RDI: 0000000000000005
[28016.864598] RBP: 00007f0ee60336a8 R08: 00007ffdfdfa51b0 R09: 0000000002bb932e
[28016.866016] R10: 0000000002bb932e R11: 0000000000000293 R12: 0000000000000004
[28016.867698] R13: 000055ac19cebec0 R14: 0000000000000000 R15: 000055ac19cebee8
omarkilani commented 3 years ago

One interesting thing to note here: while synced writes to the filesystem hang, and e.g. zpool sync blocks forever:

[root@instance-20210606-1831 ~]# ps auxwww | grep zpool
root     2976364  0.0  0.0 270084  6252 pts/1    D    05:33   0:00 zpool sync
[root@instance-20210606-1831 ~]# strace -p 2976364
strace: Process 2976364 attached
^C^C^C^C^C^C

The pg_dump job is still making progress:

              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        2.65T  3.16T  9.24K      0   152M      0
  sdb        230G   266G    728      0  15.2M      0
  sdc        247G   249G    735      0  12.6M      0
  sdd        246G   250G   1022      0  16.6M      0
  sde        247G   249G    926      0  14.1M      0
  sdf        184G   312G    640      0  8.23M      0
  sdg        252G   244G    669      0  14.3M      0
  sdh        190G   306G    691      0  9.28M      0
  sdi        258G   238G    830      0  14.0M      0
  sdm        189G   307G    722      0  10.9M      0
  sdj        244G   252G    887      0  13.5M      0
  sdk        241G   255G    933      0  12.2M      0
  sdl        190G   306G    671      0  10.7M      0
----------  -----  -----  -----  -----  -----  -----
              capacity     operations     bandwidth 
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        2.65T  3.16T  9.25K      0  75.4M      0
  sdb        230G   266G    930      0  7.19M      0
  sdc        247G   249G    922      0  6.76M      0
  sdd        246G   250G    986      0  7.86M      0
  sde        247G   249G    978      0  7.27M      0
  sdf        184G   312G    699      0  5.76M      0
  sdg        252G   244G    782      0  6.63M      0
  sdh        190G   306G    739      0  6.39M      0
  sdi        258G   238G    932      0  6.87M      0
  sdm        189G   307G    537      0  4.45M      0
  sdj        244G   252G    771      0  5.81M      0
  sdk        241G   255G    811      0  6.91M      0
  sdl        190G   306G    374      0  3.50M      0
----------  -----  -----  -----  -----  -----  -----
rincebrain commented 3 years ago

Something I meant to mention before but apparently did not: since the top of your stack seems to be in the writeback throttle (wbt) that Linux bolted onto the block layer, you could try turning it off and seeing if you can still reproduce.

(I should warn you that I am not at all familiar with the machinery involved here, and am doing simple association. This might change nothing relevant.)
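For reference, turning the throttle off amounts to writing 0 to each device's `wbt_lat_usec`. A minimal sketch — the sysfs root is parameterized only so the loop can be exercised outside a real system, and device names will vary:

```shell
#!/bin/sh
# Disable the Linux block-layer writeback throttle (wbt) on every
# block device: writing 0 to wbt_lat_usec turns it off entirely.
# The files exist only on kernels built with CONFIG_BLK_WBT, and
# writing them requires root, hence the writability check.
disable_wbt() {
    sysfs_root="${1:-/sys}"
    for f in "$sysfs_root"/block/*/queue/wbt_lat_usec; do
        [ -w "$f" ] && echo 0 > "$f"
    done
    return 0
}

disable_wbt
```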

omarkilani commented 3 years ago

Alright, I'll try that. It looks sort of similar to that issue, but not quite, since that reporter's reads were blocked. FWIW, the pg_dump completed successfully even though writers were deadlocked.

I upgraded to rc7 just to rule out magic, and now that I know I can use fio it's much quicker, since the walwriter only writes very sporadically (I think that was the issue in getting a repro before).

It should be possible to simulate the pg_bench workload with fio too, since that's basically just sequential reads.

I'm not sure whether it matters, kernel-wise, that the output is written to XFS, but I don't think so, since the data goes through userspace to get processed first.

omarkilani commented 3 years ago

Alright, rc7 is still affected.

It took 860 seconds to trigger the deadlock with fio, which is much faster than the 5 hours I was waiting before.

[  860.170253] INFO: task z_wr_iss_h:5802 blocked for more than 120 seconds.
[  860.182902]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[  860.184791] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  860.186977] z_wr_iss_h      D    0  5802      2 0x80004080
[  860.188514] Call Trace:
[  860.189296]  __schedule+0x2c4/0x700
[  860.190294]  ? elv_rb_del+0x1f/0x30
[  860.191303]  ? wbt_exit+0x30/0x30
[  860.192327]  ? __wbt_done+0x30/0x30
[  860.193355]  schedule+0x38/0xa0
[  860.194149]  io_schedule+0x12/0x40
[  860.195090]  rq_qos_wait+0xfa/0x170
[  860.195980]  ? karma_partition+0x1e0/0x1e0
[  860.197106]  ? wbt_exit+0x30/0x30
[  860.197993]  wbt_wait+0x99/0xe0
[  860.198782]  __rq_qos_throttle+0x23/0x30
[  860.199521]  blk_mq_make_request+0x12d/0x5b0
[  860.200646]  generic_make_request+0x2db/0x340
[  860.202048]  ? abd_to_buf+0xe/0x20 [zfs]
[  860.203299]  ? abd_bio_map_off+0x1da/0x210 [zfs]
[  860.204651]  submit_bio+0x3c/0x160
[  860.205683]  __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[  860.207422]  vdev_disk_io_start+0x8d/0x370 [zfs]
[  860.208904]  zio_vdev_io_start+0xf8/0x340 [zfs]
[  860.210400]  zio_nowait+0xb4/0x1a0 [zfs]
[  860.211606]  vdev_mirror_io_start+0xa2/0x250 [zfs]
[  860.213140]  zio_vdev_io_start+0x2c2/0x340 [zfs]
[  860.214646]  zio_execute+0x81/0x120 [zfs]
[  860.215878]  taskq_thread+0x2f0/0x530 [spl]
[  860.217038]  ? wake_up_q+0x80/0x80
[  860.218107]  ? zio_gang_tree_free+0x60/0x60 [zfs]
[  860.219734]  ? taskq_thread_spawn+0x50/0x50 [spl]
[  860.221193]  kthread+0x116/0x130
[  860.222184]  ? kthread_flush_work_fn+0x10/0x10
[  860.223612]  ret_from_fork+0x22/0x40
[  860.224784] INFO: task txg_quiesce:6567 blocked for more than 120 seconds.
[  860.226765]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[  860.229344] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  860.231724] txg_quiesce     D    0  6567      2 0x80004080
[  860.233728] Call Trace:
[  860.234783]  __schedule+0x2c4/0x700
[  860.236223]  ? kfree+0x40b/0x450
[  860.237503]  schedule+0x38/0xa0
[  860.238733]  cv_wait_common+0xfb/0x130 [spl]
[  860.240377]  ? finish_wait+0x80/0x80
[  860.242076]  txg_quiesce+0x1d3/0x260 [zfs]
[  860.243547]  txg_quiesce_thread+0xf9/0x180 [zfs]
[  860.244818]  ? txg_quiesce+0x260/0x260 [zfs]
[  860.246047]  ? __thread_exit+0x20/0x20 [spl]
[  860.247275]  thread_generic_wrapper+0x6f/0x80 [spl]
[  860.248579]  kthread+0x116/0x130
[  860.249381]  ? kthread_flush_work_fn+0x10/0x10
[  860.250351]  ret_from_fork+0x22/0x40
[  860.251162] INFO: task fio:14621 blocked for more than 120 seconds.
[  860.252856]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[  860.255569] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  860.257599] fio             D    0 14621  14167 0x00004080
[  860.259532] Call Trace:
[  860.260568]  __schedule+0x2c4/0x700
[  860.261739]  schedule+0x38/0xa0
[  860.263046]  cv_wait_common+0xfb/0x130 [spl]
[  860.264652]  ? finish_wait+0x80/0x80
[  860.266004]  zil_commit_waiter+0xc0/0x150 [zfs]
[  860.267767]  zil_commit_impl+0x16f/0x240 [zfs]
[  860.269514]  zfs_fsync+0x70/0xf0 [zfs]
[  860.270989]  zpl_fsync+0x67/0xa0 [zfs]
[  860.272414]  ? __audit_syscall_exit+0x249/0x2a0
[  860.274050]  do_fsync+0x38/0x70
[  860.275349]  __x64_sys_fdatasync+0x13/0x20
[  860.276906]  do_syscall_64+0x5b/0x1a0
[  860.278401]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  860.280101] RIP: 0033:0x7f8e7a72056f
[  860.281525] Code: Unable to access opcode bytes at RIP 0x7f8e7a720545.
[  860.283600] RSP: 002b:00007fff3ab2b6f0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[  860.285792] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f8e7a72056f
[  860.287680] RDX: 0000000000000000 RSI: 000055ae9bc7eec0 RDI: 0000000000000005
[  860.289547] RBP: 00007f8e107e7000 R08: 00007fff3ab631b0 R09: 00000000001195dc
[  860.291429] R10: 00000000001195dc R11: 0000000000000293 R12: 0000000000000004
[  860.293318] R13: 000055ae9bc7eec0 R14: 0000000000000000 R15: 000055ae9bc7eee8

So that's kind of both a good thing and a bad thing. lol.

omarkilani commented 3 years ago

I can reliably repro with fio doing writes and pg_dump doing reads... so now I'm trying to come up with a fio script to simulate the pg_dump...

The thing to note here is that the issue isn't only about how fast or slow the disks are -- it's also about how much bandwidth you have to them and how saturated it is with a specific read/write pattern.

So, e.g., the reason it triggers on the Milan VM but not the Ice Lake VM is that, for whatever reason, the Milan VM has 6Gbps of storage bandwidth and the Ice Lake VM has 24Gbps, for the same CPU count.

That's probably a harder thing to trigger on locally attached disks because ZFS is usually slower than the PCIe/SAS/SATA bus, which is... maybe?... why it hasn't been reported too much.

omarkilani commented 3 years ago

@rincebrain the interesting thing about #12085 and this is that that one is stuck in wbt_wait while my backtraces are after a wbt_exit.

Could be the same thing of course. Just a slight difference.

youzhongyang commented 3 years ago

My understanding is that in a stack trace, any frame prefixed with ? can be ignored. Since this can be easily reproduced, it would be nice to know whether it still reproduces after wbt is turned off. Thanks.

omarkilani commented 3 years ago

@youzhongyang I’m working my way towards testing with wbt off.

It would be nice to have a way to simulate the reads since no one else can probably get a repro otherwise…

rincebrain commented 3 years ago

@rincebrain the interesting thing about #12085 and this is that that one is stuck in wbt_wait while my backtraces are after a wbt_exit.

Could be the same thing of course. Just a slight difference.

Sure, I wasn't trying to suggest you were having the same issue, solely pointing to that example of what to set.

omarkilani commented 3 years ago

@rincebrain yup, got it. If I can’t repro with wbt off then…?

Is there a tool which generates data that is A.Bx compressible by ZFS’ lz4?

The data fio generates is too random and lz4 decompression doesn’t show up in perf top when running the fio read workload.

rincebrain commented 3 years ago

I'm not aware of one, but since AFAIK it's stock lz4 with few modifications that should affect the compression ratios, you could reduce this to the problem of finding data that stock lz4 from 2013 (which is when the initial lz4 commit went into illumos, AFAICT) will compress as much as you like, then verifying it does what you want.

edit: As to your other question, if you can't repro with WBT off, then it becomes a question of who's doing something wrong, WBT or ZFS, and who's going to do something about it.
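One way to sidestep hunting for such data: since the dataset uses recordsize=8k and lz4 compresses per record, each 8K record can be built to a rough target ratio by mixing random bytes with zero padding. A sketch — the output path, sizes, and variable names are illustrative, not anything from the issue:

```shell
#!/bin/sh
# Build a file whose per-record lz4 ratio is roughly tunable: each 8K
# record (matching recordsize=8k) holds RAND_BYTES of urandom followed
# by zero padding, so lz4 squeezes every record to about RAND_BYTES,
# i.e. a ratio near 8192/RAND_BYTES (the default 4096 gives roughly 2x).
OUT="${OUT:-/tmp/compressible.dat}"
RAND_BYTES="${RAND_BYTES:-4096}"
RECORDS="${RECORDS:-128}"

: > "$OUT"
i=0
while [ "$i" -lt "$RECORDS" ]; do
    dd if=/dev/urandom bs="$RAND_BYTES" count=1 2>/dev/null >> "$OUT"
    dd if=/dev/zero bs="$((8192 - RAND_BYTES))" count=1 2>/dev/null >> "$OUT"
    i=$((i + 1))
done
```

Checking the achieved ratio afterwards is just `zfs get compressratio` on whichever dataset the file lands in.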

omarkilani commented 3 years ago

Okay, I have a repro with just fio.

VM specs:

Shape:
VM.Standard.E4.Flex

OCPU Count: 
6

Network Bandwidth (Gbps):
6

Memory (GB):
96

Local Disk: 
Block Storage Only

Which turns into 12 of these:

processor   : 11
vendor_id   : AuthenticAMD
cpu family  : 25
model       : 1
model name  : AMD EPYC 7J13 64-Core Processor
stepping    : 1
microcode   : 0x1000065
cpu MHz     : 2545.214
cache size  : 512 KB
physical id : 0
siblings    : 12
core id     : 5
cpu cores   : 6
apicid      : 11
initial apicid  : 11
fpu     : yes
fpu_exception   : yes
cpuid level : 16
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core cpb ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip vaes vpclmulqdq rdpid arch_capabilities
bugs        : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips    : 5090.42
TLB size    : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Block volumes: 12x500GB with:

Target Performance:
Lower Cost (VPU:0)

Target IOPS:
1000 IOPS

Target Throughput:
120 MB/s

Kernel / zfs:

[root@instance-20210606-1831 fio]# zpool version
zfs-2.1.0-rc7
zfs-kmod-2.1.0-rc7

[root@instance-20210606-1831 fio]# uname -a
Linux instance-20210606-1831 4.18.0-305.3.1.el8_4.x86_64 #1 SMP Tue Jun 1 13:25:54 PDT 2021 x86_64 x86_64 x86_64 GNU/Linux

I ran this script to make wbt throttle more aggressively (a lower latency target):

#!/bin/sh

for f in /sys/block/*/queue/wbt_lat_usec; do
    echo 9375 > $f
done

The zpool looks like this:

[root@instance-20210606-1831 ~]# zpool list -v
NAME        SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
tank       5.81T  3.16T  2.65T        -         -    29%    54%  1.00x    ONLINE  -
  sdl       496G   274G   222G        -         -    27%  55.3%      -    ONLINE
  sdb       496G   285G   211G        -         -    31%  57.5%      -    ONLINE
  sdc       496G   284G   212G        -         -    32%  57.3%      -    ONLINE
  sdi       496G   290G   206G        -         -    32%  58.4%      -    ONLINE
  sdg       496G   230G   266G        -         -    27%  46.3%      -    ONLINE
  sdh       496G   294G   202G        -         -    31%  59.3%      -    ONLINE
  sdn       496G   235G   261G        -         -    27%  47.5%      -    ONLINE
  sdk       496G   298G   198G        -         -    33%  60.1%      -    ONLINE
  sdj       496G   235G   261G        -         -    26%  47.4%      -    ONLINE
  sdd       496G   287G   209G        -         -    30%  57.8%      -    ONLINE
  sde       496G   283G   213G        -         -    31%  57.1%      -    ONLINE
  sdf       496G   239G   257G        -         -    28%  48.2%      -    ONLINE

Created with zpool create -o ashift=12 tank /dev/....

The zfs:

zfs create -o compression=lz4 -o xattr=sa -o atime=off -o logbias=throughput -o recordsize=8k -o primarycache=metadata tank/db

pg_walwriter.fio:

[global]
ioengine=sync
bs=8K
iodepth=1
fdatasync=1
group_reporting
time_based
runtime=60000
numjobs=12
name=raw-write
rw=write

[job1]
filename=/tank/db/pg_walwriter_001
size=16MB

pg_dump_read_big.fio

[global]
ioengine=sync
bs=8K
iodepth=128
direct=1
group_reporting
time_based
runtime=60000
name=raw-read
rw=read
size=512GB
numjobs=48
io_submit_mode=offload

[job1]
filename=/tank/db/pg_dump_read_big

Run them both and wait a while.
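To run the repro unattended, a small watcher (a sketch, nothing from the issue) can poll the kernel log for the hung-task reports; LOG_CMD is overridable so the check can be exercised without a real deadlock:

```shell
#!/bin/sh
# Check the kernel log for the hung-task reports the repro produces.
# LOG_CMD defaults to dmesg but can be swapped (e.g. journalctl -k);
# a single check is shown here, a real run would loop with a sleep.
LOG_CMD="${LOG_CMD:-dmesg}"

check_hung() {
    $LOG_CMD 2>/dev/null | grep -q 'blocked for more than'
}

if check_hung; then
    echo "hung task detected"
fi
```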

Note that the original deadlock happened with primarycache=all, but I changed it to primarycache=metadata so the pg_dump simulation wouldn't hit the ARC (the real Postgres data set was 2.8TB compressed, so it didn't fit into the 96GB of RAM, but enough of the 512GB fio file did).

Here's the deadlock with just fio:

[24085.116306] INFO: task z_wr_iss_h:1007366 blocked for more than 120 seconds.
[24085.121243]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.122825] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.124256] z_wr_iss_h      D    0 1007366      2 0x80004080
[24085.125376] Call Trace:
[24085.125810]  __schedule+0x2c4/0x700
[24085.126540]  ? elv_rb_del+0x1f/0x30
[24085.127224]  ? wbt_exit+0x30/0x30
[24085.127874]  ? __wbt_done+0x30/0x30
[24085.128624]  schedule+0x38/0xa0
[24085.129166]  io_schedule+0x12/0x40
[24085.129872]  rq_qos_wait+0xfa/0x170
[24085.130559]  ? karma_partition+0x1e0/0x1e0
[24085.131311]  ? wbt_exit+0x30/0x30
[24085.131916]  wbt_wait+0x99/0xe0
[24085.132929]  __rq_qos_throttle+0x23/0x30
[24085.134020]  blk_mq_make_request+0x12d/0x5b0
[24085.135333]  generic_make_request+0x2db/0x340
[24085.136646]  ? abd_to_buf+0xe/0x20 [zfs]
[24085.137998]  ? abd_bio_map_off+0x1da/0x210 [zfs]
[24085.139362]  submit_bio+0x3c/0x160
[24085.140472]  __vdev_disk_physio.constprop.0+0x342/0x430 [zfs]
[24085.141983]  vdev_disk_io_start+0x8d/0x370 [zfs]
[24085.143240]  zio_vdev_io_start+0xf8/0x340 [zfs]
[24085.144423]  zio_nowait+0xb4/0x1a0 [zfs]
[24085.145617]  vdev_mirror_io_start+0xa2/0x250 [zfs]
[24085.146942]  zio_vdev_io_start+0x2c2/0x340 [zfs]
[24085.148370]  zio_execute+0x81/0x120 [zfs]
[24085.149879]  taskq_thread+0x2f0/0x530 [spl]
[24085.151034]  ? wake_up_q+0x80/0x80
[24085.152263]  ? zio_gang_tree_free+0x60/0x60 [zfs]
[24085.153474]  ? taskq_thread_spawn+0x50/0x50 [spl]
[24085.154816]  kthread+0x116/0x130
[24085.156182]  ? kthread_flush_work_fn+0x10/0x10
[24085.157547]  ret_from_fork+0x22/0x40
[24085.158590] INFO: task txg_quiesce:1008466 blocked for more than 120 seconds.
[24085.160272]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.162169] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.164169] txg_quiesce     D    0 1008466      2 0x80004080
[24085.165870] Call Trace:
[24085.166868]  __schedule+0x2c4/0x700
[24085.168191]  ? kfree+0x40b/0x450
[24085.169402]  schedule+0x38/0xa0
[24085.170501]  cv_wait_common+0xfb/0x130 [spl]
[24085.171681]  ? finish_wait+0x80/0x80
[24085.172687]  txg_quiesce+0x1d3/0x260 [zfs]
[24085.173762]  txg_quiesce_thread+0xf9/0x180 [zfs]
[24085.174886]  ? txg_quiesce+0x260/0x260 [zfs]
[24085.176005]  ? __thread_exit+0x20/0x20 [spl]
[24085.177174]  thread_generic_wrapper+0x6f/0x80 [spl]
[24085.178574]  kthread+0x116/0x130
[24085.179562]  ? kthread_flush_work_fn+0x10/0x10
[24085.180723]  ret_from_fork+0x22/0x40
[24085.181859] INFO: task fio:1009087 blocked for more than 120 seconds.
[24085.183619]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.185671] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.187461] fio             D    0 1009087 1009085 0x00004080
[24085.188938] Call Trace:
[24085.189956]  __schedule+0x2c4/0x700
[24085.191217]  schedule+0x38/0xa0
[24085.193105]  cv_wait_common+0xfb/0x130 [spl]
[24085.194570]  ? finish_wait+0x80/0x80
[24085.196056]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.197269]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.198554]  zfs_fsync+0x70/0xf0 [zfs]
[24085.199647]  zpl_fsync+0x67/0xa0 [zfs]
[24085.200794]  ? __audit_syscall_exit+0x249/0x2a0
[24085.202116]  do_fsync+0x38/0x70
[24085.203126]  __x64_sys_fdatasync+0x13/0x20
[24085.204247]  do_syscall_64+0x5b/0x1a0
[24085.205522]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.206915] RIP: 0033:0x7f69eec8156f
[24085.207900] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.209406] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.211347] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.213173] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.214908] RBP: 00007f6984d48000 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.216507] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.218198] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.219861] INFO: task fio:1009088 blocked for more than 120 seconds.
[24085.221460]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.223170] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.225017] fio             D    0 1009088 1009085 0x00004080
[24085.226427] Call Trace:
[24085.227404]  __schedule+0x2c4/0x700
[24085.228645]  ? mutex_lock+0xe/0x30
[24085.229920]  schedule+0x38/0xa0
[24085.231061]  cv_wait_common+0xfb/0x130 [spl]
[24085.232426]  ? finish_wait+0x80/0x80
[24085.233585]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.234935]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.235912]  zfs_fsync+0x70/0xf0 [zfs]
[24085.236891]  zpl_fsync+0x67/0xa0 [zfs]
[24085.237953]  ? __audit_syscall_exit+0x249/0x2a0
[24085.239352]  do_fsync+0x38/0x70
[24085.240481]  __x64_sys_fdatasync+0x13/0x20
[24085.241822]  do_syscall_64+0x5b/0x1a0
[24085.242958]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.244345] RIP: 0033:0x7f69eec8156f
[24085.245351] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.246920] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.248751] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007f69eec8156f
[24085.250300] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000006
[24085.252100] RBP: 00007f6984d8ca18 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.253846] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.255608] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.257202] INFO: task fio:1009089 blocked for more than 120 seconds.
[24085.258794]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.260737] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.262414] fio             D    0 1009089 1009085 0x00004080
[24085.263780] Call Trace:
[24085.264590]  __schedule+0x2c4/0x700
[24085.265583]  ? __switch_to_asm+0x41/0x70
[24085.266645]  schedule+0x38/0xa0
[24085.267519]  cv_wait_common+0xfb/0x130 [spl]
[24085.268590]  ? finish_wait+0x80/0x80
[24085.269576]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.270758]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.271906]  zfs_fsync+0x70/0xf0 [zfs]
[24085.272944]  zpl_fsync+0x67/0xa0 [zfs]
[24085.273964]  ? __audit_syscall_exit+0x249/0x2a0
[24085.275168]  do_fsync+0x38/0x70
[24085.276135]  __x64_sys_fdatasync+0x13/0x20
[24085.277285]  do_syscall_64+0x5b/0x1a0
[24085.278336]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.279732] RIP: 0033:0x7f69eec8156f
[24085.280867] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.282573] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.284350] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.286039] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.287784] RBP: 00007f6984dd1430 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.289661] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.291265] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.292910] INFO: task fio:1009090 blocked for more than 120 seconds.
[24085.294367]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.296205] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.298011] fio             D    0 1009090 1009085 0x00004080
[24085.299397] Call Trace:
[24085.300185]  __schedule+0x2c4/0x700
[24085.301177]  schedule+0x38/0xa0
[24085.302137]  cv_wait_common+0xfb/0x130 [spl]
[24085.303209]  ? finish_wait+0x80/0x80
[24085.304283]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.305520]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.306681]  zfs_fsync+0x70/0xf0 [zfs]
[24085.307777]  zpl_fsync+0x67/0xa0 [zfs]
[24085.308838]  ? __audit_syscall_exit+0x249/0x2a0
[24085.309985]  do_fsync+0x38/0x70
[24085.310926]  __x64_sys_fdatasync+0x13/0x20
[24085.312021]  do_syscall_64+0x5b/0x1a0
[24085.313062]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.314338] RIP: 0033:0x7f69eec8156f
[24085.315316] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.316941] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.318960] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.320865] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.322509] RBP: 00007f6984e15e48 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.324074] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.325659] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.327325] INFO: task fio:1009091 blocked for more than 120 seconds.
[24085.329045]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.330867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.332518] fio             D    0 1009091 1009085 0x00004080
[24085.333926] Call Trace:
[24085.334765]  __schedule+0x2c4/0x700
[24085.335834]  schedule+0x38/0xa0
[24085.336778]  cv_wait_common+0xfb/0x130 [spl]
[24085.337943]  ? finish_wait+0x80/0x80
[24085.339007]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.340199]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.341454]  zfs_fsync+0x70/0xf0 [zfs]
[24085.342456]  zpl_fsync+0x67/0xa0 [zfs]
[24085.343408]  ? __audit_syscall_exit+0x249/0x2a0
[24085.344499]  do_fsync+0x38/0x70
[24085.345375]  __x64_sys_fdatasync+0x13/0x20
[24085.346369]  do_syscall_64+0x5b/0x1a0
[24085.347315]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.348502] RIP: 0033:0x7f69eec8156f
[24085.349457] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.350972] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.352593] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.354028] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.355646] RBP: 00007f6984e5a860 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.357349] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.359321] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.360919] INFO: task fio:1009092 blocked for more than 120 seconds.
[24085.362466]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.364545] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.366632] fio             D    0 1009092 1009085 0x00004080
[24085.368335] Call Trace:
[24085.369415]  __schedule+0x2c4/0x700
[24085.370637]  schedule+0x38/0xa0
[24085.371752]  cv_wait_common+0xfb/0x130 [spl]
[24085.373110]  ? finish_wait+0x80/0x80
[24085.374341]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.375823]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.377294]  zfs_fsync+0x70/0xf0 [zfs]
[24085.378551]  zpl_fsync+0x67/0xa0 [zfs]
[24085.379544]  ? __audit_syscall_exit+0x249/0x2a0
[24085.380843]  do_fsync+0x38/0x70
[24085.381916]  __x64_sys_fdatasync+0x13/0x20
[24085.382987]  do_syscall_64+0x5b/0x1a0
[24085.384066]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.385376] RIP: 0033:0x7f69eec8156f
[24085.386327] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.387778] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.389504] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.391146] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.392772] RBP: 00007f6984e9f278 R08: 00007fffc3d641b0 R09: 0000000002c81846
[24085.394302] R10: 0000000002c81846 R11: 0000000000000293 R12: 0000000000000004
[24085.395859] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.397391] INFO: task fio:1009093 blocked for more than 120 seconds.
[24085.398844]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.400715] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.402517] fio             D    0 1009093 1009085 0x00004080
[24085.403948] Call Trace:
[24085.404786]  __schedule+0x2c4/0x700
[24085.405797]  ? mutex_lock+0xe/0x30
[24085.406814]  schedule+0x38/0xa0
[24085.407740]  cv_wait_common+0xfb/0x130 [spl]
[24085.408953]  ? finish_wait+0x80/0x80
[24085.410072]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.411265]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.412488]  zfs_fsync+0x70/0xf0 [zfs]
[24085.413673]  zpl_fsync+0x67/0xa0 [zfs]
[24085.414626]  ? __audit_syscall_exit+0x249/0x2a0
[24085.415786]  do_fsync+0x38/0x70
[24085.416731]  __x64_sys_fdatasync+0x13/0x20
[24085.417898]  do_syscall_64+0x5b/0x1a0
[24085.418929]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.420225] RIP: 0033:0x7f69eec8156f
[24085.421184] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.422748] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.424491] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.426109] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.427732] RBP: 00007f6984ee3c90 R08: 00007fffc3d641b0 R09: 0000000002c81846
[24085.429383] R10: 0000000002c81846 R11: 0000000000000293 R12: 0000000000000004
[24085.430976] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8
[24085.432658] INFO: task fio:1009094 blocked for more than 120 seconds.
[24085.434257]       Tainted: P           OE    --------- -t - 4.18.0-305.3.1.el8_4.x86_64 #1
[24085.436123] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[24085.437790] fio             D    0 1009094 1009085 0x00004080
[24085.439133] Call Trace:
[24085.439942]  __schedule+0x2c4/0x700
[24085.440949]  ? mutex_lock+0xe/0x30
[24085.441998]  schedule+0x38/0xa0
[24085.443003]  cv_wait_common+0xfb/0x130 [spl]
[24085.444222]  ? finish_wait+0x80/0x80
[24085.445235]  zil_commit_waiter+0xc0/0x150 [zfs]
[24085.446736]  zil_commit_impl+0x16f/0x240 [zfs]
[24085.447909]  zfs_fsync+0x70/0xf0 [zfs]
[24085.449051]  zpl_fsync+0x67/0xa0 [zfs]
[24085.450146]  ? __audit_syscall_exit+0x249/0x2a0
[24085.451380]  do_fsync+0x38/0x70
[24085.452272]  __x64_sys_fdatasync+0x13/0x20
[24085.453341]  do_syscall_64+0x5b/0x1a0
[24085.454420]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[24085.455657] RIP: 0033:0x7f69eec8156f
[24085.456645] Code: Unable to access opcode bytes at RIP 0x7f69eec81545.
[24085.458223] RSP: 002b:00007fffc3c4cda0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[24085.460420] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f69eec8156f
[24085.462074] RDX: 0000000000000000 RSI: 0000558d55e43ec0 RDI: 0000000000000005
[24085.463679] RBP: 00007f6984f286a8 R08: 00007fffc3d641b0 R09: 0000000002c81844
[24085.465485] R10: 0000000002c81844 R11: 0000000000000293 R12: 0000000000000004
[24085.467321] R13: 0000558d55e43ec0 R14: 0000000000000000 R15: 0000558d55e43ee8

This one took roughly 3000 seconds to trigger.

Next I'll disable wbt. :)

omarkilani commented 3 years ago

I'm running two tests in parallel at the moment.

  1. Kernel 4.18.0-305.3.1.el8_4.x86_64 with wbt_lat_usec = 0.
  2. Kernel 5.4.17-2102.201.3.el8uek.x86_64 with wbt_lat_usec = 9375.

So far, test 1 has been running for almost 6 hours with quite a bit of I/O load:

 08:29:21 up  5:52,  1 user,  load average: 217.34, 217.99, 217.35

And hasn't deadlocked. I'm running the two fio workloads and the original pg_dump at the same time.

Looks promising so far.

Edit: still looks good:

[root@instance-20210606-1831 ~]# uptime
 12:47:25 up 10:10,  1 user,  load average: 219.59, 218.13, 218.83
[root@instance-20210606-1831 ~]# dmesg
[root@instance-20210606-1831 ~]# 
omarkilani commented 3 years ago

On 5.4.17-2102.201.3.el8uek.x86_64, the deadlock is similarly easy to trigger with pg_dump and fio running with wbt_lat_usec = 9375.

[ 1967.106950] INFO: task z_wr_iss_h:6151 blocked for more than 122 seconds.
[ 1967.111868]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.113351] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.115027] z_wr_iss_h      D    0  6151      2 0x80004080
[ 1967.116258] Call Trace:
[ 1967.117029]  __schedule+0x2ab/0x606
[ 1967.117910]  ? wbt_exit+0x30/0x2f
[ 1967.118720]  ? __wbt_done+0x40/0x35
[ 1967.119566]  schedule+0x33/0x9a
[ 1967.120365]  io_schedule+0x16/0x39
[ 1967.121206]  rq_qos_wait+0xf7/0x16c
[ 1967.122043]  ? karma_partition+0x210/0x204
[ 1967.123034]  ? wbt_exit+0x30/0x2f
[ 1967.123832]  wbt_wait+0x9f/0xdb
[ 1967.124624]  __rq_qos_throttle+0x28/0x36
[ 1967.125535]  blk_mq_make_request+0x11f/0x5b1
[ 1967.126514]  generic_make_request+0xcf/0x30b
[ 1967.127484]  submit_bio+0x46/0x1bf
[ 1967.128488]  __vdev_disk_physio.constprop.0+0x354/0x460 [zfs]
[ 1967.129731]  vdev_disk_io_start+0x96/0x380 [zfs]
[ 1967.130776]  zio_vdev_io_start+0x101/0x360 [zfs]
[ 1967.131814]  zio_nowait+0xc3/0x1a0 [zfs]
[ 1967.132749]  vdev_mirror_io_start+0xaf/0x280 [zfs]
[ 1967.133814]  zio_vdev_io_start+0x2d4/0x360 [zfs]
[ 1967.134823]  ? tsd_hash_search.isra.0+0x71/0xa0 [spl]
[ 1967.135929]  zio_execute+0x89/0x130 [zfs]
[ 1967.136824]  taskq_thread+0x31a/0x5a0 [spl]
[ 1967.137775]  ? wake_up_q+0x80/0x77
[ 1967.138619]  ? zio_gang_tree_free+0x70/0x70 [zfs]
[ 1967.139655]  kthread+0x120/0x136
[ 1967.140450]  ? taskq_thread_spawn+0x60/0x60 [spl]
[ 1967.141469]  ? kthread_park+0x90/0x82
[ 1967.142484]  ret_from_fork+0x2b/0x36
[ 1967.143315] INFO: task txg_quiesce:6936 blocked for more than 122 seconds.
[ 1967.144653]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.146095] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.147564] txg_quiesce     D    0  6936      2 0x80004080
[ 1967.148801] Call Trace:
[ 1967.149514]  __schedule+0x2ab/0x606
[ 1967.150335]  schedule+0x33/0x9a
[ 1967.151109]  cv_wait_common+0x100/0x130 [spl]
[ 1967.152066]  ? finish_wait+0x80/0x78
[ 1967.153014]  __cv_wait+0x15/0x20 [spl]
[ 1967.153917]  txg_quiesce+0x1b9/0x250 [zfs]
[ 1967.154854]  txg_quiesce_thread+0xfd/0x180 [zfs]
[ 1967.155900]  ? txg_quiesce+0x250/0x250 [zfs]
[ 1967.157010]  thread_generic_wrapper+0x79/0x90 [spl]
[ 1967.158170]  kthread+0x120/0x136
[ 1967.159022]  ? __thread_exit+0x20/0x20 [spl]
[ 1967.160119]  ? kthread_park+0x90/0x82
[ 1967.161048]  ret_from_fork+0x2b/0x36
[ 1967.162010] INFO: task postmaster:493613 blocked for more than 122 seconds.
[ 1967.163553]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.165162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.166620] postmaster      D    0 493613   8157 0x00004084
[ 1967.167750] Call Trace:
[ 1967.168418]  __schedule+0x2ab/0x606
[ 1967.169251]  schedule+0x33/0x9a
[ 1967.170025]  cv_wait_common+0x100/0x130 [spl]
[ 1967.170994]  ? finish_wait+0x80/0x78
[ 1967.171915]  __cv_wait+0x15/0x20 [spl]
[ 1967.172927]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.174031]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.175156]  zil_commit+0x40/0x60 [zfs]
[ 1967.176177]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.177190]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.178076]  vfs_fsync_range+0x48/0x7c
[ 1967.179066]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.180145]  do_fsync+0x3d/0x6d
[ 1967.180979]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.181988]  do_syscall_64+0x60/0x1cb
[ 1967.182926]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.184207] RIP: 0033:0x7f64d6222548
[ 1967.185119] Code: Bad RIP value.
[ 1967.185980] RSP: 002b:00007ffd8a9cd6f8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
[ 1967.187893] RAX: ffffffffffffffda RBX: 0000000000d44874 RCX: 00007f64d6222548
[ 1967.189340] RDX: 0000000000761fff RSI: 0000000000886343 RDI: 0000000000000003
[ 1967.190698] RBP: 0000000000886343 R08: 00007ffd8a9cd7c8 R09: 00007ffd8a9cd7c0
[ 1967.192130] R10: 0000000060c4e823 R11: 0000000000000246 R12: 00007f64d8e11d00
[ 1967.193483] R13: 0000886343762000 R14: 0000000000000000 R15: 00007f5c92a1a000
[ 1967.194854] INFO: task fio:502104 blocked for more than 122 seconds.
[ 1967.196121]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.197599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.199142] fio             D    0 502104      1 0x00004084
[ 1967.200268] Call Trace:
[ 1967.200932]  __schedule+0x2ab/0x606
[ 1967.201766]  schedule+0x33/0x9a
[ 1967.202548]  cv_wait_common+0x100/0x130 [spl]
[ 1967.203514]  ? finish_wait+0x80/0x78
[ 1967.204360]  __cv_wait+0x15/0x20 [spl]
[ 1967.205260]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.206270]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.207254]  zil_commit+0x40/0x60 [zfs]
[ 1967.208143]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.209086]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.209944]  vfs_fsync_range+0x48/0x7c
[ 1967.210820]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.211820]  do_fsync+0x3d/0x6d
[ 1967.212594]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.213508]  do_syscall_64+0x60/0x1cb
[ 1967.214356]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.215405] RIP: 0033:0x7fd7622f956f
[ 1967.216218] Code: Bad RIP value.
[ 1967.217046] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.218488] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.219846] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.221217] RBP: 00007fd6f83c0000 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.222584] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.224054] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.225489] INFO: task fio:502105 blocked for more than 122 seconds.
[ 1967.226882]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.228443] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.230077] fio             D    0 502105      1 0x00004084
[ 1967.231238] Call Trace:
[ 1967.231981]  __schedule+0x2ab/0x606
[ 1967.232806]  schedule+0x33/0x9a
[ 1967.233606]  cv_wait_common+0x100/0x130 [spl]
[ 1967.234623]  ? finish_wait+0x80/0x78
[ 1967.235493]  __cv_wait+0x15/0x20 [spl]
[ 1967.236414]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.237448]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.238502]  zil_commit+0x40/0x60 [zfs]
[ 1967.239420]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.240374]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.241246]  vfs_fsync_range+0x48/0x7c
[ 1967.242094]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.243060]  do_fsync+0x3d/0x6d
[ 1967.243810]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.244733]  do_syscall_64+0x60/0x1cb
[ 1967.245577]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.246607] RIP: 0033:0x7fd7622f956f
[ 1967.247439] Code: Bad RIP value.
[ 1967.248199] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.249669] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.251120] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.252501] RBP: 00007fd6f8404a18 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.254039] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.255457] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.256846] INFO: task fio:502110 blocked for more than 123 seconds.
[ 1967.258118]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.259535] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.261270] fio             D    0 502110      1 0x00004084
[ 1967.262509] Call Trace:
[ 1967.263195]  __schedule+0x2ab/0x606
[ 1967.264070]  schedule+0x33/0x9a
[ 1967.265101]  cv_wait_common+0x100/0x130 [spl]
[ 1967.266128]  ? finish_wait+0x80/0x78
[ 1967.267099]  __cv_wait+0x15/0x20 [spl]
[ 1967.268148]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.269174]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.270278]  zil_commit+0x40/0x60 [zfs]
[ 1967.271222]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.272237]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.273298]  vfs_fsync_range+0x48/0x7c
[ 1967.274239]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.275284]  do_fsync+0x3d/0x6d
[ 1967.276136]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.277252]  do_syscall_64+0x60/0x1cb
[ 1967.278143]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.279243] RIP: 0033:0x7fd7622f956f
[ 1967.280093] Code: Bad RIP value.
[ 1967.280831] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.282296] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.283712] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.285124] RBP: 00007fd6f8449430 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.286777] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.288481] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.289893] INFO: task fio:502113 blocked for more than 123 seconds.
[ 1967.291141]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.292641] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.294245] fio             D    0 502113      1 0x00004084
[ 1967.295476] Call Trace:
[ 1967.296190]  __schedule+0x2ab/0x606
[ 1967.297050]  schedule+0x33/0x9a
[ 1967.297864]  cv_wait_common+0x100/0x130 [spl]
[ 1967.298880]  ? finish_wait+0x80/0x78
[ 1967.299728]  __cv_wait+0x15/0x20 [spl]
[ 1967.300714]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.301828]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.303008]  zil_commit+0x40/0x60 [zfs]
[ 1967.304121]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.305095]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.306068]  vfs_fsync_range+0x48/0x7c
[ 1967.306991]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.308015]  do_fsync+0x3d/0x6d
[ 1967.308812]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.309719]  do_syscall_64+0x60/0x1cb
[ 1967.310808]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.311904] RIP: 0033:0x7fd7622f956f
[ 1967.312723] Code: Bad RIP value.
[ 1967.313523] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.315026] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.316466] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.317884] RBP: 00007fd6f848de48 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.319332] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.320800] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.322315] INFO: task fio:502120 blocked for more than 123 seconds.
[ 1967.323674]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.325222] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.326786] fio             D    0 502120      1 0x00004084
[ 1967.328066] Call Trace:
[ 1967.328769]  __schedule+0x2ab/0x606
[ 1967.329669]  schedule+0x33/0x9a
[ 1967.330559]  cv_wait_common+0x100/0x130 [spl]
[ 1967.331565]  ? finish_wait+0x80/0x78
[ 1967.332482]  __cv_wait+0x15/0x20 [spl]
[ 1967.333596]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.334739]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.335768]  zil_commit+0x40/0x60 [zfs]
[ 1967.336780]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.337793]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.338724]  vfs_fsync_range+0x48/0x7c
[ 1967.339713]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.341104]  do_fsync+0x3d/0x6d
[ 1967.341963]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.342922]  do_syscall_64+0x60/0x1cb
[ 1967.343783]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.345058] RIP: 0033:0x7fd7622f956f
[ 1967.346054] Code: Bad RIP value.
[ 1967.346827] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.348378] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.349846] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.351267] RBP: 00007fd6f84d2860 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.352656] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.354082] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.355475] INFO: task fio:502124 blocked for more than 123 seconds.
[ 1967.356760]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.358204] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.359687] fio             D    0 502124      1 0x00004084
[ 1967.360819] Call Trace:
[ 1967.361539]  __schedule+0x2ab/0x606
[ 1967.362386]  schedule+0x33/0x9a
[ 1967.363206]  cv_wait_common+0x100/0x130 [spl]
[ 1967.364178]  ? finish_wait+0x80/0x78
[ 1967.365071]  __cv_wait+0x15/0x20 [spl]
[ 1967.366019]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.367113]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.368161]  zil_commit+0x40/0x60 [zfs]
[ 1967.369148]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.370068]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.371064]  vfs_fsync_range+0x48/0x7c
[ 1967.371967]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.372938]  do_fsync+0x3d/0x6d
[ 1967.373688]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.374626]  do_syscall_64+0x60/0x1cb
[ 1967.375494]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.376537] RIP: 0033:0x7fd7622f956f
[ 1967.377372] Code: Bad RIP value.
[ 1967.378147] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.379575] RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fd7622f956f
[ 1967.380958] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000006
[ 1967.382364] RBP: 00007fd6f8517278 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.383766] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.385168] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
[ 1967.386571] INFO: task fio:502125 blocked for more than 123 seconds.
[ 1967.388023]       Tainted: P           OE     5.4.17-2102.201.3.el8uek.x86_64 #2
[ 1967.389533] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1967.391137] fio             D    0 502125      1 0x00004084
[ 1967.392325] Call Trace:
[ 1967.393012]  __schedule+0x2ab/0x606
[ 1967.393868]  schedule+0x33/0x9a
[ 1967.394647]  cv_wait_common+0x100/0x130 [spl]
[ 1967.395625]  ? finish_wait+0x80/0x78
[ 1967.396504]  __cv_wait+0x15/0x20 [spl]
[ 1967.397443]  zil_commit_waiter+0xbf/0x150 [zfs]
[ 1967.398476]  zil_commit_impl+0x17b/0x250 [zfs]
[ 1967.399563]  zil_commit+0x40/0x60 [zfs]
[ 1967.400503]  zfs_fsync+0x7a/0x100 [zfs]
[ 1967.401466]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1967.402386]  vfs_fsync_range+0x48/0x7c
[ 1967.403310]  ? __audit_syscall_exit+0x24a/0x29c
[ 1967.404331]  do_fsync+0x3d/0x6d
[ 1967.405136]  __x64_sys_fdatasync+0x17/0x1b
[ 1967.406168]  do_syscall_64+0x60/0x1cb
[ 1967.407037]  entry_SYSCALL_64_after_hwframe+0x170/0x0
[ 1967.408122] RIP: 0033:0x7fd7622f956f
[ 1967.408963] Code: Bad RIP value.
[ 1967.409735] RSP: 002b:00007ffeb0e698e0 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1967.411228] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fd7622f956f
[ 1967.412743] RDX: 0000000000000000 RSI: 000055b51875cec0 RDI: 0000000000000005
[ 1967.414198] RBP: 00007fd6f855bc90 R08: 000055b51875ced8 R09: 000055b51875ced0
[ 1967.415645] R10: 00000000000006df R11: 0000000000000293 R12: 0000000000000004
[ 1967.417179] R13: 000055b51875cec0 R14: 0000000000000000 R15: 000055b51875cee8
omarkilani commented 3 years ago

I went through every revision of blk-wbt.c to see when RH merged the code into their kernel. It's as close to 5.4.1 as you can get, though the core locking logic changed between 4.x and 5.0.1. Here's the diff against the current 5.13.0 rc (5.13.0-rc5) -- no functional difference apart from the debugfs additions:

--- 4.18.0-305.3.1/blk-wbt.c    2021-06-12 15:53:53.714494969 +0000
+++ 5.13.0-rc5/blk-wbt.c    2021-06-12 16:16:54.000000000 +0000
@@ -1,3 +1,4 @@
+// SPDX-License-Identifier: GPL-2.0
 /*
  * buffered writeback throttling. loosely based on CoDel. We can't drop
  * packets for IO scheduling, so the logic is something like this:
@@ -517,7 +518,7 @@
    rq_qos_wait(rqw, &data, wbt_inflight_cb, wbt_cleanup_cb);
 }

-static inline bool wbt_should_throttle(struct rq_wb *rwb, struct bio *bio)
+static inline bool wbt_should_throttle(struct bio *bio)
 {
    switch (bio_op(bio)) {
    case REQ_OP_WRITE:
@@ -527,7 +528,7 @@
        if ((bio->bi_opf & (REQ_SYNC | REQ_IDLE)) ==
            (REQ_SYNC | REQ_IDLE))
            return false;
-       /* fallthrough */
+       fallthrough;
    case REQ_OP_DISCARD:
        return true;
    default:
@@ -544,7 +545,7 @@

    if (bio_op(bio) == REQ_OP_READ) {
        flags = WBT_READ;
-   } else if (wbt_should_throttle(rwb, bio)) {
+   } else if (wbt_should_throttle(bio)) {
        if (current_is_kswapd())
            flags |= WBT_KSWAPD;
        if (bio_op(bio) == REQ_OP_DISCARD)
@@ -591,7 +592,7 @@
    rq->wbt_flags |= bio_to_wbt_flags(rwb, bio);
 }

-void wbt_issue(struct rq_qos *rqos, struct request *rq)
+static void wbt_issue(struct rq_qos *rqos, struct request *rq)
 {
    struct rq_wb *rwb = RQWB(rqos);

@@ -611,7 +612,7 @@
    }
 }

-void wbt_requeue(struct rq_qos *rqos, struct request *rq)
+static void wbt_requeue(struct rq_qos *rqos, struct request *rq)
 {
    struct rq_wb *rwb = RQWB(rqos);
    if (!rwb_enabled(rwb))
@@ -706,6 +707,94 @@
 }
 EXPORT_SYMBOL_GPL(wbt_disable_default);

+#ifdef CONFIG_BLK_DEBUG_FS
+static int wbt_curr_win_nsec_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%llu\n", rwb->cur_win_nsec);
+   return 0;
+}
+
+static int wbt_enabled_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%d\n", rwb->enable_state);
+   return 0;
+}
+
+static int wbt_id_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+
+   seq_printf(m, "%u\n", rqos->id);
+   return 0;
+}
+
+static int wbt_inflight_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+   int i;
+
+   for (i = 0; i < WBT_NUM_RWQ; i++)
+       seq_printf(m, "%d: inflight %d\n", i,
+              atomic_read(&rwb->rq_wait[i].inflight));
+   return 0;
+}
+
+static int wbt_min_lat_nsec_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%lu\n", rwb->min_lat_nsec);
+   return 0;
+}
+
+static int wbt_unknown_cnt_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%u\n", rwb->unknown_cnt);
+   return 0;
+}
+
+static int wbt_normal_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%u\n", rwb->wb_normal);
+   return 0;
+}
+
+static int wbt_background_show(void *data, struct seq_file *m)
+{
+   struct rq_qos *rqos = data;
+   struct rq_wb *rwb = RQWB(rqos);
+
+   seq_printf(m, "%u\n", rwb->wb_background);
+   return 0;
+}
+
+static const struct blk_mq_debugfs_attr wbt_debugfs_attrs[] = {
+   {"curr_win_nsec", 0400, wbt_curr_win_nsec_show},
+   {"enabled", 0400, wbt_enabled_show},
+   {"id", 0400, wbt_id_show},
+   {"inflight", 0400, wbt_inflight_show},
+   {"min_lat_nsec", 0400, wbt_min_lat_nsec_show},
+   {"unknown_cnt", 0400, wbt_unknown_cnt_show},
+   {"wb_normal", 0400, wbt_normal_show},
+   {"wb_background", 0400, wbt_background_show},
+   {},
+};
+#endif
+
 static struct rq_qos_ops wbt_rqos_ops = {
    .throttle = wbt_wait,
    .issue = wbt_issue,
@@ -715,6 +804,9 @@
    .cleanup = wbt_cleanup,
    .queue_depth_changed = wbt_queue_depth_changed,
    .exit = wbt_exit,
+#ifdef CONFIG_BLK_DEBUG_FS
+   .debugfs_attrs = wbt_debugfs_attrs,
+#endif
 };

 int wbt_init(struct request_queue *q)
@@ -743,7 +835,6 @@
    rwb->enable_state = WBT_STATE_ON_DEFAULT;
    rwb->wc = 1;
    rwb->rq_depth.default_depth = RWB_DEF_DEPTH;
-   wbt_update_limits(rwb);

    /*
     * Assign rwb and add the stats callback.

The rq_qos_wait function lives in blk-rq-qos.c.

And there are no functional differences between what RH ships and the current 5.13.0 rc:

--- 4.18.0-305.3.1.el8_4/blk-rq-qos.c   2021-06-12 21:04:17.186311731 +0000
+++ 5.13.0-rc5/blk-rq-qos.c 2021-06-12 21:04:33.000000000 +0000
@@ -1,3 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+
 #include "blk-rq-qos.h"

 /*
@@ -230,9 +232,10 @@

 /**
  * rq_qos_wait - throttle on a rqw if we need to
- * @private_data - caller provided specific data
- * @acquire_inflight_cb - inc the rqw->inflight counter if we can
- * @cleanup_cb - the callback to cleanup in case we race with a waker
+ * @rqw: rqw to throttle on
+ * @private_data: caller provided specific data
+ * @acquire_inflight_cb: inc the rqw->inflight counter if we can
+ * @cleanup_cb: the callback to cleanup in case we race with a waker
  *
  * This provides a uniform place for the rq_qos users to do their throttling.
  * Since you can end up with a lot of things sleeping at once, this manages the
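As an aside, the debugfs attributes added in the 5.13 diff above can be inspected at runtime when the hang occurs. A sketch (the device name is an assumption -- substitute the pool's backing device; the directory only exists on 5.13+ kernels built with CONFIG_BLK_DEBUG_FS and with wbt active on that queue):

```shell
# Dump the wbt rq-qos state for one block device. Guarded so it is
# a harmless no-op on kernels without the debugfs attributes.
d=/sys/kernel/debug/block/sda/rqos/wbt   # hypothetical device name
if [ -d "$d" ]; then
    for f in enabled min_lat_nsec curr_win_nsec inflight; do
        printf '%s: ' "$f"
        cat "$d/$f"
    done
else
    echo "wbt debugfs attrs not present (older kernel or wbt inactive)"
fi
```

The "inflight" counters are the interesting ones here: a throttled writer stuck in wbt_wait implies the inflight count never drains back below the limit.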
omarkilani commented 3 years ago

I'm running one final test with the 4.18.0-305.3.1.el8_4.x86_64 kernel and the original pg_dump workload plus the fio walwriter simulation, this time with wbt_lat_usec = 0, and then... 🤷‍♂️
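For reference, disabling wbt per device amounts to writing 0 to the queue's wbt_lat_usec attribute. A minimal sketch (device glob is an assumption -- substitute the pool's backing devices; guarded so it is a no-op where the attribute is absent or not writable):

```shell
# Writing 0 to wbt_lat_usec disables writeback throttling on that queue.
# Requires root; skips devices where the attribute is missing/read-only.
for f in /sys/block/*/queue/wbt_lat_usec; do
    [ -w "$f" ] || continue
    echo 0 > "$f"
    echo "wbt disabled on $f"
done
```

Note this setting does not persist across reboots; a udev rule or boot script would be needed to make it permanent.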

omarkilani commented 3 years ago

So far the pg_dump has been running with wbt off for around 10 hours; it's close to finishing and no deadlocks have occurred.

I have the latest kernel (5.13.0-rc5) running on a different VM and I've added a bunch of logging output to wbt and rq-qos. I've been able to trigger the deadlock there:

[ 1475.672893] INFO: task txg_quiesce:6681 blocked for more than 122 seconds.
[ 1475.680673]       Tainted: P           O      5.13.0-rc5 #7
[ 1475.682332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.685370] task:txg_quiesce     state:D stack:    0 pid: 6681 ppid:     2 flags:0x00004000
[ 1475.688989] Call Trace:
[ 1475.690466]  __schedule+0x3cf/0x8e0
[ 1475.692841]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.696323]  schedule+0x40/0xb0
[ 1475.698835]  cv_wait_common+0x101/0x140 [spl]
[ 1475.702262]  ? finish_wait+0x80/0x80
[ 1475.705119]  __cv_wait+0x15/0x20 [spl]
[ 1475.707783]  txg_quiesce_thread+0x2ba/0x3b0 [zfs]
[ 1475.711653]  ? txg_sync_thread+0x4a0/0x4a0 [zfs]
[ 1475.714658]  ? __thread_exit+0x20/0x20 [spl]
[ 1475.716365]  thread_generic_wrapper+0x74/0x90 [spl]
[ 1475.718236]  kthread+0x126/0x140
[ 1475.719480]  ? kthread_park+0x90/0x90
[ 1475.720788]  ret_from_fork+0x22/0x30
[ 1475.722118] INFO: task fio:13639 blocked for more than 122 seconds.
[ 1475.724534]       Tainted: P           O      5.13.0-rc5 #7
[ 1475.726018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.727833] task:fio             state:D stack:    0 pid:13639 ppid: 13434 flags:0x00004000
[ 1475.729843] Call Trace:
[ 1475.730691]  __schedule+0x3cf/0x8e0
[ 1475.731839]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.733200]  schedule+0x40/0xb0
[ 1475.734163]  cv_wait_common+0x101/0x140 [spl]
[ 1475.735486]  ? finish_wait+0x80/0x80
[ 1475.736560]  __cv_wait+0x15/0x20 [spl]
[ 1475.737600]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.738974]  zil_commit+0x3d/0x60 [zfs]
[ 1475.740086]  zfs_fsync+0x77/0x100 [zfs]
[ 1475.741240]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.742378]  vfs_fsync_range+0x48/0x80
[ 1475.743440]  do_fsync+0x3d/0x70
[ 1475.744425]  __x64_sys_fdatasync+0x17/0x20
[ 1475.745565]  do_syscall_64+0x3c/0x80
[ 1475.746650]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.748052] RIP: 0033:0x7fad4cdd356f
[ 1475.749104] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.750955] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.752763] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.754491] RBP: 00007face2e9a000 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.756188] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.757927] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.759700] INFO: task fio:13640 blocked for more than 122 seconds.
[ 1475.761301]       Tainted: P           O      5.13.0-rc5 #7
[ 1475.763130] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.764990] task:fio             state:D stack:    0 pid:13640 ppid: 13434 flags:0x00004000
[ 1475.766963] Call Trace:
[ 1475.767713]  __schedule+0x3cf/0x8e0
[ 1475.768690]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.769920]  schedule+0x40/0xb0
[ 1475.770850]  cv_wait_common+0x101/0x140 [spl]
[ 1475.772018]  ? finish_wait+0x80/0x80
[ 1475.773084]  __cv_wait+0x15/0x20 [spl]
[ 1475.774106]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.775296]  zil_commit+0x3d/0x60 [zfs]
[ 1475.776406]  zfs_fsync+0x77/0x100 [zfs]
[ 1475.777528]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.778630]  vfs_fsync_range+0x48/0x80
[ 1475.779607]  do_fsync+0x3d/0x70
[ 1475.780550]  __x64_sys_fdatasync+0x17/0x20
[ 1475.781573]  do_syscall_64+0x3c/0x80
[ 1475.782490]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.783782] RIP: 0033:0x7fad4cdd356f
[ 1475.784834] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.786703] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.788368] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.790020] RBP: 00007face2edea18 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.791750] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.793481] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.812159] INFO: task fio:13645 blocked for more than 123 seconds.
[ 1475.818468]       Tainted: P           O      5.13.0-rc5 #7
[ 1475.823467] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.832167] task:fio             state:D stack:    0 pid:13645 ppid: 13434 flags:0x00004000
[ 1475.841001] Call Trace:
[ 1475.842841]  __schedule+0x3cf/0x8e0
[ 1475.845170]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.850962]  schedule+0x40/0xb0
[ 1475.854814]  cv_wait_common+0x101/0x140 [spl]
[ 1475.859826]  ? finish_wait+0x80/0x80
[ 1475.863914]  __cv_wait+0x15/0x20 [spl]
[ 1475.868357]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.873168]  ? tsd_hash_search.isra.2+0x72/0xa0 [spl]
[ 1475.878508]  ? __cond_resched+0x19/0x30
[ 1475.882175]  zil_commit+0x3d/0x60 [zfs]
[ 1475.885621]  zfs_fsync+0x77/0x100 [zfs]
[ 1475.889968]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1475.893861]  vfs_fsync_range+0x48/0x80
[ 1475.898164]  do_fsync+0x3d/0x70
[ 1475.902002]  __x64_sys_fdatasync+0x17/0x20
[ 1475.906572]  do_syscall_64+0x3c/0x80
[ 1475.909937]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1475.915348] RIP: 0033:0x7fad4cdd356f
[ 1475.918692] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1475.925309] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1475.931657] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1475.938185] RBP: 00007face2f23430 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1475.940354] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1475.945659] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1475.953881] INFO: task fio:13650 blocked for more than 123 seconds.
[ 1475.959671]       Tainted: P           O      5.13.0-rc5 #7
[ 1475.961244] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1475.966900] task:fio             state:D stack:    0 pid:13650 ppid: 13434 flags:0x00004000
[ 1475.972529] Call Trace:
[ 1475.974065]  __schedule+0x3cf/0x8e0
[ 1475.976477]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1475.980662]  schedule+0x40/0xb0
[ 1475.982984]  cv_wait_common+0x101/0x140 [spl]
[ 1475.987037]  ? finish_wait+0x80/0x80
[ 1475.990871]  __cv_wait+0x15/0x20 [spl]
[ 1475.993905]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1475.996840]  zil_commit+0x3d/0x60 [zfs]
[ 1475.999259]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.001682]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.004030]  vfs_fsync_range+0x48/0x80
[ 1476.007726]  do_fsync+0x3d/0x70
[ 1476.011257]  __x64_sys_fdatasync+0x17/0x20
[ 1476.015551]  do_syscall_64+0x3c/0x80
[ 1476.018730]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.022309] RIP: 0033:0x7fad4cdd356f
[ 1476.026080] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.030410] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.035633] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.041674] RBP: 00007face2f67e48 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.045831] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.050323] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.056931] INFO: task fio:13651 blocked for more than 123 seconds.
[ 1476.063274]       Tainted: P           O      5.13.0-rc5 #7
[ 1476.070005] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.076462] task:fio             state:D stack:    0 pid:13651 ppid: 13434 flags:0x00004000
[ 1476.078773] Call Trace:
[ 1476.079938]  __schedule+0x3cf/0x8e0
[ 1476.081461]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.082981]  schedule+0x40/0xb0
[ 1476.084131]  cv_wait_common+0x101/0x140 [spl]
[ 1476.085515]  ? finish_wait+0x80/0x80
[ 1476.086753]  __cv_wait+0x15/0x20 [spl]
[ 1476.088071]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.089910]  zil_commit+0x3d/0x60 [zfs]
[ 1476.091653]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.092929]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.094341]  vfs_fsync_range+0x48/0x80
[ 1476.096293]  do_fsync+0x3d/0x70
[ 1476.097406]  __x64_sys_fdatasync+0x17/0x20
[ 1476.098626]  do_syscall_64+0x3c/0x80
[ 1476.100928]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.103333] RIP: 0033:0x7fad4cdd356f
[ 1476.105919] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.110575] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.115372] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.120525] RBP: 00007face2fac860 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.125132] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.130004] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.135627] INFO: task fio:13653 blocked for more than 123 seconds.
[ 1476.140045]       Tainted: P           O      5.13.0-rc5 #7
[ 1476.143808] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.148478] task:fio             state:D stack:    0 pid:13653 ppid: 13434 flags:0x00004000
[ 1476.154143] Call Trace:
[ 1476.156033]  __schedule+0x3cf/0x8e0
[ 1476.158125]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.160982]  schedule+0x40/0xb0
[ 1476.163165]  cv_wait_common+0x101/0x140 [spl]
[ 1476.165595]  ? finish_wait+0x80/0x80
[ 1476.167565]  __cv_wait+0x15/0x20 [spl]
[ 1476.169359]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.170756]  zil_commit+0x3d/0x60 [zfs]
[ 1476.171867]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.173308]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.174497]  vfs_fsync_range+0x48/0x80
[ 1476.175470]  do_fsync+0x3d/0x70
[ 1476.176484]  __x64_sys_fdatasync+0x17/0x20
[ 1476.177618]  do_syscall_64+0x3c/0x80
[ 1476.178764]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.180025] RIP: 0033:0x7fad4cdd356f
[ 1476.181390] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.183251] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.184992] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.187140] RBP: 00007face2ff1278 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.188884] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.190732] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.192817] INFO: task fio:13654 blocked for more than 123 seconds.
[ 1476.194614]       Tainted: P           O      5.13.0-rc5 #7
[ 1476.196167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.198192] task:fio             state:D stack:    0 pid:13654 ppid: 13434 flags:0x00004000
[ 1476.200497] Call Trace:
[ 1476.201321]  __schedule+0x3cf/0x8e0
[ 1476.202299]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.203785]  schedule+0x40/0xb0
[ 1476.204847]  cv_wait_common+0x101/0x140 [spl]
[ 1476.206131]  ? finish_wait+0x80/0x80
[ 1476.207393]  __cv_wait+0x15/0x20 [spl]
[ 1476.208566]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.209991]  zil_commit+0x3d/0x60 [zfs]
[ 1476.211291]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.212645]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.213857]  vfs_fsync_range+0x48/0x80
[ 1476.214939]  do_fsync+0x3d/0x70
[ 1476.215984]  __x64_sys_fdatasync+0x17/0x20
[ 1476.217298]  do_syscall_64+0x3c/0x80
[ 1476.218293]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.219782] RIP: 0033:0x7fad4cdd356f
[ 1476.220794] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.222676] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.224484] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.226342] RBP: 00007face3035c90 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.228632] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.230491] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.232488] INFO: task fio:13656 blocked for more than 123 seconds.
[ 1476.234220]       Tainted: P           O      5.13.0-rc5 #7
[ 1476.235769] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.238163] task:fio             state:D stack:    0 pid:13656 ppid: 13434 flags:0x00004000
[ 1476.240524] Call Trace:
[ 1476.241474]  __schedule+0x3cf/0x8e0
[ 1476.242773]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.244450]  schedule+0x40/0xb0
[ 1476.245587]  cv_wait_common+0x101/0x140 [spl]
[ 1476.246806]  ? finish_wait+0x80/0x80
[ 1476.248046]  __cv_wait+0x15/0x20 [spl]
[ 1476.249366]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.251020]  zil_commit+0x3d/0x60 [zfs]
[ 1476.252354]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.253753]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.255040]  vfs_fsync_range+0x48/0x80
[ 1476.256043]  do_fsync+0x3d/0x70
[ 1476.257211]  __x64_sys_fdatasync+0x17/0x20
[ 1476.258575]  do_syscall_64+0x3c/0x80
[ 1476.259608]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.260953] RIP: 0033:0x7fad4cdd356f
[ 1476.261969] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.264344] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.266401] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.268398] RBP: 00007face307a6a8 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.270290] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.272072] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8
[ 1476.273833] INFO: task fio:13662 blocked for more than 123 seconds.
[ 1476.275345]       Tainted: P           O      5.13.0-rc5 #7
[ 1476.276861] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1476.278787] task:fio             state:D stack:    0 pid:13662 ppid: 13434 flags:0x00004000
[ 1476.280830] Call Trace:
[ 1476.281685]  __schedule+0x3cf/0x8e0
[ 1476.282697]  ? _raw_spin_unlock_irqrestore+0xe/0x30
[ 1476.284090]  schedule+0x40/0xb0
[ 1476.285184]  cv_wait_common+0x101/0x140 [spl]
[ 1476.286497]  ? finish_wait+0x80/0x80
[ 1476.287544]  __cv_wait+0x15/0x20 [spl]
[ 1476.288690]  zil_commit_impl+0x241/0xe40 [zfs]
[ 1476.289994]  zil_commit+0x3d/0x60 [zfs]
[ 1476.291166]  zfs_fsync+0x77/0x100 [zfs]
[ 1476.292621]  zpl_fsync+0x6c/0xa0 [zfs]
[ 1476.294031]  vfs_fsync_range+0x48/0x80
[ 1476.295083]  do_fsync+0x3d/0x70
[ 1476.296058]  __x64_sys_fdatasync+0x17/0x20
[ 1476.297157]  do_syscall_64+0x3c/0x80
[ 1476.298260]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 1476.299595] RIP: 0033:0x7fad4cdd356f
[ 1476.300724] RSP: 002b:00007fff2b050f60 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
[ 1476.302624] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fad4cdd356f
[ 1476.304230] RDX: 0000000000000000 RSI: 0000555842f724c0 RDI: 0000000000000005
[ 1476.306194] RBP: 00007face30bf0c0 R08: 00007fff2b0c4170 R09: 0000000000000000
[ 1476.308274] R10: 00007fff2b0c41b0 R11: 0000000000000293 R12: 0000000000000004
[ 1476.309945] R13: 0000555842f724c0 R14: 0000000000000000 R15: 0000555842f724e8

It's a little different because there's no z_wr_iss_h backtrace, but I can see that it's hung in ps.

I'm going to study my wbt/rq-qos logging output and see if there's anything noteworthy in there. Not really sure what I'm looking for yet, and I might try to repro on xfs.
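As an aside for anyone triaging a similar hang: tasks stuck in uninterruptible sleep show up in plain `ps` output even when strace and friends can't attach. A generic sketch, not specific to this bug:

```shell
# List tasks in uninterruptible sleep (state D) together with the kernel
# function they are blocked in (wchan). Tasks parked in wbt_wait or
# cv_wait_common would corroborate the backtraces above.
ps -eo pid,stat,wchan:32,comm --no-headers | awk '$2 ~ /^D/'
```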

omarkilani commented 3 years ago

The pg_dump completed successfully on 4.18.0-305.3.1.el8_4.x86_64 with wbt turned off. dmesg was empty for the ~18 hours it took to complete:

[  996.501515] hrtimer: interrupt took 9638 ns
[root@instance-20210606-1831 ~]# du -sh /dump/pg3/
577G    /dump/pg3/
[root@instance-20210606-1831 ~]# uptime
 15:54:11 up 21:52,  1 user,  load average: 0.02, 0.03, 0.01

I was able to reproduce the deadlock with wbt on 5.13.0-rc5:

https://gist.githubusercontent.com/omarkilani/ff0a96d872e09b4fb648272d104e0053/raw/d3da3974162f8aa87b7309317af80929fadf250f/dmesg.wbt.deadlock.log

So I've posted about this issue over on linux-block:

https://lore.kernel.org/linux-block/CA+8F9hggf7jOcGRxvBoa8FYxQs8ZV+XueVAd9BodpQQP_+8Pdw@mail.gmail.com/T/

I guess we'll see. :)

omarkilani commented 3 years ago

Just in case anyone stumbles across this issue, it has been fixed and queued for 5.14:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.14/block&id=11c7aa0ddea8611007768d3e6b58d45dc60a19e1

Unrelated to ZFS in the end, though ZFS should consider implementing its own write-throttling system, since it has far more information about its activity than it exposes to the Linux block layer.

When I ran my pg_dump while running fio for writes, it took more than 3x longer with wbt off than with it on (18 hours vs. 5 hours).

rincebrain commented 3 years ago

IMO you should leave this open, as it's not necessarily the case that the commit in question is guaranteed to get backported everywhere, unless someone in charge says "nah we can just tell them to run a not-broken kernel even if their distro doesn't package it".

I believe ZFS does have some degree of scheduling and backoff. I'm sure it could be refined, given the data you're sharing.
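For reference, the write throttle rincebrain mentions is tunable via module parameters on ZFS on Linux; parameter names are as documented in the zfs-module-parameters man page, though defaults vary by release. A quick way to inspect the relevant knobs:

```shell
# ZFS delays (and eventually blocks) writers as outstanding dirty data
# approaches zfs_dirty_data_max; zfs_delay_min_dirty_percent sets where
# the delay kicks in and zfs_delay_scale how steeply it ramps up.
grep . /sys/module/zfs/parameters/zfs_dirty_data_max \
       /sys/module/zfs/parameters/zfs_delay_min_dirty_percent \
       /sys/module/zfs/parameters/zfs_delay_scale
```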

omarkilani commented 3 years ago

Alright.

I’m not clear on how patches get backported to older mainline kernel versions, but this one breaks the module ABI, so it’ll be interesting to see what Red Hat does.

Without the patch, any Linux system in existence could deadlock on writes essentially randomly, albeit rarely. So that’s fun.

rincebrain commented 3 years ago

(I believe) RH can and will backport absolutely anything if they feel strongly enough about it (and if their customer systems start hanging, I imagine they will). I am no longer clear on the timetables between RHEL 8, CentOS 8, and CentOS Stream 8 kernels, but since RHEL only releases their deltas against mainline as massive blobs now, I can't see CentOS not including it if RHEL does. God only knows what OEL will do.

I believe other distros tend to be more limited in what they'll grab - but that patch doesn't look really invasive, to me, so I imagine if someone reports the issue happening on a distro's kernel, they might well pull it.

kernelOfTruth commented 3 years ago

> Alright.
>
> I’m not clear on how patches get back-propagated to the other Linux kernel versions in the main line kernels but this one breaks the module ABI, so it’ll be interesting to see what RedHat does.
>
> Without the patch any Linux system in existence could deadlock on writes essentially randomly, albeit rarely. So that’s fun.

youzhongyang at https://github.com/openzfs/zfs/issues/12085#issuecomment-844474354 suggested that setting wbt_lat_usec to 0 might potentially mitigate the issue for the time being

> We hope by turning wbt off (/sys/block/*/queue/wbt_lat_usec set to 0), it will never happen again. Time will tell if it helps.

I'll give it a try as well

for file in /sys/block/*/queue/wbt_lat_usec; do echo 0 > "$file"; done

should do it
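To confirm the setting took, and, assuming your distro honors udev rules for queue attributes, to make it persist across reboots (the rule filename is arbitrary, only the .rules suffix matters):

```shell
# Verify: every block device should now report 0
grep . /sys/block/*/queue/wbt_lat_usec

# Persist across reboots with a udev rule; run `udevadm control --reload`
# and `udevadm trigger -s block` (or reboot) for it to take effect.
cat > /etc/udev/rules.d/99-disable-wbt.rules <<'EOF'
ACTION=="add|change", SUBSYSTEM=="block", ATTR{queue/wbt_lat_usec}="0"
EOF
```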

rincebrain commented 3 years ago

If you look 20+ comments up, you'll see that the entire recent discussion, in which they disclose the impact of changing that setting and ultimately point to the bug involving the WBT, was started with a link to that comment.

So yes, that's been covered.

omarkilani commented 3 years ago

FWIW, the wbt fix has been merged into Linus' tree:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/diff/block/blk-rq-qos.c?id=df668a5fe461bb9d7e899c538acc7197746038f4

Hopefully this is something people can advocate to have included in distro kernels.

stale[bot] commented 2 years ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.