openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.68k stars 1.76k forks source link

l2arc_mfuonly_pos: VERIFY3(tx->tx_txg <= spa_final_dirty_txg(os->os_spa)) failed (64 <= 63) #16714

Closed tonyhutter closed 5 days ago

tonyhutter commented 3 weeks ago

System information

Type Version/Name
Distribution Name Almalinux
Distribution Version 9
Kernel Version 5.14
Architecture x86-64
OpenZFS Version master

Describe the problem you're observing

QEMU Almalinux 9 runner reported l2arc_mfuonly_pos as KILLED with this panic:

  [ 3323.831973] ZTS run /usr/share/zfs/zfs-tests/tests/functional/l2arc/l2arc_arcstats_pos
  [ 3353.615697] ZTS run /usr/share/zfs/zfs-tests/tests/functional/l2arc/l2arc_mfuonly_pos
  [ 3367.508166] VERIFY3(tx->tx_txg <= spa_final_dirty_txg(os->os_spa)) failed (64 <= 63)
  [ 3367.510359] PANIC at dbuf.c:2324:dbuf_dirty()
  [ 3367.511813] Showing stack for process 195727
  [ 3367.513288] CPU: 1 PID: 195727 Comm: txg_sync Tainted: P           OE     -------  ---  5.14.0-427.40.1.el9_4.x86_64 #1
  [ 3367.516585] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
  [ 3367.519458] Call Trace:
  [ 3367.520379]  <TASK>
  [ 3367.521181]  dump_stack_lvl+0x34/0x48
  [ 3367.522539]  spl_panic+0xd1/0xe9 [spl]
  [ 3367.524011]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.525672]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.527227]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.528820]  ? dbuf_rele_and_unlock+0x267/0x500 [zfs]
  [ 3367.533862]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.535528]  dbuf_dirty+0xd68/0x12f0 [zfs]
  [ 3367.537248]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.538917]  spa_history_log_sync+0x100/0x5d0 [zfs]
  [ 3367.540816]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.542488]  ? list_head+0x9/0x30 [zfs]
  [ 3367.544072]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.545802]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.547495]  ? cv_destroy_wakeup+0x4a/0x90 [spl]
  [ 3367.549141]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.550850]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.552511]  ? kmem_cache_free+0x15/0x360
  [ 3367.554043]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.555756]  ? srso_alias_return_thunk+0x5/0xfbef5
  [ 3367.557446]  dsl_sync_task_sync+0xb5/0x120 [zfs]
  [ 3367.559274]  dsl_pool_sync+0x4ba/0x690 [zfs]
  [ 3367.560973]  spa_sync_iterate_to_convergence+0xd7/0x320 [zfs]
  [ 3367.563611]  spa_sync+0x319/0x920 [zfs]
  [ 3367.565699]  txg_sync_thread+0x27b/0x3d0 [zfs]
  [ 3367.567201]  ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
  [ 3367.568956]  ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
  [ 3367.570910]  thread_generic_wrapper+0x62/0x90 [spl]
  [ 3367.572615]  kthread+0xe0/0x100
  [ 3367.573889]  ? __pfx_kthread+0x10/0x10
  [ 3367.575225]  ret_from_fork+0x2c/0x50
Test: /usr/share/zfs/zfs-tests/tests/functional/l2arc/l2arc_mfuonly_pos (run as root) [10:00] [KILLED]
18:31:46.29 ASSERTION: l2arc_mfuonly does not cache MRU buffers.
18:31:46.29 SUCCESS: set_tunable32 L2ARC_NOPREFETCH 1
18:31:46.30 SUCCESS: set_tunable32 L2ARC_MFUONLY 1
18:31:46.31 SUCCESS: set_tunable32 PREFETCH_DISABLE 1
18:31:46.31 SUCCESS: truncate -s 1120M /var/tmp/testdir/disk.l2arc/b
18:31:46.44 SUCCESS: zpool create -f testpool /var/tmp/testdir/disk.l2arc/a cache /var/tmp/testdir/disk.l2arc/b
18:31:46.53 job: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=psync, iodepth=1
18:31:46.53 ...
18:31:46.53 fio-3.35
18:31:46.53 Starting 4 threads
18:31:46.53 job: Laying out IO file (1 file / 200MiB)
18:31:46.53 job: Laying out IO file (1 file / 200MiB)
18:31:46.53 job: Laying out IO file (1 file / 200MiB)
18:31:46.53 job: Laying out IO file (1 file / 200MiB)
18:31:49.26 
18:31:49.26 job: (groupid=0, jobs=4): err= 0: pid=195631: Wed Oct 30 18:31:49 2024
18:31:49.26   write: IOPS=307, BW=308MiB/s (323MB/s)(800MiB/2600msec); 0 zone resets
18:31:49.26     clat (usec): min=337, max=435441, avg=11256.04, stdev=33683.62
18:31:49.26      lat (usec): min=338, max=435441, avg=11258.68, stdev=33683.34
18:31:49.26     clat percentiles (usec):
18:31:49.26      |  1.00th=[   343],  5.00th=[   379], 10.00th=[   396], 20.00th=[   424],
18:31:49.26      | 30.00th=[   457], 40.00th=[   482], 50.00th=[   529], 60.00th=[   627],
18:31:49.26      | 70.00th=[   873], 80.00th=[  2212], 90.00th=[ 47449], 95.00th=[ 77071],
18:31:49.26      | 99.00th=[127402], 99.50th=[170918], 99.90th=[434111], 99.95th=[434111],
18:31:49.26      | 99.99th=[434111]
18:31:49.26    bw (  KiB/s): min=167598, max=547588, per=100.00%, avg=355918.33, stdev=37485.32, samples=17
18:31:49.26    iops        : min=  160, max=  532, avg=345.07, stdev=36.68, samples=17
18:31:49.26   lat (usec)   : 500=44.62%, 750=22.00%, 1000=4.88%
18:31:49.26   lat (msec)   : 2=8.12%, 4=2.88%, 10=3.12%, 20=1.62%, 50=3.12%
18:31:49.26   lat (msec)   : 100=7.00%, 250=2.25%, 500=0.38%
18:31:49.26   cpu          : usr=1.33%, sys=7.42%, ctx=608, majf=0, minf=0
18:31:49.26   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
18:31:49.26      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
18:31:49.26      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
18:31:49.26      issued rwts: total=0,800,0,0 short=0,0,0,0 dropped=0,0,0,0
18:31:49.26      latency   : target=0, window=0, percentile=100.00%, depth=1
18:31:49.26 
18:31:49.26 Run status group 0 (all jobs):
18:31:49.26   WRITE: bw=308MiB/s (323MB/s), 308MiB/s-308MiB/s (323MB/s-323MB/s), io=800MiB (839MB), run=2600-2600msec
18:31:49.27 SUCCESS: fio /usr/share/zfs/zfs-tests/tests/perf/fio/mkfiles.fio
18:31:49.35 job: (g=0): rw=randread, bs=(R) 128KiB-128KiB, (W) 128KiB-128KiB, (T) 128KiB-128KiB, ioengine=psync, iodepth=1
18:31:49.35 ...
18:31:49.35 fio-3.35
18:31:49.35 Starting 4 threads
18:31:59.51 
18:31:59.51 job: (groupid=0, jobs=4): err= 0: pid=195640: Wed Oct 30 18:31:59 2024
18:31:59.51   read: IOPS=20.0k, BW=2500MiB/s (2621MB/s)(24.4GiB/10001msec)
18:31:59.51     clat (usec): min=13, max=597270, avg=195.23, stdev=2346.76
18:31:59.51      lat (usec): min=13, max=597270, avg=195.57, stdev=2347.39
18:31:59.51     clat percentiles (usec):
18:31:59.51      |  1.00th=[   40],  5.00th=[   48], 10.00th=[   56], 20.00th=[   59],
18:31:59.51      | 30.00th=[   61], 40.00th=[   63], 50.00th=[   64], 60.00th=[   65],
18:31:59.51      | 70.00th=[   67], 80.00th=[  105], 90.00th=[  133], 95.00th=[  161],
18:31:59.51      | 99.00th=[ 2409], 99.50th=[ 7111], 99.90th=[15270], 99.95th=[27395],
18:31:59.51      | 99.99th=[60556]
18:31:59.51    bw (  MiB/s): min=  195, max= 4063, per=100.00%, avg=2502.80, stdev=273.59, samples=76
18:31:59.51    iops        : min= 1565, max=32504, avg=20021.89, stdev=2188.90, samples=76
18:31:59.51   lat (usec)   : 20=0.51%, 50=5.79%, 100=73.51%, 250=17.34%, 500=0.45%
18:31:59.51   lat (usec)   : 750=0.13%, 1000=0.06%
18:31:59.51   lat (msec)   : 2=1.04%, 4=0.31%, 10=0.59%, 20=0.19%, 50=0.05%
18:31:59.51   lat (msec)   : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
18:31:59.51   cpu          : usr=1.33%, sys=42.31%, ctx=5993, majf=0, minf=124
18:31:59.51   IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
18:31:59.51      submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
18:31:59.51      complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
18:31:59.51      issued rwts: total=199981,0,0,0 short=0,0,0,0 dropped=0,0,0,0
18:31:59.51      latency   : target=0, window=0, percentile=100.00%, depth=1
18:31:59.51 
18:31:59.51 Run status group 0 (all jobs):
18:31:59.51    READ: bw=2500MiB/s (2621MB/s), 2500MiB/s-2500MiB/s (2621MB/s-2621MB/s), io=24.4GiB (26.2GB), run=10001-10001msec
18:31:59.51 SUCCESS: fio /usr/share/zfs/zfs-tests/tests/perf/fio/random_reads.fio
18:31:59.64 SUCCESS: zpool export testpool
18:32:00.08 SUCCESS: zpool import -N -d /var/tmp/testdir/disk.l2arc testpool
18:32:00.09 SUCCESS: test 0 -eq 0
Test: /usr/share/zfs/zfs-tests/tests/functional/l2arc/l2arc_l2miss_pos (run as root) [10:00] [KILLED]

Describe how to reproduce the problem

Saw this as a test result while testing an unrelated PR https://github.com/tonyhutter/zfs/actions/runs/11598165017/job/32293419930?pr=3

Include any warning/errors/backtraces from the system logs

amotin commented 1 week ago

I think this problem goes from L2ARC rebuild not being stopped by spa_export_common() before setting spa_final_txg, which with some luck may log its completion between that time and spa_l2cache_drop() call by spa_unload().

amotin commented 1 week ago

I've intercepted the log message that causes it, confirming my guess:

VERIFY3(tx->tx_txg <= spa_final_dirty_txg(spa)) failed (64 <= 63) L2ARC rebuild successful, restored %llu blocks