Open ghoshdipak opened 3 years ago
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
We are still experiencing the same issue with OpenZFS 2.1.4
Thanks for retesting this with OpenZFS 2.1.4, I'll look into what's exactly going on here. Somewhat related to this, you might want to look at PR #13499, which slightly changes the ZED behavior. That fix was queued up for 2.1.5.
The same issue is seen when tested with PR #13499: zpool_clear.txt
@akashb-22 would you mind verifying that the fix in #13555 resolves this issue?
Errors are still reported by zpool status -v, but the actual data is correct (verified using md5sum).

Thanks for the quick turnaround. I see what's going on with cases 2 and 3 and will work on updating the PR to take a slightly different approach.
I've updated the PR. Would you mind testing the updated patch?
Case 1: Damage to a single vdev is working fine with the fix.
Case 2: Damage to 2 child vdevs on draid2, done multiple times along with zpool clear, is working fine with the fix.
Case 3: Damage to 2 child vdevs, one after the other, on draid2 + 2 dspares results in all the child vdevs going to the DEGRADED state, which eventually leads to data loss. However, a zpool clear followed by a scrub brings the zpool back to the ONLINE state, and all the data files are present in the filesystem with no loss. This case works fine with raidz2.
Also, this error is seen momentarily from zpool status during the tests (case 3):
munmap_chunk(): invalid pointer
Aborted (core dumped)
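For reference, the case-3 sequence above can be sketched roughly as follows. This is a hypothetical outline, not the exact test script; the pool name and backing-file paths are placeholders taken from the repro steps later in this issue, and it assumes a draid2:2s pool backed by sparse files:

```shell
# Rough sketch of case 3 (placeholder names; requires an existing draid2+2s pool).
POOL=pool-oss0
FILES=/root/test/files

# Damage the first child vdev, then run IO so the failure is detected
# and a distributed spare rebuild kicks in.
dd if=/dev/zero of=$FILES/file1 bs=1M count=1024 oflag=direct
# ... run IO against the mounted filesystem here ...

# Damage a second child vdev after the first rebuild has engaged.
dd if=/dev/zero of=$FILES/file2 bs=1M count=1024 oflag=direct
# ... run IO again ...

zpool status -v "$POOL"   # child vdevs report DEGRADED at this point
zpool clear "$POOL"       # clear the errors
zpool scrub "$POOL"       # scrub; pool returns to ONLINE with data intact
```

The sketch cannot be run without a ZFS pool in place; it only illustrates the ordering of damage, clear, and scrub that distinguishes case 3 from cases 1 and 2.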
That third test case really is a nasty worst-case scenario. I was able to reproduce it locally, but it'll take me a while to get to a root cause. In the meantime, let's move forward with the current PR, which at least sorts out the first two test cases.
@behlendorf Another case is observed: after the rebuild, once a scrub is in progress, issuing a zpool clear on DEBUG builds panics at this point. However, I do not see any issue on normal (non-debug) builds. I have a testcase where it is 100% reproducible. This is seen on zfs-2.1-release, zfs-2.2-release, and master. Stack trace:
[Tue Feb 13 00:55:35 2024] VERIFY(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0 || (spa_is_root(os->os_spa) && spa_config_held(os->os_spa, SCL_STATE, RW_WRITER))) failed
[Tue Feb 13 00:55:35 2024] PANIC at dnode.c:1422:dnode_hold_impl()
[Tue Feb 13 00:55:35 2024] Showing stack for process 186367
[Tue Feb 13 00:55:35 2024] CPU: 4 PID: 186367 Comm: zpool Kdump: loaded Tainted: P OE --------- - - 4.18.0-372.9.1.el8.x86_64 #1
[Tue Feb 13 00:55:35 2024] Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.10.2-1ubuntu1 04/01/2014
[Tue Feb 13 00:55:35 2024] Call Trace:
[Tue Feb 13 00:55:35 2024] dump_stack+0x41/0x60
[Tue Feb 13 00:55:35 2024] spl_panic+0xd0/0xe8 [spl]
[Tue Feb 13 00:55:35 2024] ? virtqueue_add+0xace/0xb40
[Tue Feb 13 00:55:35 2024] dnode_hold_impl+0x755/0x11c0 [zfs]
[Tue Feb 13 00:55:35 2024] ? cpumask_next_and+0x1a/0x20
[Tue Feb 13 00:55:35 2024] ? update_sd_lb_stats.constprop.121+0xca/0x830
[Tue Feb 13 00:55:35 2024] dmu_buf_hold_noread+0x37/0x140 [zfs]
[Tue Feb 13 00:55:35 2024] dmu_buf_hold+0x37/0x80 [zfs]
[Tue Feb 13 00:55:35 2024] zap_lockdir+0x54/0x110 [zfs]
[Tue Feb 13 00:55:35 2024] ? find_busiest_group+0x41/0x360
[Tue Feb 13 00:55:35 2024] zap_lookup_norm+0x5d/0xd0 [zfs]
[Tue Feb 13 00:55:35 2024] zap_lookup+0x12/0x20 [zfs]
[Tue Feb 13 00:55:35 2024] zfs_dirent_lock+0x5b5/0x7c0 [zfs]
[Tue Feb 13 00:55:35 2024] zfs_dirlook+0x8a/0x300 [zfs]
[Tue Feb 13 00:55:35 2024] ? zfs_zaccess+0x21e/0x440 [zfs]
[Tue Feb 13 00:55:35 2024] zfs_lookup+0x24b/0x400 [zfs]
[Tue Feb 13 00:55:35 2024] zpl_lookup+0xc2/0x290 [zfs]
[Tue Feb 13 00:55:35 2024] __lookup_slow+0x97/0x150
[Tue Feb 13 00:55:35 2024] lookup_slow+0x35/0x50
[Tue Feb 13 00:55:35 2024] walk_component+0x1bf/0x2f0
[Tue Feb 13 00:55:35 2024] ? legitimize_path.isra.42+0x2d/0x60
[Tue Feb 13 00:55:35 2024] path_lookupat.isra.47+0x75/0x200
[Tue Feb 13 00:55:35 2024] ? filename_lookup.part.61+0xe0/0x170
[Tue Feb 13 00:55:35 2024] ? kmem_cache_free+0x116/0x300
[Tue Feb 13 00:55:35 2024] filename_lookup.part.61+0xa0/0x170
[Tue Feb 13 00:55:35 2024] ? __switch_to_asm+0x35/0x70
[Tue Feb 13 00:55:35 2024] ? getname_kernel+0x2c/0x100
[Tue Feb 13 00:55:35 2024] ? kmem_cache_alloc+0x13f/0x280
[Tue Feb 13 00:55:35 2024] ? vdev_draid_open_spares+0x30/0x30 [zfs]
[Tue Feb 13 00:55:35 2024] ? vdev_draid_open_spares+0x30/0x30 [zfs]
[Tue Feb 13 00:55:35 2024] lookup_bdev.part.46+0x28/0xa0
[Tue Feb 13 00:55:35 2024] zvol_is_zvol_impl+0xb/0x40 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_uses_zvols+0x18/0x70 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_uses_zvols+0x3d/0x70 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_uses_zvols+0x3d/0x70 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_open_children_impl+0xaa/0x120 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_draid_open+0x63/0x1f0 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_open+0xe4/0xa10 [zfs]
[Tue Feb 13 00:55:35 2024] ? mutex_lock+0xe/0x30
[Tue Feb 13 00:55:35 2024] vdev_reopen+0x4e/0x1f0 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_clear+0x188/0x2b0 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_clear+0x87/0x2b0 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_clear+0x87/0x2b0 [zfs]
[Tue Feb 13 00:55:35 2024] vdev_clear+0x87/0x2b0 [zfs]
[Tue Feb 13 00:55:35 2024] zfs_ioc_clear+0x18b/0x380 [zfs]
[Tue Feb 13 00:55:35 2024] ? strlcpy+0x2d/0x40
[Tue Feb 13 00:55:35 2024] zfsdev_ioctl_common+0x656/0x980 [zfs]
[Tue Feb 13 00:55:35 2024] ? __kmalloc_node+0x26e/0x2a0
[Tue Feb 13 00:55:35 2024] ? __handle_mm_fault+0x7a6/0x7e0
[Tue Feb 13 00:55:35 2024] zfsdev_ioctl+0x4f/0xe0 [zfs]
[Tue Feb 13 00:55:35 2024] do_vfs_ioctl+0xa4/0x680
[Tue Feb 13 00:55:35 2024] ? handle_mm_fault+0xc1/0x1e0
[Tue Feb 13 00:55:35 2024] ? syscall_trace_enter+0x1fb/0x2c0
[Tue Feb 13 00:55:35 2024] ksys_ioctl+0x60/0x90
[Tue Feb 13 00:55:35 2024] __x64_sys_ioctl+0x16/0x20
[Tue Feb 13 00:55:35 2024] do_syscall_64+0x5b/0x1a0
[Tue Feb 13 00:55:35 2024] entry_SYSCALL_64_after_hwframe+0x65/0xca
[Tue Feb 13 00:55:35 2024] RIP: 0033:0x7fea9f6836db
[Tue Feb 13 00:55:35 2024] Code: 73 01 c3 48 8b 0d ad 57 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 7d 57 38 00 f7 d8 64 89 01 48
[Tue Feb 13 00:55:35 2024] RSP: 002b:00007ffd5047ce08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Tue Feb 13 00:55:35 2024] RAX: ffffffffffffffda RBX: 00007ffd5047d250 RCX: 00007fea9f6836db
[Tue Feb 13 00:55:35 2024] RDX: 00007ffd5047d250 RSI: 0000000000005a21 RDI: 0000000000000003
[Tue Feb 13 00:55:35 2024] RBP: 00007ffd50480830 R08: 0000000000000040 R09: 0000000001b9721c
[Tue Feb 13 00:55:35 2024] R10: 000000000000001c R11: 0000000000000246 R12: 0000000001b5db70
[Tue Feb 13 00:55:35 2024] R13: 0000000000000000 R14: 0000000001b61bc0 R15: 00007ffd5047ce50
Further testing shows this issue is always hit (100%) with the above testcase on draid2 with 2 dspares.
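The sequence that trips the VERIFY on DEBUG builds can be sketched as follows. This is a rough outline of the race, not the exact testcase; the pool name is a placeholder, and it assumes a draid2 pool with 2 dspares where a child vdev has already been damaged and a dspare rebuild has started:

```shell
# Rough outline of the DEBUG-build panic (placeholder pool name).
POOL=pool-oss0

zpool wait -t resilver "$POOL"  # let the sequential rebuild finish; it is
                                # followed automatically by a verifying scrub
zpool scrub -w "$POOL" &        # (or wait for the post-rebuild scrub to start)

zpool clear "$POOL"             # issuing clear while the scrub is in progress:
                                # on DEBUG builds this trips the VERIFY in
                                # dnode_hold_impl() via vdev_clear -> vdev_reopen
```

Per the stack trace above, the clear path re-opens the dRAID children (vdev_reopen → vdev_draid_open → vdev_uses_zvols → lookup_bdev) while holding the SCL_ALL config locks as writer, which is what the dnode_hold_impl() assertion rejects.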
CentOS 8.2, kernel 4.18.0-193.28.1.x5.0.22.x86_64

modinfo zfs | grep -iw version
version:        2.1.0-x.5.0_520_ge8c59ac5
modinfo spl | grep -iw version
version:        2.1.0-x.5.0_520_ge8c59ac5
Describe the problem you're observing
Intentional damage (the entire device was zeroed out) was done to a single child vdev of a draid2 pool with 2 distributed spares. After running IO to the file system mounted on this pool, the pool became DEGRADED, as expected. A distributed spare was engaged and a resilver started. The subsequent scrub finished with zero bytes repaired. Issuing 'zpool clear' brings the DEGRADED pool and vdev back to the ONLINE state, but if we run IO or scrub again, the pool becomes DEGRADED again.
Describe how to reproduce the problem
echo 1 > /sys/module/zfs/parameters/zfs_prefetch_disable
cat /sys/module/zfs/parameters/zfs_prefetch_disable
1

truncate -s 1G file1 file2 file3 file4 file5 file6 file7 file8 file9 file10 file11 file12 file13 file14 file15
pwd
/root/test/files

zpool create -f -o cachefile=none -o failmode=panic -O canmount=off pool-oss0 draid2:11d:15c:2s \
    /root/test/files/file1 /root/test/files/file2 /root/test/files/file3 /root/test/files/file4 \
    /root/test/files/file5 /root/test/files/file6 /root/test/files/file7 /root/test/files/file8 \
    /root/test/files/file9 /root/test/files/file10 /root/test/files/file11 /root/test/files/file12 \
    /root/test/files/file13 /root/test/files/file14 /root/test/files/file15
zfs create -o mountpoint=/mnt/ost0 pool-oss0/ost0

[root@localhost files]# fio --name=test --ioengine=libaio --fallocate=none --iodepth=4 --rw=write --bs=1M --size=8G --numjobs=1 --allow_mounted_write=1 --do_verify=0 --verify=pattern --verify_pattern=0x0A0B0C0D --filename=/mnt/ost0/testfile --group_reporting
test: (g=0): rw=write, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
fio-3.7
Starting 1 process
test: Laying out IO file (1 file / 8192MiB)
Jobs: 1 (f=1): [W(1)][100.0%][r=0KiB/s,w=66.6MiB/s][r=0,w=66 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=10718: Wed May 19 22:26:15 2021
  write: IOPS=89, BW=89.0MiB/s (93.3MB/s)(8192MiB/92043msec)
    slat (usec): min=206, max=2843.9k, avg=11218.74, stdev=128346.34
    clat (usec): min=3, max=2848.3k, avg=33712.67, stdev=221020.14
     lat (usec): min=326, max=2852.2k, avg=44932.45, stdev=254463.05
    clat percentiles (usec):
     |  1.00th=[    701],  5.00th=[    750], 10.00th=[    791], 20.00th=[    865],
     | 30.00th=[    955], 40.00th=[   1029], 50.00th=[   1106], 60.00th=[   1221],
     | 70.00th=[   1418], 80.00th=[   2212], 90.00th=[   8979], 95.00th=[  16909],
     | 99.00th=[1434452], 99.50th=[1602225], 99.90th=[2499806], 99.95th=[2566915],
     | 99.99th=[2835350]
   bw (  KiB/s): min=14307, max=411212, per=100.00%, avg=230380.26, stdev=94722.67, samples=72
   iops        : min=   13, max=  401, avg=224.57, stdev=92.53, samples=72
  lat (usec)   : 4=0.01%, 500=0.01%, 750=5.02%, 1000=31.23%
  lat (msec)   : 2=42.64%, 4=4.21%, 10=8.28%, 20=4.36%, 50=1.94%
  lat (msec)   : 100=0.22%, 750=0.04%, 1000=0.29%
  cpu          : usr=0.07%, sys=3.70%, ctx=2300, majf=0, minf=13
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,8192,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  WRITE: bw=89.0MiB/s (93.3MB/s), 89.0MiB/s-89.0MiB/s (93.3MB/s-93.3MB/s), io=8192MiB (8590MB), run=92043-92043msec

[root@localhost files]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost files]# sync
[root@localhost files]# fio --name=test --ioengine=libaio --fallocate=none --iodepth=4 --rw=read --bs=1M --size=8G --numjobs=1 --allow_mounted_write=1 --do_verify=1 --verify=pattern --verify_pattern=0x0A0B0C0D --filename=/mnt/ost0/testfile --group_reporting
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [V(1)][100.0%][r=503MiB/s,w=0KiB/s][r=503,w=0 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=27966: Wed May 19 22:26:59 2021
  read: IOPS=353, BW=354MiB/s (371MB/s)(8192MiB/23161msec)
    slat (usec): min=650, max=502000, avg=2625.01, stdev=13294.92
    clat (usec): min=121, max=533083, avg=8483.66, stdev=23148.47
     lat (usec): min=1190, max=534382, avg=11111.37, stdev=26851.01
    clat percentiles (msec):
     |  1.00th=[    4],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    5], 40.00th=[    6], 50.00th=[    6], 60.00th=[    7],
     | 70.00th=[    8], 80.00th=[    9], 90.00th=[   12], 95.00th=[   16],
     | 99.00th=[   44], 99.50th=[  117], 99.90th=[  502], 99.95th=[  506],
     | 99.99th=[  535]
   bw (  KiB/s): min=39844, max=610304, per=99.63%, avg=360838.15, stdev=151393.14, samples=46
   iops        : min=   38, max=  596, avg=352.26, stdev=147.95, samples=46
  lat (usec)   : 250=0.01%
  lat (msec)   : 2=0.01%, 4=19.31%, 10=66.26%, 20=11.69%, 50=1.78%
  lat (msec)   : 100=0.37%, 250=0.34%, 500=0.10%, 750=0.12%
  cpu          : usr=5.66%, sys=39.62%, ctx=33035, majf=0, minf=525
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  READ: bw=354MiB/s (371MB/s), 354MiB/s-354MiB/s (371MB/s-371MB/s), io=8192MiB (8590MB), run=23161-23161msec

[root@localhost files]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost files]# dd if=/dev/zero of=file1 bs=1M count=1024 oflag=direct
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.812825 s, 1.3 GB/s
[root@localhost files]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost files]# echo 3 > /proc/sys/vm/drop_caches
[root@localhost files]# fio --name=test --ioengine=libaio --fallocate=none --iodepth=4 --rw=read --bs=1M --size=8G --numjobs=1 --allow_mounted_write=1 --do_verify=1 --verify=pattern --verify_pattern=0x0A0B0C0D --filename=/mnt/ost0/testfile --group_reporting
test: (g=0): rw=read, bs=(R) 1024KiB-1024KiB, (W) 1024KiB-1024KiB, (T) 1024KiB-1024KiB, ioengine=libaio, iodepth=4
fio-3.7
Starting 1 process
Jobs: 1 (f=1): [V(1)][100.0%][r=501MiB/s,w=0KiB/s][r=501,w=0 IOPS][eta 00m:00s]
test: (groupid=0, jobs=1): err= 0: pid=72740: Wed May 19 22:28:32 2021
  read: IOPS=184, BW=185MiB/s (194MB/s)(8192MiB/44367msec)
    slat (usec): min=548, max=501887, avg=5117.17, stdev=16143.65
    clat (usec): min=131, max=1124.1k, avg=16247.34, stdev=36912.74
     lat (usec): min=1040, max=1404.9k, avg=21368.78, stdev=45991.99
    clat percentiles (msec):
     |  1.00th=[    3],  5.00th=[    4], 10.00th=[    4], 20.00th=[    5],
     | 30.00th=[    6], 40.00th=[    7], 50.00th=[    8], 60.00th=[   11],
     | 70.00th=[   14], 80.00th=[   19], 90.00th=[   33], 95.00th=[   49],
     | 99.00th=[  138], 99.50th=[  218], 99.90th=[  506], 99.95th=[  726],
     | 99.99th=[ 1116]
   bw (  KiB/s): min= 2043, max=913848, per=99.76%, avg=188614.28, stdev=158911.91, samples=88
   iops        : min=    1, max=  892, avg=183.91, stdev=155.26, samples=88
  lat (usec)   : 250=0.01%
  lat (msec)   : 2=0.01%, 4=15.56%, 10=43.88%, 20=22.18%, 50=13.67%
  lat (msec)   : 100=3.20%, 250=1.15%, 500=0.18%, 750=0.10%, 1000=0.02%
  cpu          : usr=3.78%, sys=21.27%, ctx=30665, majf=0, minf=524
  IO depths    : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=8192,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=4

Run status group 0 (all jobs):
  READ: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=8192MiB (8590MB), run=44367-44367msec
[root@localhost ~]# zpool status -v 2
  pool: pool-oss0
 state: ONLINE
config:
errors: No known data errors
When there is no IO, the same status as above repeats, because the pool has not yet recognized the failure.
While running IO from another terminal, the following status was observed:
  pool: pool-oss0
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
config:
errors: No known data errors
..............................
  pool: pool-oss0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver (draid2:11d:15c:2s-0) in progress since Wed May 19 22:27:49 2021
        9.75G scanned at 714M/s, 8.31G issued at 608M/s, 9.75G total
        655M resilvered, 100.00% done, 00:00:00 to go
config:
errors: No known data errors
  pool: pool-oss0
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub in progress since Wed May 19 22:28:03 2021
        9.75G scanned at 4.88G/s, 855M issued at 428M/s, 9.75G total
        0B repaired, 8.56% done, 00:00:21 to go
  scan: resilvered (draid2:11d:15c:2s-0) 655M in 00:00:14 with 0 errors on Wed May 19 22:28:03 2021
config:
errors: No known data errors
[root@localhost ~]# zpool clear pool-oss0
[root@localhost ~]# zpool status -v
  pool: pool-oss0
 state: ONLINE
  scan: scrub repaired 0B in 00:00:20 with 0 errors on Wed May 19 22:28:23 2021
  scan: resilvered (draid2:11d:15c:2s-0) 655M in 00:00:14 with 0 errors on Wed May 19 22:28:03 2021
config:
errors: No known data errors
When IO was run from another terminal, the following status was observed:
[root@localhost ~]# zpool status -v
  pool: pool-oss0
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: scrub repaired 0B in 00:00:20 with 0 errors on Wed May 19 22:28:23 2021
  scan: resilver (draid2:11d:15c:2s-0) in progress since Wed May 19 22:29:00 2021
        2.21G scanned at 693M/s, 1.83G issued at 575M/s, 9.75G total
        100M resilvered, 22.65% done, 00:00:11 to go
config:
errors: No known data errors