openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Running ZFS 2.2.0 test suite causes kernel to hang (general protection fault) #15477

Closed: paul20230815 closed this issue 2 months ago

paul20230815 commented 12 months ago

System information

Type | Version/Name
Distribution Name | Ubuntu
Distribution Version | 22.04
Kernel Version | 5.15.0-71-generic, 6.1.59 (vanilla)
Architecture | amd64: TYAN B8036T65AV4E24HR, Supermicro H8DGT-H, Supermicro X8DTT-H
OpenZFS Version | 2.2.0 (2.1.5)

Describe the problem you're observing

I had no problem building the Debian packages from the 2.2.0 release code. I installed the dkms and userland packages after removing the Ubuntu-distributed ones. With kernel modules and userland both on version 2.2.0, I ran the test suite as documented on a Tyan Epyc system (using the Ubuntu 5.15.0-71-generic kernel). The first run succeeded with a few unexpected FAILs. I decided to test again. The second run did not finish but hung after causing a general protection fault.

In order to remove Ubuntu specialties from the equation I built a vanilla kernel 6.1.59 using the Ubuntu config and installed ZFS 2.2.0. Now it took three runs of the test suite (Epyc_6.1.59-20231023_zfs-2.2.0-run-1.txt, Epyc_6.1.59-20231023_zfs-2.2.0-run-2.txt, Epyc_6.1.59-20231023_zfs-2.2.0-run-3.txt) to hang the system with a general protection fault. Comparing the results of the first and second runs, the number of FAIL tests actually increased (huh).

Since this is a new platform, I repeated the tests on well-proven hardware in my lab and could reproduce the behavior on AMD Opteron and Intel Westmere platforms. During the first run of the test suite I observed kernel block INFO messages from the vdev_autotrim task (see Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt). Following the advice of a similar bug filed here, I changed SSDs and modified/disabled TRIM and NCQ using kernel options and udev rules. This did not make a huge difference; only the point at which the kernel got stuck varied. However, the second or third run of the test suite often hung at the add_prop_ashift test. I also noticed that after the first run of the test suite finished successfully, the system had a load > 2 even though it was completely idle.

Out of curiosity I decided to run the test suite against the ZFS 2.1.5 kernel module as supplied by Canonical (userland was still on version 2.2.0). I expected more failed and skipped tests, but the test suite hung with a null pointer dereference in the first run (Epyc_5.15.0-71-generic_zfs-2.1.5-run-1.txt).
I'm pretty sure the test suite works on the systems you have tested; on mine, however, it failed. We have used ZFS in production for years, but I never ran the test suite before, so this comes as a surprise ...

PS: in order to disable TRIM I set /sys/block/sdX/queue/discard_max_bytes to 0
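For reference, a minimal sketch of the TRIM/NCQ tweaks described above (sdX is a placeholder; the udev rule is only an illustrative assumption, while the libata.force values are standard kernel parameters matching the nocq/noncqtrim run names):

# disable discard/TRIM on one device by zeroing its maximum discard size
echo 0 | sudo tee /sys/block/sdX/queue/discard_max_bytes

# the same as a udev rule so it survives reboots (illustrative example)
# /etc/udev/rules.d/99-no-discard.rules:
#   ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/discard_max_bytes}="0"

# kernel command line options corresponding to the "nocq"/"noncqtrim" runs
#   libata.force=noncq       disable NCQ entirely
#   libata.force=noncqtrim   keep NCQ but disable queued TRIM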

Describe how to reproduce the problem

Have Ubuntu 22.04 installed on ZFS (mirrored rpool & bpool). Build and install a vanilla LTS kernel 6.1.59 using the config of the generic Ubuntu kernel. Build Debian packages from the 2.2.0 release source and install them using dkms. Run the test suite that comes with the source as documented (./scripts/zfs-tests.sh -vx) more than once.
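A rough sketch of those steps, assuming the stock OpenZFS packaging targets (see the OpenZFS build documentation; package and target names may differ slightly for your checkout):

# in the unpacked zfs-2.2.0 release source
sh autogen.sh                          # only needed for a git checkout; release tarballs ship configure
./configure
make -j"$(nproc)" deb                  # builds the DKMS and userland .deb packages
sudo apt purge zfsutils-linux zfs-zed  # remove the Ubuntu-provided packages first (names assumed)
sudo apt install ./*.deb
./scripts/zfs-tests.sh -vx             # run repeatedly; the hang usually needs more than one pass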

Include any warning/errors/backtraces from the system logs

The stack trace of the general protection fault looks like this:

[430711.264573] general protection fault, probably for non-canonical address 0xdead000000000122: 0000 [#1] PREEMPT SMP PTI
[430711.276575] CPU: 2 PID: 1945120 Comm: z_trim_iss Tainted: P           OE      6.1.59-20231023 #5
[430711.286506] Hardware name: Supermicro X8DTT-H/X8DTT-H, BIOS 2.2        02/20/2019
[430711.295127] RIP: 0010:vdev_queue_class_add+0x7e/0xf0 [zfs]
[430711.301922] Code: fd 07 75 3c 49 83 fd 08 77 5a 49 c1 e5 05 49 8b 57 18 48 8d 44 03 18 4a 8b 74 2b 10 4c 01 e6 49 89 77 18 48 89 06 48 89 56 08 <48> 89 32 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc
[430711.322888] RSP: 0018:ffffb979c95d3cc8 EFLAGS: 00010282
[430711.329260] RAX: ffff902dda8f3578 RBX: ffff902dda8f3480 RCX: 0000000000000007
[430711.337540] RDX: dead000000000122 RSI: ffff902bff141b80 RDI: ffff902dda8f3480
[430711.345813] RBP: ffffb979c95d3cf8 R08: 0000000000000000 R09: ffff902bff141cb0
[430711.354083] R10: ffff90371c9ec4d0 R11: 0000000000000000 R12: ffff902bff141900
[430711.362336] R13: 00000000000000e0 R14: 0000000000000080 R15: ffff902dda8f3568
[430711.370583] FS:  0000000000000000(0000) GS:ffff9036cfa80000(0000) knlGS:0000000000000000
[430711.379790] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[430711.386640] CR2: 0000560423327208 CR3: 0000000510a10006 CR4: 00000000000206e0
[430711.394886] Call Trace:
[430711.398431]  <TASK>
[430711.401613]  ? __die_body.cold+0x1a/0x1f
[430711.406616]  ? die_addr+0x3e/0x60
[430711.410996]  ? exc_general_protection+0x22d/0x4b0
[430711.416755]  ? asm_exc_general_protection+0x27/0x30
[430711.422683]  ? vdev_queue_class_add+0x7e/0xf0 [zfs]
[430711.428759]  ? recalibrate_cpu_khz+0x10/0x10
[430711.434056]  ? ktime_get_raw_ts64+0x47/0xe0
[430711.439258]  vdev_queue_io+0xc7/0x230 [zfs]
[430711.444613]  zio_vdev_io_start+0x1a8/0x360 [zfs]
[430711.450391]  ? _raw_spin_unlock_irqrestore+0x27/0x50
[430711.456356]  zio_execute+0x97/0x160 [zfs]
[430711.461511]  ? _raw_spin_lock_irqsave+0x28/0x60
[430711.467013]  taskq_thread+0x27d/0x490 [spl]
[430711.472172]  ? wake_up_q+0x90/0x90
[430711.476525]  ? zio_gang_tree_free+0x70/0x70 [zfs]
[430711.482347]  ? taskq_thread_spawn+0x60/0x60 [spl]
[430711.488007]  kthread+0xfd/0x130
[430711.492081]  ? kthread_complete_and_exit+0x20/0x20
[430711.497791]  ret_from_fork+0x22/0x30
[430711.502277]  </TASK>
[430711.505351] Modules linked in: nvme_fabrics nvme_core binfmt_misc intel_powerclamp coretemp kvm_intel kvm intel_cstate ipmi_si ipmi_devintf ipmi_msghandler serio_raw input_leds ioatdma dca i5500_temp i7core_edac mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables x_tables autofs4 zfs(POE) spl(OE) raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ib_uverbs ib_core mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops hid_generic mlx5_core gpio_ich ahci usbhid i2c_i801 mlxfw sha512_ssse3 psmouse drm lpc_ich libahci i2c_smbus psample hid e1000e tls pci_hyperv_intf [last unloaded: scsi_debug]
[430711.582163] ---[ end trace 0000000000000000 ]---
[430711.587559] RIP: 0010:vdev_queue_class_add+0x7e/0xf0 [zfs]
[430711.593958] Code: fd 07 75 3c 49 83 fd 08 77 5a 49 c1 e5 05 49 8b 57 18 48 8d 44 03 18 4a 8b 74 2b 10 4c 01 e6 49 89 77 18 48 89 06 48 89 56 08 <48> 89 32 48 83 c4 08 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc
[430711.614194] RSP: 0018:ffffb979c95d3cc8 EFLAGS: 00010282
[430711.620209] RAX: ffff902dda8f3578 RBX: ffff902dda8f3480 RCX: 0000000000000007
[430711.628130] RDX: dead000000000122 RSI: ffff902bff141b80 RDI: ffff902dda8f3480
[430711.636052] RBP: ffffb979c95d3cf8 R08: 0000000000000000 R09: ffff902bff141cb0
[430711.643961] R10: ffff90371c9ec4d0 R11: 0000000000000000 R12: ffff902bff141900
[430711.651844] R13: 00000000000000e0 R14: 0000000000000080 R15: ffff902dda8f3568
[430711.659722] FS:  0000000000000000(0000) GS:ffff9036cfa80000(0000) knlGS:0000000000000000
[430711.668545] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[430711.675025] CR2: 0000560423327208 CR3: 0000000510a10006 CR4: 00000000000206e0

The files:

Epyc_6.1.59-20231023_zfs-2.2.0-run-1.txt
Epyc_6.1.59-20231023_zfs-2.2.0-run-2.txt
Epyc_6.1.59-20231023_zfs-2.2.0-run-3.txt
Opteron_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_trimoff+nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Opteron_trimoff_6.1.59-20231023_zfs-2.2.0-run-1.txt
Opteron_trimoff_6.1.59-20231023_zfs-2.2.0-run-2.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-1.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-2.txt
Westmere_nocq+noncqtrim_6.1.59-20231023_zfs-2.2.0-run-3.txt

ZFS 2.1.5 modules and 2.2.0 userland (see above):

Epyc_5.15.0-71-generic_zfs-2.1.5-run-1.txt Westmere_5.15.0-71-generic_zfs-2.1.5-run-1.txt

behlendorf commented 12 months ago

Something very similar was recently hit by the CI:

https://build.openzfs.org/builders/Fedora%2038%20x86_64%20%28TEST%29/builds/900/steps/shell_4/logs/console

paul20230815 commented 11 months ago

After looking at Brian's log I continued testing differently.

Summary:

The results seem to indicate a hard-to-catch but serious issue.

Testbed:

To speed things up I switched to the upcoming Ubuntu 24.04 (development) release, since it comes with ZFS built in (currently 2.2.0rc3). The test suite runs on files and loop devices, so it exercises the ZFS code and also stresses the file system underneath. Because the log file collected by the test suite doesn't contain kernel hang/oops messages, I ran the tests with the -vxK options and collected the syslog. I then ran the test suite in each of the configurations listed below.
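A hedged sketch of that invocation: -v is verbose, -x removes leftover test pools and loop devices from earlier runs, and -K tags each test name into the kernel log, which is what produces the "ZTS run ..." lines in the syslog excerpts later in this thread (see ./scripts/zfs-tests.sh -h for the authoritative option meanings):

./scripts/zfs-tests.sh -vxK 2>&1 | tee zts-run.log
# kernel messages are collected separately, since the ZTS log has no oops/hung_task output
journalctl -k -b > dmesg-run.log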

Results:

On an ext4 root filesystem: the first run showed no problems; the second run showed a general protection fault during zpool_split_vdevs. The system remained stable. Runs 3 and 4 did not show a problem, and the ext4 filesystem was clean after reboot. The test results, however, were never identical (run-1, run-2, run-3, run-4).

On an ext4 root filesystem, but with /var/tmp as a zpool (default options): the first run showed no problems; the second run showed a general protection fault during zpool_add_dryrun_output. All ZFS operations hung and the machine needed a reboot. The ext4 filesystem had to replay its journal during boot.

On an ext4 root filesystem, but with /var/tmp as a zpool (-o version=28): to check whether pool features play a role I repeated the previous test, but created the zpool with -o version=28. The first three runs did not show a problem, although the test results were never identical (run-1, run-2, run-3). Run 4 ended with a general protection fault during zpool_split_vdevs. All ZFS operations hung and the machine needed a reboot.

On a ZFS root filesystem: the Ubuntu installer has an experimental feature to create the root filesystem on ZFS. I used it to do an install and ran the ZFS test suite on a ZFS root filesystem. During the first run the system showed several hung_task INFO messages during the zpool_add_004_pos test; further tests could not run because all ZFS operations hung. The system load was 6 without any user activity, and strace of the process in question got stuck (scroll to the end). I repeated the test after rebooting, with very similar results: hung_task INFO messages during the zfs_copies_003_pos test; this time a few more tests finished after the hung_task messages, but the system got stuck with load 16 later on and all ZFS operations hung (scroll to the end).

On my laptop with Ubuntu 20.04: since I suspected a strange error that has been lurking around for a longer time, I ran the test suite of ZFS 0.8.3 on my Ubuntu laptop (ZFS root filesystem). It also got stuck during the tests, but I couldn't save the output. top running in another terminal showed load 33, with /usr/sbin/zed -F and sbin/mdadm --examine --scan --config=partitions running at 100% CPU.

On an ext4 root, but with the ZFS bwatkinson directIO master from 11/10: to verify the latest changes I took the directIO branch (because that's what I'm waiting for ...) and ran the test suite on an ext4 root file system. I got lots of UBSAN messages, but the test suite ran up to zpool_add_001_pos, showed hung_task INFO messages, and got stuck with load 4 and an l2arc_feed process at 100% CPU (scroll to the end).

The files:

(the dmesg files are actually taken from a filtered syslog, don't be confused ;-)

Kitt3120 commented 11 months ago

Wow, that's a lot. I hope that this will be helpful to the devs. Good job!

paul20230815 commented 11 months ago

Update: since the problem is hard to reproduce I focused on making it easier to catch.

1. test on real hard disks

20 iterations of the test suite did not show a problem! Test environment: IBM x3650, Ubuntu 24.04, Ubuntu kernel 6.5, root on ext4, tests on ZFS (a mirrored stripe of 4 x 300 GB SAS disks on an LSI 2008 IT-mode controller). Promising :-)
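A minimal loop of the kind used for such repeated runs (the per-run log file naming is mine, not taken from the report):

for i in $(seq 1 20); do
    ./scripts/zfs-tests.sh -vx 2>&1 | tee "zts-run-$i.log"
    journalctl -k -b > "dmesg-run-$i.log"
done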

2. test different underlying file systems

Test environment: Tyan S8036GM2NE, Ubuntu 24.04, Ubuntu kernel 6.5, root on ext4, test suite ran on /var/zfs.

/var/zfs on ext4 - general protection fault in the second run during zpool_split_dryrun_output (dmesg-run-2.log)

/var/zfs on tmpfs - general protection fault in the eighth run during zpool_split_vdevs (dmesg-run-3-8.log)

I suspected the TRIM code, being a difference between SSDs and disks, and so I modified the ZFS code (module/os/linux/zfs/vdev_file.c: vd->vdev_has_trim = B_FALSE; module/os/linux/zfs/vdev_disk.c: v->vdev_has_trim = B_FALSE). That made all TRIM tests FAIL, but it did not help :-(
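For the record, a sketch of applying that tweak with sed rather than editing by hand; it assumes the assignments look as quoted above and simply forces them to B_FALSE before the modules are rebuilt:

# in the ZFS source tree, before rebuilding the kernel modules
sed -i 's/vd->vdev_has_trim = .*;/vd->vdev_has_trim = B_FALSE;/' module/os/linux/zfs/vdev_file.c
sed -i 's/v->vdev_has_trim = .*;/v->vdev_has_trim = B_FALSE;/' module/os/linux/zfs/vdev_disk.c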

3. test new release 2.2.1

Due to the data corruption bug that kept most of you busy, I switched to release 2.2.1 using kernel 6.1.63 and applied the patch (#15571) manually. (Test environment: Tyan S8036GM2NE, Ubuntu 24.04, vanilla kernel 6.1.63, root on ext4, test suite ran on /var/zfs backed by ZFS [a mirrored stripe of 4 NVMe SSDs, foo_details.txt].)
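A sketch of how such a fix can be applied on top of a release tarball, assuming #15571 refers to the upstream pull request (paths are examples):

cd zfs-2.2.1
curl -L https://github.com/openzfs/zfs/pull/15571.patch | patch -p1
# then rebuild and reinstall the DKMS and userland packages as before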

The test suite repeatedly got stuck in the first iteration during zpool_add_004_pos! It looks like this:

2023-12-01T10:41:29.411111+01:00 neo4 kernel: [  280.086777] ZTS run /usr/share/zfs/zfs-tests/tests/functional/cli_root/zpool_add/zpool_add_004_pos
2023-12-01T10:41:29.489401+01:00 neo4 zed: eid=614 class=pool_create pool='testpool'
2023-12-01T10:41:29.562934+01:00 neo4 zed: eid=655 class=config_sync pool='testpool'
2023-12-01T10:41:29.582807+01:00 neo4 zed: eid=656 class=pool_create pool='testpool1'
2023-12-01T10:41:29.649915+01:00 neo4 zed: eid=697 class=config_sync pool='testpool1'
2023-12-01T10:41:31.022738+01:00 neo4 kernel: [  281.702116]  zd0: p1 p9
2023-12-01T10:41:31.245478+01:00 neo4 zed: eid=700 class=config_sync pool='testpool'
2023-12-01T10:41:31.286425+01:00 neo4 zed: eid=701 class=config_sync pool='testpool'
2023-12-01T10:41:31.288226+01:00 neo4 zed: eid=702 class=vdev_add pool='testpool'
2023-12-01T10:41:41.307137+01:00 neo4 zed: eid=703 class=vdev.unknown pool='testpool' vdev=testvol-part1
2023-12-01T10:41:41.308527+01:00 neo4 zed: eid=704 class=statechange pool='testpool' vdev=testvol-part1 vdev_state=UNAVAIL
2023-12-01T10:41:41.314477+01:00 neo4 zed: error: statechange-notify.sh: eid=704: "mail" not installed
2023-12-01T10:41:42.350104+01:00 neo4 systemd[1]: testpool.mount: Deactivated successfully.
2023-12-01T10:41:42.375666+01:00 neo4 zed: eid=705 class=pool_destroy pool='testpool' pool_state=DESTROYED
2023-12-01T10:41:42.386800+01:00 neo4 kernel: [  293.065357] WARNING: Pool 'testpool' has encountered an uncorrectable I/O failure and has been suspended.
2023-12-01T10:41:42.386820+01:00 neo4 kernel: [  293.065357] 
2023-12-01T10:41:42.392563+01:00 neo4 zed: eid=706 class=io_failure pool='testpool' pool_state=DESTROYED
2023-12-01T10:44:56.843605+01:00 neo4 kernel: [  487.510968] INFO: task agents:2288 blocked for more than 120 seconds.
2023-12-01T10:44:56.843625+01:00 neo4 kernel: [  487.517496]       Tainted: P           OE      6.6.2-20231123 #1
2023-12-01T10:44:56.843640+01:00 neo4 kernel: [  487.523590] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2023-12-01T10:44:56.860947+01:00 neo4 kernel: [  487.531522] task:agents          state:D stack:0     pid:2288  ppid:1      flags:0x00000002
2023-12-01T10:44:56.860960+01:00 neo4 kernel: [  487.531541] Call Trace:
2023-12-01T10:44:56.860962+01:00 neo4 kernel: [  487.531551]  <TASK>
2023-12-01T10:44:56.860963+01:00 neo4 kernel: [  487.531569]  __schedule+0x422/0x1680
2023-12-01T10:44:56.860964+01:00 neo4 kernel: [  487.531587]  ? init_misc_binfmt+0x9d/0xff0 [binfmt_misc]
2023-12-01T10:44:56.860965+01:00 neo4 kernel: [  487.531607]  ? stack_trace_call+0x9c/0x5c0
2023-12-01T10:44:56.860965+01:00 neo4 kernel: [  487.531623]  ? post_alloc_hook+0xcd/0x120
2023-12-01T10:44:56.860966+01:00 neo4 kernel: [  487.531657]  schedule+0x63/0x110
2023-12-01T10:44:56.860967+01:00 neo4 kernel: [  487.531672]  schedule_preempt_disabled+0x15/0x30
2023-12-01T10:44:56.860968+01:00 neo4 kernel: [  487.531685]  __mutex_lock.constprop.0+0x42f/0x740
2023-12-01T10:44:56.860968+01:00 neo4 kernel: [  487.531705]  ? __mutex_lock_slowpath+0x5/0x20
2023-12-01T10:44:56.860969+01:00 neo4 kernel: [  487.531728]  __mutex_lock_slowpath+0x13/0x20
2023-12-01T10:44:56.860970+01:00 neo4 kernel: [  487.531740]  mutex_lock+0x3c/0x50
2023-12-01T10:44:56.860971+01:00 neo4 kernel: [  487.531755]  spa_open_common+0x61/0x450 [zfs]
2023-12-01T10:44:56.860971+01:00 neo4 kernel: [  487.532092]  ? spa_open_common+0x5/0x450 [zfs]
2023-12-01T10:44:56.860972+01:00 neo4 kernel: [  487.532410]  spa_get_stats+0x57/0x430 [zfs]
2023-12-01T10:44:56.860973+01:00 neo4 kernel: [  487.532707]  ? srso_alias_return_thunk+0x5/0x7f
2023-12-01T10:44:56.860974+01:00 neo4 kernel: [  487.532734]  ? spa_get_stats+0x5/0x430 [zfs]
2023-12-01T10:44:56.860974+01:00 neo4 kernel: [  487.533037]  zfs_ioc_pool_stats+0x41/0xa0 [zfs]
2023-12-01T10:44:56.860975+01:00 neo4 kernel: [  487.533343]  zfsdev_ioctl_common+0x8a9/0x9f0 [zfs]
2023-12-01T10:44:56.860976+01:00 neo4 kernel: [  487.533659]  zfsdev_ioctl+0x57/0xf0 [zfs]
2023-12-01T10:44:56.860977+01:00 neo4 kernel: [  487.533932]  __x64_sys_ioctl+0xa3/0xf0
2023-12-01T10:44:56.860977+01:00 neo4 kernel: [  487.533953]  do_syscall_64+0x5c/0x90
2023-12-01T10:44:56.860978+01:00 neo4 kernel: [  487.533964]  ? srso_alias_return_thunk+0x5/0x7f
2023-12-01T10:44:56.860979+01:00 neo4 kernel: [  487.533976]  ? exc_page_fault+0x94/0x1b0
2023-12-01T10:44:56.860980+01:00 neo4 kernel: [  487.533994]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
2023-12-01T10:44:56.860980+01:00 neo4 kernel: [  487.534007] RIP: 0033:0x7985057238ef
2023-12-01T10:44:56.860981+01:00 neo4 kernel: [  487.534032] RSP: 002b:00007984ffff8450 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
2023-12-01T10:44:56.860982+01:00 neo4 kernel: [  487.534046] RAX: ffffffffffffffda RBX: 00007984f80031b0 RCX: 00007985057238ef
2023-12-01T10:44:56.860982+01:00 neo4 kernel: [  487.534055] RDX: 00007984ffff84d0 RSI: 0000000000005a05 RDI: 0000000000000009
2023-12-01T10:44:56.860983+01:00 neo4 kernel: [  487.534063] RBP: 00007984ffffbac0 R08: ffffffffffffdf20 R09: 0000000000000000
2023-12-01T10:44:56.860984+01:00 neo4 kernel: [  487.534070] R10: 0000000000000007 R11: 0000000000000246 R12: 00007984f8023840
2023-12-01T10:44:56.860984+01:00 neo4 kernel: [  487.534078] R13: 00007984ffff84d0 R14: 00007984ffffbae4 R15: 0000558b798abfb0
2023-12-01T10:44:56.860985+01:00 neo4 kernel: [  487.534114]  </TASK>

No further ZFS operations are possible; at shutdown the system can't unmount the file systems and the shutdown is forced by systemd (zfs_2.2.1_run1.txt). I removed this test from the test suite, and it then got stuck in zpool_add_003_pos in the third iteration. I also removed zpool_add_003_pos from the test suite; it then ran through once, but with hung_tasks in the procfs/pool_state test. The second time it showed hung_tasks again in the procfs/pool_state test and hung after a general protection fault in the zpool_split_vdevs test (zfs_2.2.1p_syslog.txt).

5. test release 2.2.2

After release 2.2.2 came out today, I tested whether the problem is still around. I cut the test suite down to just the tests in functional/cli_root/zpool_add (bug.run) and could immediately verify that zpool_add_004_pos makes ZFS hang (zfs_2.2.2_syslog). After removing zpool_add_004_pos, I ran > 100 iterations of the functional/cli_root/zpool_add tests with no problem.
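A sketch of narrowing the suite this way; the -T tag filter is an assumption about zfs-tests.sh (bug.run in the report is a custom runfile containing only the zpool_add tests):

# run only the zpool_add test group
./scripts/zfs-tests.sh -vx -T zpool_add

# iterate it until it hangs
for i in $(seq 1 100); do ./scripts/zfs-tests.sh -vx -T zpool_add || break; done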

My summary/suspicion:

NOTICE: l2arc_write_max or l2arc_write_boost plus the overhead of log blocks (persistent L2ARC, 520911519744 bytes) exceeds the size of the cache device (guid 2142287040652473574), resetting them to the default (8388608)

While data corruption is a truly disturbing problem in a file system, stability problems are almost as concerning if you want to run production workloads.

paul20230815 commented 2 months ago

I'm closing this issue. I tested release 2.2.3 of OpenZFS and could not reproduce the issues reported here. While the problem is probably still present in the older ZFS releases, I have now found a stable release I can trust for production, hence I no longer need a fix for release 2.2.0. I should note, though, that the successful tests were made with ZED disabled (see #16468).