openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

zed aborts after assertion failure in udev_device_get_sysattr_value #16705

Closed Uglymotha closed 3 weeks ago

Uglymotha commented 1 month ago

System information

Distribution Name | custom linux
Distribution Version | n/a
Kernel Version | 6.11.5
Architecture | x86_64
OpenZFS Version | 2.2.6

zed aborts (SIGABRT) after an assertion failure in libudev:

Oct 29 16:57:07 rdsan01 zed[18154]: Assertion 'udev_device' failed at src/libudev/libudev-device.c:742, function udev_device_get_sysattr_value(). Aborting.
Oct 29 16:57:07 rdsan01 systemd[1]: zfs-zed.service: Main process exited, code=dumped, status=6/ABRT
Oct 29 16:57:07 rdsan01 systemd[1]: zfs-zed.service: Failed with result 'core-dump'.
Oct 29 16:57:07 rdsan01 systemd[1]: zfs-zed.service: Scheduled restart job, restart counter is at 7.
Oct 29 16:57:07 rdsan01 systemd[1]: zfs-zed.service: Start request repeated too quickly.
Oct 29 16:57:07 rdsan01 systemd[1]: zfs-zed.service: Failed with result 'core-dump'.

Describe how to reproduce the problem

This happens during udev triggering (udevadm trigger -s block).

Include any warning/errors/backtraces from the system logs

Process 30394 (zed) of user 0 dumped core.

Module libcap.so.2 without build-id.
Module libresolv.so.2 without build-id.
Module libkeyutils.so.1 without build-id.
Module libkrb5support.so.0 without build-id.
Module libgmp.so.10 without build-id.
Module ld-linux-x86-64.so.2 without build-id.
Module libuuid.so.1 without build-id.
Module libudev.so.1 without build-id.
Module libz.so.1 without build-id.
Module libgcc_s.so.1 without build-id.
Module libc.so.6 without build-id.
Module libunwind.so.8 without build-id.
Module libcom_err.so.2 without build-id.
Module libk5crypto.so.3 without build-id.
Module libkrb5.so.3 without build-id.
Module libgssapi_krb5.so.2 without build-id.
Module libtirpc.so.3 without build-id.
Module libnvpair.so.3 without build-id.
Module libcrypto.so.3 without build-id.
Module libm.so.6 without build-id.
Module libuutil.so.3 without build-id.
Module libblkid.so.1 without build-id.
Module libzfs_core.so.3 without build-id.
Module libzfs.so.4 without build-id.
Module zed without build-id.

Stack trace of thread 31364:

0 0x00007f17c40e9e7c __pthread_kill_implementation (libc.so.6 + 0x8de7c)

1 0x00007f17c409b3b2 raise (libc.so.6 + 0x3f3b2)

2 0x00007f17c40844ad abort (libc.so.6 + 0x284ad)

3 0x00007f17c3fca995 log_assert_failed.cold (libudev.so.1 + 0x8995)

4 0x00007f17c3ff0077 log_assert_failed_return (libudev.so.1 + 0x2e077)

5 0x00007f17c3fcbc9f udev_device_get_sysattr_value (libudev.so.1 + 0x9c9f)

6 0x0000561ddc78648e zed_udev_monitor (zed + 0xc48e)

7 0x00007f17c40e81b2 start_thread (libc.so.6 + 0x8c1b2)

8 0x00007f17c4162288 __clone3 (libc.so.6 + 0x106288)

Stack trace of thread 30394:

0 0x00007f17c415dfdb ioctl (libc.so.6 + 0x101fdb)

1 0x00007f17c4b2ca2c zpool_events_next (libzfs.so.4 + 0x45a2c)

2 0x0000561ddc786e7b zed_event_service (zed + 0xce7b)

3 0x0000561ddc784bd8 main (zed + 0xabd8)

4 0x00007f17c4085d7a __libc_start_call_main (libc.so.6 + 0x29d7a)

5 0x00007f17c4085e35 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29e35)

6 0x0000561ddc784561 _start (zed + 0xa561)

Stack trace of thread 31363:

0 0x00007f17c415dfdb ioctl (libc.so.6 + 0x101fdb)

1 0x00007f17c4b133dd zpool_refresh_stats (libzfs.so.4 + 0x2c3dd)

2 0x00007f17c4b26b65 zpool_open_silent (libzfs.so.4 + 0x3fb65)

3 0x00007f17c4b136d0 zpool_iter (libzfs.so.4 + 0x2c6d0)

4 0x0000561ddc78d1a1 zfs_slm_event (zed + 0x131a1)

5 0x0000561ddc78b09b zfs_agent_consumer_thread (zed + 0x1109b)

6 0x00007f17c40e81b2 start_thread (libc.so.6 + 0x8c1b2)

7 0x00007f17c4162288 __clone3 (libc.so.6 + 0x106288)

ELF object binary architecture: AMD x86-64

core.zed.0.e9cc196a28654a98a7139ee0d030939f.30394.1730291497000000.zip

Uglymotha commented 3 weeks ago

mkdir /tmp/a
cd /tmp/a
xz -dc /boot/ugly-linux-main/initrd-6.11-ugly-linux-main | cpio -di

find . | cpio -H newc -o | xz -T0 --check=crc32 > /boot/ugly-linux-main/initrd-6.11-ugly-linux-main
systemctl reboot

texinfo libltdl-dev tk pp (libperl.so -> aarch64-linux-gnu/libperl.so.5...) gawk lzip build-essential bison flex

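Found the culprit in dev_event_nvlist(struct udev_device *dev): in certain cases, such as DM-CRYPT-PLAIN devices, there is no parent device, so udev_device_get_sysattr_value() is called with a NULL parent_dev and libudev's assertion fires.

Checking the parent for NULL first fixes the issue: if (parent_dev != NULL && (value = udev_device_get_sysattr_value(parent_dev, "size"))

I will submit a PR for this.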
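The guard described above can be illustrated with a minimal, self-contained sketch. It does not use libudev: struct mock_dev, mock_get_size_attr() and mock_parent_size() are hypothetical stand-ins invented for this illustration, and the 512-byte sector multiplier is an assumption based on how the block "size" sysfs attribute is normally reported. The stand-in getter asserts on a NULL device, much like the libudev build in the report, so the order of the two conditions matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct udev_device; not the libudev type. */
struct mock_dev {
	const char *size_attr;		/* "size" in 512-byte sectors, or NULL */
	struct mock_dev *parent;	/* NULL when the device has no parent */
};

/* Mimics a getter that asserts on a NULL device, as seen in the report. */
static const char *
mock_get_size_attr(const struct mock_dev *dev)
{
	assert(dev != NULL);
	return (dev->size_attr);
}

/*
 * Store the parent's size in bytes in *out and return 0, or return -1
 * when the device has no parent or the parent has no "size" attribute.
 */
static int
mock_parent_size(const struct mock_dev *dev, uint64_t *out)
{
	const struct mock_dev *parent_dev = dev->parent;
	const char *value;

	/* Check for NULL first so the getter never sees a missing parent. */
	if (parent_dev != NULL &&
	    (value = mock_get_size_attr(parent_dev)) != NULL) {
		*out = 512ULL * strtoull(value, NULL, 10);
		return (0);
	}
	return (-1);
}
```

Because && short-circuits, a device without a parent skips the getter entirely and the caller simply gets no parent size, instead of the process aborting.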

However, my troubleshooting raises a new question. DM-CRYPT-PLAIN devices seem to behave much like multipath devices: first an add is received for the device, followed by a change with the correct information (see the log below). Should this EC_DEV_STATUS be handled as an EC_DEV_ADD, just like for multipath devices?

Nov 3 18:04:02 santest zed[2553]: zed_udev_monitor: 0x7fd050002340, add, /dev/dm-4, disk
Nov 3 18:04:02 santest zed[2553]: zed_udev_monitor: /dev/dm-4 no devid source

Nov 3 18:04:02 santest zed[2553]: zed_udev_monitor: 0x7fd0500056d0, change, /dev/dm-4, disk
Nov 3 18:04:02 santest zed[2553]: #011class: EC_dev_status
Nov 3 18:04:02 santest zed[2553]: #011subclass: dev_dle
Nov 3 18:04:02 santest zed[2553]: #011dev_name: /dev/dm-4
Nov 3 18:04:02 santest zed[2553]: #011path: /devices/virtual/block/dm-4
Nov 3 18:04:02 santest zed[2553]: #011devid: dm-uuid-CRYPT-PLAIN-storage1
Nov 3 18:04:02 santest zed[2553]: #011phys_path: /dev/disk/by-uuid/3533779146875541629
Nov 3 18:04:02 santest zed[2553]: #011dev_size: 17179869184
Nov 3 18:04:02 santest zed[2553]: #011pool_guid: 3533779146875541629
Nov 3 18:04:02 santest zed[2553]: #011vdev_guid: 11766088279060322789
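To make the question about promoting the change event concrete, here is a rough sketch of what such a classification could look like. It is not zed's actual code: the enum, mock_classify(), and the promote_crypt switch are hypothetical names, and matching on the "mpath-" and "CRYPT-PLAIN-" DM UUID prefixes is an assumption (the latter inferred from the devid dm-uuid-CRYPT-PLAIN-storage1 in the log above).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical event classes for this sketch; not zed's definitions. */
enum mock_class { MOCK_DEV_ADD, MOCK_DEV_STATUS, MOCK_IGNORED };

/* Assumed DM UUID prefixes; treat both as illustrative. */
#define	MOCK_MPATH_PREFIX	"mpath-"
#define	MOCK_CRYPT_PREFIX	"CRYPT-PLAIN-"

static int
mock_has_prefix(const char *s, const char *prefix)
{
	return (strncmp(s, prefix, strlen(prefix)) == 0);
}

/*
 * Classify a udev action. A "change" on a multipath device is promoted
 * to an add; a "change" on a CRYPT-PLAIN device is promoted only when
 * promote_crypt is nonzero, which is the behaviour being asked about.
 */
static enum mock_class
mock_classify(const char *action, const char *dm_uuid, int promote_crypt)
{
	if (action == NULL)
		return (MOCK_IGNORED);
	if (strcmp(action, "add") == 0)
		return (MOCK_DEV_ADD);
	if (strcmp(action, "change") != 0)
		return (MOCK_IGNORED);
	if (dm_uuid != NULL) {
		if (mock_has_prefix(dm_uuid, MOCK_MPATH_PREFIX))
			return (MOCK_DEV_ADD);
		if (promote_crypt &&
		    mock_has_prefix(dm_uuid, MOCK_CRYPT_PREFIX))
			return (MOCK_DEV_ADD);
	}
	return (MOCK_DEV_STATUS);
}
```

With promote_crypt set to 0 the sketch reproduces what the log shows (the change arrives as a status event); setting it to 1 models the proposed handling. Whether that is the right behaviour for zed is exactly the open question.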