The guest boot issue is not limited to first boot (though it seems to happen more often there; anecdotally, 6-8% of the time). The guest also segfaulted when put in a reboot loop (and came up just fine in a subsequent reboot), e.g.
...
Jun 29 01:12:49 sysbench-mysql7 multipathd[419]: --------start up--------
Jun 29 01:12:49 sysbench-mysql7 multipathd[419]: read /etc/multipath.conf
Jun 29 01:12:49 sysbench-mysql7 multipathd[419]: path checkers start up
Jun 29 01:12:49 sysbench-mysql7 systemd[1]: Starting Rule-based Manager for Device Events and Files...
Jun 29 01:12:49 sysbench-mysql7 systemd-udevd[430]: systemd: malloc.c:4302: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
Jun 29 01:12:49 sysbench-mysql7 systemd[1]: Started .
Jun 29 01:12:49 sysbench-mysql7 kernel: systemd[1]: segfault at b8 ip 000055f7af1e26b4 sp 00007fffe05712a0 error 4 in systemd[55f7af105000+e0000]
Jun 29 01:12:49 sysbench-mysql7 kernel: Code: 00 00 00 00 00 90 48 8b 8c 24 a0 00 00 00 48 8b 7c 24 70 31 d2 48 89 de e8 89 74 f2 ff 84 c0 0f 84 1f fd ff ff 48 8b 44 24 08 <48> 8b b8 b8 00 00 00 48 85 ff 74 d0 e8 5b 0d f6 ff 48 8b 44 24 08
Jun 29 01:12:49 sysbench-mysql7 systemd[1]: Caught <SEGV>, dumped core as pid 432.
Jun 29 01:12:49 sysbench-mysql7 systemd[1]: Freezing execution.
-- Boot bec512c090df44b586508f845e8fbe18 --
Jun 29 01:15:06 sysbench-mysql7 kernel: Linux version 5.15.0-71-generic (buildd@lcy02-amd64-044) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #78-Ubuntu SMP Tue Apr 18 09:00:29 UTC 2023 (Ubuntu 5.15.0-71.78-generic 5.15.92)
Jun 29 01:15:06 sysbench-mysql7 kernel: Command line: BOOT_IMAGE=/boot/vmlinuz-5.15.0-71-generic root=LABEL=cloudimg-rootfs ro console=tty1 console=ttyS0
...
...
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Starting Set console font and keymap...
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Starting Create final runtime dir for shutdown pivot root...
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Condition check resulted in Store a System Token in an EFI Variable being skipped.
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Condition check resulted in Commit a transient machine-id on disk being skipped.
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Starting Create Volatile Files and Directories...
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Starting Uncomplicated firewall...
Jun 29 01:42:12 sysbench-mysql7 kernel: systemd-tmpfile[532]: segfault at 55c17df37000 ip 000055c17df37000 sp 00007ffca8325a58 error 15 in systemd-tmpfiles[55c17df37000+4000]
Jun 29 01:42:12 sysbench-mysql7 kernel: Code: Unable to access opcode bytes at RIP 0x55c17df36fd6.
Jun 29 01:42:12 sysbench-mysql7 systemd[1]: Finished Set console font and keymap.
...
...
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Set console font and keymap...
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Create final runtime dir for shutdown pivot root...
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Tell Plymouth To Write Out Runtime Data...
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Condition check resulted in Set Up Additional Binary Formats being skipped.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Condition check resulted in Store a System Token in an EFI Variable being skipped.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Condition check resulted in Commit a transient machine-id on disk being skipped.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Create Volatile Files and Directories...
Jun 29 01:53:58 sysbench-mysql7 apparmor.systemd[527]: Restarting AppArmor
Jun 29 01:53:58 sysbench-mysql7 apparmor.systemd[527]: Reloading AppArmor profiles
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Uncomplicated firewall...
Jun 29 01:53:58 sysbench-mysql7 kernel: systemd-tmpfile[536]: segfault at 5624c56a8f83 ip 00005624840f7231 sp 00007ffe64e18420 error 4 in systemd-tmpfiles[5624840f0000+e000]
Jun 29 01:53:58 sysbench-mysql7 kernel: Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0f <84> 80 03 00 00 41 0f b6 44 24 01 45 31 c9 31 ff 45 31 c0 84 c0 0f
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Finished Set console font and keymap.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Finished Create final runtime dir for shutdown pivot root.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Finished Tell Plymouth To Write Out Runtime Data.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: systemd-tmpfiles-setup.service: Main process exited, code=killed, status=11/SEGV
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: systemd-tmpfiles-setup.service: Failed with result 'signal'.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Failed to start Create Volatile Files and Directories.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Finished Uncomplicated firewall.
Jun 29 01:53:58 sysbench-mysql7 systemd[1]: Starting Network Time Synchronization...
...
Using propolis-standalone on commodity hardware, I've reproduced something similar to this a few times:
[ 4.591020] systemd[1]: segfault at 10 ip 00007fc6465e0b41 sp 00007ffccd3249f0 error 6 in libc.so.6[7fc646565000+195000]
[ 4.592577] Code: 00 48 39 ce 0f 84 07 06 00 00 4c 8b 61 18 4a 83 4c 29 08 01 4c 39 cb 74 05 48 83 49 08 04 4c 8b 34 24 4c 89 62 08 41 83 c2 01 <49> 89 74 24 10 4c 89 71 18 4c 8d 71 10 4c 89 74 24 08 49 c1 ee 0c
[ 4.935831] systemd[1]: segfault at 2b ip 00007fb3c0868040 sp 00007ffe02870c38 error 6 in libsystemd-shared-249.so[7fb3c0766000+1a9000]
[ 4.939153] Code: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 <00> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 5.177656] systemd[1]: segfault at 0 ip 00007ff3ffa83089 sp 00007fff14f61820 error 4 in libsystemd-shared-249.so[7ff3ff9a8000+1a9000]
[ 5.181300] Code: 48 89 44 24 38 31 c0 f6 47 2f 04 48 8d 05 6f a1 18 00 48 89 e5 48 0f 44 f0 48 89 ef e8 50 99 01 00 48 8b 03 48 89 ee 4c 89 e7 <ff> 10 48 89 ef e8 4d 9c 01 00 0f b6 53 2f f6 c2 04 74 24 8b 4b 24
For two of them (so far), I also followed up to check that the %rip
of the fault corresponded with seemingly sensible program text:
[ 6.695669] systemd[1]: segfault at 10 ip 00007f68a0fd5b41 sp 00007ffc956aa800 error 6 in libc.so.6[7f68a0f5a000+195000]
[ 6.699615] Code: 00 48 39 ce 0f 84 07 06 00 00 4c 8b 61 18 4a 83 4c 29 08 01 4c 39 cb 74 05 48 83 49 08 04 4c 8b 34 24 4c 89 62 08 41 83 c2 01 <49> 89 74 24 10 4c 89 71 18 4c 8d 71 10 4c 89 74 24 08 49 c1 ee 0c
[ 116.498129] systemd-journald[376]: Failed to send WATCHDOG=1 notification message: Connection refused
From objdump:
a3b0f: 4d 39 dc cmp %r11,%r12
a3b12: 0f 83 10 06 00 00 jae a4128 <__default_morecore@GLIBC_2.2.5+0x1af8>
a3b18: 48 39 ce cmp %rcx,%rsi
a3b1b: 0f 84 07 06 00 00 je a4128 <__default_morecore@GLIBC_2.2.5+0x1af8>
a3b21: 4c 8b 61 18 mov 0x18(%rcx),%r12
a3b25: 4a 83 4c 29 08 01 orq $0x1,0x8(%rcx,%r13,1)
a3b2b: 4c 39 cb cmp %r9,%rbx
a3b2e: 74 05 je a3b35 <__default_morecore@GLIBC_2.2.5+0x1505>
a3b30: 48 83 49 08 04 orq $0x4,0x8(%rcx)
a3b35: 4c 8b 34 24 mov (%rsp),%r14
a3b39: 4c 89 62 08 mov %r12,0x8(%rdx)
a3b3d: 41 83 c2 01 add $0x1,%r10d
a3b41: 49 89 74 24 10 mov %rsi,0x10(%r12) <<<<<<<< instruction in question
a3b46: 4c 89 71 18 mov %r14,0x18(%rcx)
a3b4a: 4c 8d 71 10 lea 0x10(%rcx),%r14
a3b4e: 4c 89 74 24 08 mov %r14,0x8(%rsp)
a3b53: 49 c1 ee 0c shr $0xc,%r14
a3b57: 4f 33 34 f8 xor (%r8,%r15,8),%r14
a3b5b: 4c 89 71 10 mov %r14,0x10(%rcx)
a3b5f: 4c 8b 74 24 08 mov 0x8(%rsp),%r14
a3b64: 4c 89 e1 mov %r12,%rcx
a3b67: 4f 89 34 f8 mov %r14,(%r8,%r15,8)
a3b6b: 66 45 89 14 78 mov %r10w,(%r8,%rdi,2)
a3b70: eb 99 jmp a3b0b <__default_morecore@GLIBC_2.2.5+0x14db>
a3b72: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
[ 166.630969] systemd[1]: segfault at 580 ip 000055e0c1ed7577 sp 00007fff1b1631a0 error 4 in systemd[55e0c1e4d000+e0000]
[ 166.632507] Code: 01 00 00 48 63 de 83 fb 08 0f 8f e4 01 00 00 83 fb 05 0f 8f 43 02 00 00 49 89 d7 48 85 d2 0f 84 5f 02 00 00 48 8b 02 49 89 fd <44> 8b 98 80 05 00 00 45 85 db 0f 8f c9 00 00 00 41 8b 47 0c 8d 50
From objdump:
c054a: 0f 84 c8 01 00 00 je c0718 <log_oom_internal@plt+0x81548>
c0550: 48 63 de movslq %esi,%rbx
c0553: 83 fb 08 cmp $0x8,%ebx
c0556: 0f 8f e4 01 00 00 jg c0740 <log_oom_internal@plt+0x81570>
c055c: 83 fb 05 cmp $0x5,%ebx
c055f: 0f 8f 43 02 00 00 jg c07a8 <log_oom_internal@plt+0x815d8>
c0565: 49 89 d7 mov %rdx,%r15
c0568: 48 85 d2 test %rdx,%rdx
c056b: 0f 84 5f 02 00 00 je c07d0 <log_oom_internal@plt+0x81600>
c0571: 48 8b 02 mov (%rdx),%rax
c0574: 49 89 fd mov %rdi,%r13
c0577: 44 8b 98 80 05 00 00 mov 0x580(%rax),%r11d <<<<<<<<<<<<< instr in question
c057e: 45 85 db test %r11d,%r11d
c0581: 0f 8f c9 00 00 00 jg c0650 <log_oom_internal@plt+0x81480>
c0587: 41 8b 47 0c mov 0xc(%r15),%eax
c058b: 8d 50 ff lea -0x1(%rax),%edx
c058e: 83 fa 05 cmp $0x5,%edx
c0591: 41 0f 96 c6 setbe %r14b
c0595: 83 f8 05 cmp $0x5,%eax
c0598: 0f 95 c0 setne %al
c059b: 41 20 c6 and %al,%r14b
With some rudimentary automation built around repeatedly booting that image under propolis, I've observed systemd-involved segfaults now on both Intel (Ivy Bridge) and AMD (Rome) hardware. As of the time of this writing, the metrics from the tests are as follows:
AMD: 446 tests - 4 segfaults - 24 boot timeouts
uart-1688683832.log:[ 4.589254] systemd[1]: segfault at 8 ip 000056360a6802e1 sp 00007ffefe259d50 error 4 in systemd[56360a5f4000+e0000]
uart-1688683904.log:[ 10.626955] systemd[1]: segfault at 580 ip 000056481592d577 sp 00007ffcaeeee9c0 error 4 in systemd[5648158a3000+e0000]
uart-1688685336.log:[ 7.870240] systemd[1]: segfault at b8 ip 000055d397d33fa7 sp 00007ffd1268fb40 error 4 in systemd[55d397ce8000+e0000]
uart-1688690483.log:[ 4.849956] systemd[1]: segfault at 0 ip 00007fe1f4d26089 sp 00007ffd32098e20 error 4 in libsystemd-shared-249.so[7fe1f4c4b000+1a9000]
Intel: 212 tests - 4 segfaults - 52 boot timeouts
uart-1688689213.log:[ 25.843751] systemd[1]: segfault at 240 ip 000055c2efdd16b1 sp 00007fffa2b14500 error 4 in systemd[55c2efd86000+e0000]
uart-1688689706.log:[ 15.595984] systemd[1]: segfault at 0 ip 00007fc21ec1e089 sp 00007fff79488df0 error 4 in libsystemd-shared-249.so[7fc21eb43000+1a9000]
uart-1688700517.log:[ 16.210273] systemd[1]: segfault at 10 ip 00007fe1b6165a9e sp 00007fff58ca01a0 error 4 in libc.so.6[7fe1b60ea000+195000]
uart-1688701468.log:[ 12.719193] systemd[1]: segfault at 0 ip 00007fb33f6ac089 sp 00007ffd9b427b80 error 4 in libsystemd-shared-249.so[7fb33f5d1000+1a9000]
The check for boot timeouts is extremely crude: Guests which do not appear to reach the login prompt (in a way the harness recognizes) within 150 seconds of being started are considered "timed out" and are killed.
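For context, that check amounts to something like the following (a minimal sketch rather than the actual harness; the "login:" marker and the polling interval are assumptions):

```rust
use std::time::{Duration, Instant};

/// Crude boot-success check: poll the guest's serial log until a login
/// prompt shows up, or give up after the deadline (the guest is then
/// killed and counted as a "boot timeout").
fn guest_booted(uart_log: &std::path::Path, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    while Instant::now() < deadline {
        if let Ok(contents) = std::fs::read_to_string(uart_log) {
            // "login:" is an assumed marker for reaching the prompt.
            if contents.contains("login:") {
                return true;
            }
        }
        std::thread::sleep(Duration::from_secs(1));
    }
    false
}
```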
Updates from this weekend: I updated my testing infrastructure to replicate the same conditions with C-bhyve. Running there, after hundreds of boot attempts, no hangs or systemd segfaults were observed.
Further testing was done with propolis-standalone, swapping in virtio-block for the boot disk instead of nvme. (The cloudinit disk has always been virtio-block in my test setup.) After hundreds of boots, no hangs or systemd segfaults occurred. I switched back to nvme for the boot disk, and after fewer than 100 boots there were multiple recorded cases of systemd segfaults and hangs (where the system would fail to reach the login prompt within 150 seconds).
Following this, I added some additional probes to the nvme logic in hopes of finding something which "looked off". Comparing traces from successful boots to those where segfaults occurred has not revealed any smoking guns as of yet.
At Robert's suggestion, I ran my repro setup with the NVMe emulation modified to report a limit of 1 IO queue (really a single SQ/CQ pair). Doing so brings it closer in line with how virtio-block behaves, since virtio-block has only the single virtqueue for requests. Even with this arbitrary limit in place for nvme (confirmed via dtrace, where IO requests solely bore a qid of 1), the segfaults and hangs were still present.
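For reference, the mechanism behind that limit (a sketch of the general approach, not the actual Propolis change): the NVMe Set Features "Number of Queues" feature (FID 0x07) returns the allocated queue counts, zero-based, in dword 0 of the completion, so granting a single SQ/CQ pair means returning 0 in both fields regardless of what the guest requested.

```rust
/// Sketch: build CDW0 for a Set Features (Number of Queues, FID 0x07)
/// completion that grants exactly one SQ/CQ pair. Both fields are
/// zero-based per the NVMe spec, so 0 means one queue allocated.
fn num_queues_cdw0_single_pair(_requested_sq: u16, _requested_cq: u16) -> u32 {
    let nsqa: u32 = 0; // Number of SQs Allocated (bits 15:0), zero-based
    let ncqa: u32 = 0; // Number of CQs Allocated (bits 31:16), zero-based
    (ncqa << 16) | nsqa
}
```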
I ran a 50-reboot loop of an Ubuntu 22.04 image in PHD that hit several segfaults (including one systemd pid-1 segfault) and boot timeouts. Some of the other victims were multipathd and fsck; at least the former failure is correlated with the boot timing out. Fortunately, I was able to extract the disk from this VM and so get access to all the juicy core dumps it saved.
The systemd version in my guest is
ubuntu@ubuntu:/$ systemd --version
systemd 249 (249.11-0ubuntu3.9)
+PAM +AUDIT +SELINUX +APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 -PWQUALITY -P11KIT -QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD -XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified
The systemd core dump shows a failure at the following stack:
(gdb) bt
#0 0x00007f14b33e075b in kill () at ../sysdeps/unix/syscall-template.S:120
#1 0x000055f404852b1c in crash (sig=11) at ../src/core/main.c:257
#2 <signal handler called>
#3 transaction_add_job_and_dependencies (tr=tr@entry=0x55f4064590b0, type=type@entry=JOB_START, unit=0x55f4063c7170, by=by@entry=0x55f4064ddc10, matters=matters@entry=true, conflicts=conflicts@entry=false, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420) at ../src/core/transaction.c:924
#4 0x000055f4048d1478 in transaction_add_job_and_dependencies (tr=tr@entry=0x55f4064590b0, type=type@entry=JOB_START, unit=<optimized out>, by=by@entry=0x55f4063f2170, matters=matters@entry=true, conflicts=conflicts@entry=false, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420)
at ../src/core/transaction.c:1005
#5 0x000055f4048d1478 in transaction_add_job_and_dependencies (tr=tr@entry=0x55f4064590b0, type=type@entry=JOB_START, unit=<optimized out>, by=by@entry=0x55f4063e7c70, matters=matters@entry=false, conflicts=conflicts@entry=false, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420)
at ../src/core/transaction.c:1005
#6 0x000055f4048d15a4 in transaction_add_job_and_dependencies (tr=tr@entry=0x55f4064590b0, type=type@entry=JOB_START, unit=<optimized out>, by=by@entry=0x55f4064da340, matters=matters@entry=true, conflicts=conflicts@entry=false, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420)
at ../src/core/transaction.c:1015
#7 0x000055f4048d1478 in transaction_add_job_and_dependencies (tr=0x55f4064590b0, type=<optimized out>, unit=<optimized out>, by=<optimized out>, matters=<optimized out>, conflicts=<optimized out>, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420) at ../src/core/transaction.c:1005
#8 0x000055f404896120 in manager_add_job (m=0x55f406341100, type=JOB_START, unit=0x55f40651ee40, mode=JOB_REPLACE, affected_jobs=0x0, error=0x7ffd3ba36420, ret=0x0) at ../src/core/manager.c:1858
#9 0x000055f404867cd2 in signal_activation_request (message=0x55f40653f040, userdata=0x55f406341100, ret_error=<optimized out>) at ../src/core/dbus.c:180
#10 0x00007f14b3846dd4 in bus_match_run (bus=0x55f4064c1a50, node=0x55f4065125b0, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:318
#11 0x00007f14b384698f in bus_match_run (bus=0x55f4064c1a50, node=0x55f406512570, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:429
#12 0x00007f14b384696a in bus_match_run (bus=0x55f4064c1a50, node=0x55f406512530, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:407
#13 0x00007f14b384698f in bus_match_run (bus=0x55f4064c1a50, node=0x55f4065124f0, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:429
#14 0x00007f14b384696a in bus_match_run (bus=0x55f4064c1a50, node=0x55f4065124b0, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:407
#15 0x00007f14b384698f in bus_match_run (bus=0x55f4064c1a50, node=0x55f406512470, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:429
#16 0x00007f14b384696a in bus_match_run (bus=0x55f4064c1a50, node=0x55f40650f4d0, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:407
#17 0x00007f14b384698f in bus_match_run (bus=0x55f4064c1a50, node=0x55f40650f490, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:429
#18 0x00007f14b384696a in bus_match_run (bus=0x55f4064c1a50, node=0x55f40650f450, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:407
#19 0x00007f14b384698f in bus_match_run (bus=0x55f4064c1a50, node=0x55f40650f410, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:429
#20 0x00007f14b3846903 in bus_match_run (bus=0x55f4064c1a50, node=0x55f4064c7210, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:417
#21 0x00007f14b3846e3b in bus_match_run (bus=0x55f4064c1a50, node=0x55f4064c1ac8, m=0x55f40653f040) at ../src/libsystemd/sd-bus/bus-match.c:274
#22 0x00007f14b38660c8 in process_match (m=<optimized out>, bus=<optimized out>) at ../src/libsystemd/sd-bus/sd-bus.c:2840
#23 process_match (bus=bus@entry=0x55f4064c1a50, m=m@entry=0x55f40653f040) at ../src/libsystemd/sd-bus/sd-bus.c:2831
#24 0x00007f14b3869643 in process_message (m=0x55f40653f040, bus=0x55f4064c1a50) at ../src/libsystemd/sd-bus/sd-bus.c:2955
#25 process_running (ret=0x0, bus=0x55f4064c1a50) at ../src/libsystemd/sd-bus/sd-bus.c:3005
#26 bus_process_internal (bus=bus@entry=0x55f4064c1a50, ret=ret@entry=0x0) at ../src/libsystemd/sd-bus/sd-bus.c:3225
#27 0x00007f14b3869b09 in sd_bus_process (bus=bus@entry=0x55f4064c1a50, ret=ret@entry=0x0) at ../src/libsystemd/sd-bus/sd-bus.c:3252
#28 0x00007f14b386a0f1 in io_callback (s=<optimized out>, fd=<optimized out>, revents=<optimized out>, userdata=0x55f4064c1a50) at ../src/libsystemd/sd-bus/sd-bus.c:3603
#29 0x00007f14b38bba10 in source_dispatch (s=s@entry=0x55f40650c110) at ../src/libsystemd/sd-event/sd-event.c:3526
#30 0x00007f14b38bbcfd in dispatch_exit (e=<optimized out>) at ../src/libsystemd/sd-event/sd-event.c:3686
#31 sd_event_dispatch (e=e@entry=0x55f406349f00) at ../src/libsystemd/sd-event/sd-event.c:4103
#32 0x00007f14b38bd2b8 in sd_event_run (e=0x55f406349f00, timeout=18446744073709551615) at ../src/libsystemd/sd-event/sd-event.c:4171
#33 0x000055f404897a9b in manager_loop (m=0x55f406341100) at ../src/core/manager.c:3016
#34 0x000055f404850332 in invoke_main_loop (ret_error_message=0x7ffd3ba36ee0, ret_switch_root_init=<synthetic pointer>, ret_switch_root_dir=<synthetic pointer>, ret_fds=0x7ffd3ba36ed0, ret_shutdown_verb=<synthetic pointer>, ret_retval=<synthetic pointer>, ret_reexecute=<synthetic pointer>,
saved_rlimit_memlock=0x7ffd3ba36f10, saved_rlimit_nofile=0x7ffd3ba36f20, m=0x55f406341100) at ../src/core/main.c:1897
#35 main (argc=1, argv=0x7ffd3ba371a8) at ../src/core/main.c:2906
Here is the source file from what I think is the correct version of systemd for the image above: https://git.launchpad.net/~ubuntu-core-dev/ubuntu/+source/systemd/tree/src/core/transaction.c?id=a7ad4a9fc708500c61e3b8127f112d8c90049b2c
The failing line in frame 3 is an attempt to dereference unit->manager, and sure enough we have
(gdb) frame 3
#3 transaction_add_job_and_dependencies (tr=tr@entry=0x55f4064590b0, type=type@entry=JOB_START, unit=0x55f4063c7170, by=by@entry=0x55f4064ddc10, matters=matters@entry=true, conflicts=conflicts@entry=false, ignore_requirements=false, ignore_order=false, e=0x7ffd3ba36420) at ../src/core/transaction.c:924
924 in ../src/core/transaction.c
(gdb) print unit->manager
$2 = (Manager *) 0x0
(gdb)
In fact the entirety of unit seems to be zeroed:
(gdb) print *unit
$3 = {manager = 0x0, type = UNIT_SERVICE, load_state = UNIT_STUB, merged_into = 0x0, id = 0x0, instance = 0x0, aliases = 0x0, dependencies = 0x0, requires_mounts_for = 0x0, description = 0x0, documentation = 0x0, fragment_path = 0x0, source_path = 0x0, dropin_paths = 0x0, fragment_not_found_timestamp_hash = 0,
fragment_mtime = 0, source_mtime = 0, dropin_mtime = 0, transient_file = 0x0, pending_freezer_message = 0x0, freezer_state = FREEZER_RUNNING, job_timeout_action = EMERGENCY_ACTION_NONE, job_timeout = 0, job_running_timeout = 0, job_timeout_reboot_arg = 0x0, job = 0x0, nop_job = 0x0, match_bus_slot = 0x0,
get_name_owner_slot = 0x0, bus_track = 0x0, deserialized_refs = 0x0, refs_by_target = 0x0, conditions = 0x0, asserts = 0x0, condition_timestamp = {realtime = 0, monotonic = 0}, assert_timestamp = {realtime = 0, monotonic = 0}, state_change_timestamp = {realtime = 0, monotonic = 0}, inactive_exit_timestamp = {
realtime = 0, monotonic = 0}, active_enter_timestamp = {realtime = 0, monotonic = 0}, active_exit_timestamp = {realtime = 0, monotonic = 0}, inactive_enter_timestamp = {realtime = 0, monotonic = 0}, units_by_type_next = 0x0, units_by_type_prev = 0x0, load_queue_next = 0x0, load_queue_prev = 0x0,
dbus_queue_next = 0x0, dbus_queue_prev = 0x0, cleanup_queue_next = 0x0, cleanup_queue_prev = 0x0, gc_queue_next = 0x0, gc_queue_prev = 0x0, cgroup_realize_queue_next = 0x0, cgroup_realize_queue_prev = 0x0, cgroup_empty_queue_next = 0x0, cgroup_empty_queue_prev = 0x0, cgroup_oom_queue_next = 0x0,
cgroup_oom_queue_prev = 0x0, target_deps_queue_next = 0x0, target_deps_queue_prev = 0x0, stop_when_unneeded_queue_next = 0x0, stop_when_unneeded_queue_prev = 0x0, start_when_upheld_queue_next = 0x0, start_when_upheld_queue_prev = 0x0, stop_when_bound_queue_next = 0x0, stop_when_bound_queue_prev = 0x0,
pids = 0x0, sigchldgen = 0, notifygen = 0, gc_marker = 0, load_error = 0, start_ratelimit = {interval = 0, burst = 0, num = 0, begin = 0}, start_limit_action = EMERGENCY_ACTION_NONE, markers = 0, success_action = EMERGENCY_ACTION_NONE, failure_action = EMERGENCY_ACTION_NONE, success_action_exit_status = 0,
failure_action_exit_status = 0, reboot_arg = 0x0, auto_start_stop_ratelimit = {interval = 0, burst = 0, num = 0, begin = 0}, ref_uid = 0, ref_gid = 0, unit_file_state = UNIT_FILE_ENABLED, unit_file_preset = 0, cpu_usage_base = 0, cpu_usage_last = 0, managed_oom_kill_last = 0, oom_kill_last = 0,
io_accounting_base = {0, 0, 0, 0}, io_accounting_last = {0, 0, 0, 0}, cgroup_path = 0x0, cgroup_realized_mask = 0, cgroup_enabled_mask = 0, cgroup_invalidated_mask = 0, cgroup_members_mask = 0, cgroup_control_inotify_wd = 0, cgroup_memory_inotify_wd = 0, bpf_device_control_installed = 0x0,
ip_accounting_ingress_map_fd = 0, ip_accounting_egress_map_fd = 0, ip_accounting_extra = {0, 0, 0, 0}, ipv4_allow_map_fd = 0, ipv6_allow_map_fd = 0, ipv4_deny_map_fd = 0, ipv6_deny_map_fd = 0, ip_bpf_ingress = 0x0, ip_bpf_ingress_installed = 0x0, ip_bpf_egress = 0x0, ip_bpf_egress_installed = 0x0,
ip_bpf_custom_ingress = 0x0, ip_bpf_custom_ingress_installed = 0x0, ip_bpf_custom_egress = 0x0, ip_bpf_custom_egress_installed = 0x0, bpf_foreign_by_key = 0x0, initial_socket_bind_link_fds = 0x0, rewatch_pids_event_source = 0x0, on_success_job_mode = JOB_FAIL, on_failure_job_mode = JOB_FAIL,
collect_mode = COLLECT_INACTIVE, invocation_id = {bytes = '\000' <repeats 15 times>, qwords = {0, 0}}, invocation_id_string = '\000' <repeats 32 times>, stop_when_unneeded = false, default_dependencies = false, refuse_manual_start = false, refuse_manual_stop = false, allow_isolate = false,
ignore_on_isolate = false, condition_result = false, assert_result = false, transient = false, perpetual = false, in_load_queue = false, in_dbus_queue = false, in_cleanup_queue = false, in_gc_queue = false, in_cgroup_realize_queue = false, in_cgroup_empty_queue = false, in_cgroup_oom_queue = false,
in_target_deps_queue = false, in_stop_when_unneeded_queue = false, in_start_when_upheld_queue = false, in_stop_when_bound_queue = false, sent_dbus_new_signal = false, job_running_timeout_set = false, in_audit = false, on_console = false, cgroup_realized = false, cgroup_members_mask_valid = true,
reset_accounting = false, start_limit_hit = false, coldplugged = false, bus_track_add = false, exported_invocation_id = false, exported_log_level_max = false, exported_log_extra_fields = false, exported_log_ratelimit_interval = false, exported_log_ratelimit_burst = false,
warned_clamping_cpu_quota_period = false, last_section_private = 0}
I'm disinclined to suspect that this zeroing came from the storage stack, though, for two reasons. One is that there are nonzero bytes on this page, suggesting that it wasn't paged out and then read back as a page of zeroes. The other is that every indication I know of (/etc/fstab, /proc/swaps, and /proc/meminfo) is that this guest has no swap device set up. I think this means there's no way that the process's data could have been paged out and then misread by the NVMe device.
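As a sanity check, the no-swap claim is quick to confirm programmatically, e.g. by reading SwapTotal out of /proc/meminfo (a trivial sketch):

```rust
/// Returns SwapTotal from /proc/meminfo in kB; 0 means no swap configured.
fn swap_total_kb() -> Option<u64> {
    std::fs::read_to_string("/proc/meminfo")
        .ok()?
        .lines()
        .find(|l| l.starts_with("SwapTotal:"))?
        .split_whitespace()
        .nth(1)? // the numeric field in "SwapTotal:       0 kB"
        .parse()
        .ok()
}
```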
I can confirm that the problem doesn't repro with a virtio disk, so I think continuing to focus on what's different with NVMe makes sense, but I'm not totally convinced this is the storage device causing data corruption vs. something about our environment that happens to trigger a systemd bug only when an NVMe device is present. (Running with a virtual NVMe device, for example, might change the amount of time it takes for certain filesystem startup daemons to run, which might trigger a race in systemd.) I'll think about what I might be able to do to provoke these kinds of races if they exist.
Along those lines, I did a test run with all four of the stock Propolis NVMe USDT probes (read/write enqueue/completion) enabled with a chill action that paused for up to 100us. This definitely slowed down the test, but I don't think it meaningfully changed the test failure rate.
Unrelated to the Ubuntu cores specifically: the other day I also ran an overnight fio test on a virtual NVMe device with verification options enabled. This didn't produce any verification errors.
Another tiny observation: changing between a virtio device and an NVMe device changes some of the device paths that will be referred to in multipathd, which is also a frequent segfaulter (2 of the 3 cores in my last test run).
If I just run sudo multipath repeatedly from an ssh session I see it explode in many strange and wonderful ways (Inconsistency detected by ld.so: dl-close.c: 184: _dl_close_worker: Assertion `idx == nloaded' failed!, malloc(): unsorted double linked list corrupted).
I can repro the boot timeouts after changing my test VM configuration to have a virtio boot disk and a blank NVMe disk (i.e. a setup where we're not reading any of the OS data off the NVMe drive, but the NVMe driver still has to be loaded and any services that care about attached disks have to reckon with the fact that it's there).
After several reports of multipathd being on the scene when instances failed their boot, often due to timeouts, I decided to take a closer look. I reset the underlying disk image to ensure that it had not suffered any persistent corruption from prior testing.
With the boot-loop reproducer running, I observed a time-out case where it was the multipath service which failed and prevented forward progress. I booted the instance up manually after that, and went to inspect multipathd itself. Upon running it, I was greeted with a surprise: the linker was unable to load /lib/libmultipath.so.0. Inspection with hexdump showed that the first 0x200 bytes were zero. Using the debugfs tool, I queried the extents/blocks which hold that file and confirmed, while the instance was still running, that the blocks in the underlying zvol were still valid (they still bore the proper ELF header, etc.).
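The verification itself was done by hand with debugfs against the raw device, but it reduces to something like this sketch (the block number and filesystem block size reported by debugfs are assumed inputs):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Sketch: read the start of a filesystem block straight off the backing
/// device and check whether the ELF magic is still intact there.
fn block_bears_elf_magic(dev: &str, block: u64, block_size: u64) -> std::io::Result<bool> {
    let mut f = File::open(dev)?;
    f.seek(SeekFrom::Start(block * block_size))?;
    let mut magic = [0u8; 4];
    f.read_exact(&mut magic)?;
    Ok(&magic == b"\x7fELF")
}
```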
I repeated testing like this several times manually, with multipathd working intermittently. Simultaneously, @gjcolombo spun up a test where multipathd was disabled in systemd, which seemed to mitigate the issue at hand.
For full coverage, I instrumented the admin command path (in addition to the probes I had in the nvme data path), and ran multipathd manually during a boot where it was not stricken by a corrupted page in its supporting library. I could see that, as part of querying disks during startup, multipathd caused an Identify command to be issued to the admin queue on the root device. Printing the details of that Identify, after reading the emulation code, made the problem clear:
ADMIN CMD: RawSubmission {
cdw0: 807010310,
nsid: 0,
rsvd: 0,
mptr: 0,
prp1: 4483852544,
prp2: 4484657152,
cdw10: 1,
cdw11: 0,
cdw12: 0,
cdw13: 0,
cdw14: 0,
cdw15: 0,
}
Notable is that both prp1 and prp2 are non-zero. Converted to hex, the situation is clearer:
prp1: 10b422900
prp2: 10b4e7000
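To make the overflow concrete, the arithmetic works out as follows (a small worked example, assuming the usual 4KiB page size):

```rust
fn main() {
    const PAGE: u64 = 4096;
    let prp1: u64 = 0x1_0b42_2900;
    // prp1 sits 0x900 bytes into its page, so only 0x700 bytes of the
    // 4KiB Identify payload fit before the page boundary; the remaining
    // 0x900 bytes must land in the (page-aligned) buffer at prp2.
    let offset = prp1 & (PAGE - 1);    // 0x900
    let first_len = PAGE - offset;     // 0x700 bytes fit at prp1
    let second_len = PAGE - first_len; // 0x900 bytes spill over to prp2
    assert_eq!((first_len, second_len), (0x700, 0x900));
}
```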
Even though the Identify command is expected to emit only a single page (4KB) of data, because the first PRP is offset into its page (legal per the spec), the portion of the data which would otherwise cross that page boundary must be copied into the second PRP (which must be page-aligned) instead. If we look at the handling for those copies, the problem becomes apparent:
https://github.com/oxidecomputer/propolis/blob/1598c84a7a9d7684ad14ecd4b83b581a3bc182a6/lib/propolis/src/hw/nvme/admin.rs#L211-L215
This code expects a single GuestRegion to be emitted from the PrpIter representing the command output buffer(s). It writes the entire page-long response into that address. Because the guest, in this case, was passing a buffer with an offset, we will overflow by whatever that offset is into the subsequent page.
This logic (and, as pointed out by @luqmana, the logic in GetLogPage) needs to be updated to handle all of the buffers described by the PRPs.
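For illustration, the shape the fix needs to take looks roughly like this (a sketch only; GuestRegion and write_guest here are stand-ins, not the actual Propolis API):

```rust
/// Stand-in for the guest-physical region type yielded by the PRP iterator.
struct GuestRegion {
    addr: u64,
    len: usize,
}

/// Stand-in for copying bytes into guest memory at a physical address.
fn write_guest(addr: u64, bytes: &[u8]) {
    let _ = (addr, bytes); // real implementation would DMA into the guest
}

/// Spread the response across every region the PRPs describe, rather
/// than writing the whole payload to the first region (the bug here).
fn write_admin_response(regions: impl Iterator<Item = GuestRegion>, mut data: &[u8]) {
    for region in regions {
        if data.is_empty() {
            break;
        }
        let n = region.len.min(data.len());
        write_guest(region.addr, &data[..n]);
        data = &data[n..];
    }
    debug_assert!(data.is_empty(), "PRPs did not cover the whole response");
}
```

Applied to the Identify above, the first iteration would cover the 0x700 bytes up to the page boundary at prp1, and the second would place the remaining 0x900 bytes at prp2.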
The issue is sporadic - it only happened on one of the 13 identical 4-vCPU / 16 GB instances I created using the same jammy cloud image (but several other instances failed at different points of guest initialization; I will file other tickets for each of those failure modes).
Here is the console log as seen in the rack2 Console UI (https://recovery.sys.rack2.eng.oxide.computer/projects/try/instances/sysbench-mysql-3/serial-console)
The propolis zone is located in sled BRM42220014 (rack2, cubby 16). I've copied the propolis log file to: catacomb.eng.oxide.computer:/data/staff/dogfood/jun-03/system-illumos-propolis-server_vm-738d9902-01a8-4343-8593-c687a8782e69.log