oxidecomputer / propolis

VMM userspace for illumos bhyve
Mozilla Public License 2.0

Attempting to boot a helios VM led to a panic in propolis #207

Closed: leftwo closed this issue 1 week ago

leftwo commented 1 year ago

While trying to boot a helios image on propolis, propolis itself panicked. From the propolis-server log:

Aug 29 21:35:05.666 INFO unhandled PIO, bytes: 1, port: 45025, op: out, sync_task: vcpu-0, component: vmm                                                                                                                         
Aug 29 21:35:08.842 INFO unhandled PIO, bytes: 1, port: 1012, op: in, sync_task: vcpu-0, component: vmm                                                                                                                           
Aug 29 21:35:08.842 INFO unhandled PIO, bytes: 1, port: 890, op: out, sync_task: vcpu-0, component: vmm                                                                                                                           
Aug 29 21:35:08.843 INFO unhandled PIO, bytes: 1, port: 890, op: in, sync_task: vcpu-0, component: vmm                                                                                                                            
Aug 29 21:35:08.843 INFO unhandled PIO, bytes: 1, port: 889, op: in, sync_task: vcpu-0, component: vmm                                                                                                                            
Aug 29 21:35:09.137 INFO unhandled PIO, bytes: 2, port: 145, op: in, sync_task: vcpu-1, component: vmm
thread 'vcpu-1' panicked at 'source slice length (1) does not match destination slice length (2)', lib/propolis/src/common.rs:156:29
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
[ Aug 29 14:35:09 Stopping because all processes in service exited. ]
> 0x37::findstack -v ! demangle
stack pointer for thread 55 [vcpu-1]: fffffc7fe07ff770
[ fffffc7fe07ff770 libc.so.1`_lwp_kill+0xa() ]
  fffffc7fe07ff7a0 libc.so.1`raise+0x1e(6)
  fffffc7fe07ff7f0 libc.so.1`abort+0x58()
  fffffc7fe07ff800 panic_abort::__rust_start_panic::abort::h51a8d58b18c16359+9()
  fffffc7fe07ff810 __rust_start_panic+9()
  fffffc7fe07ff8d0 rust_panic+0x1c()
  fffffc7fe07ff9a0 std::panicking::rust_panic_with_hook::heb92177829641f0d+0x1d8()
  fffffc7fe07ff9f0 std::panicking::begin_panic_handler::{{closure}}::h5c5e5cd8ca29cc02+0xb5()
  fffffc7fe07ffa00 std::sys_common::backtrace::__rust_end_short_backtrace::hf62993b7ca58fe14+0x17()
  fffffc7fe07ffa40 rust_begin_unwind+0x4a()
  fffffc7fe07ffa80 core::panicking::panic_fmt::hf432d58fa631f736+0x33()
  fffffc7fe07ffaf0 core::slice::<impl [T]>::copy_from_slice::len_mismatch_fail::h58bbecf6b0c0b8be+0x66()
  fffffc7fe07ffb20 propolis::common::ReadOp::write_bytes::h0eaa31f76be8085d+0xf0()
  fffffc7fe07ffb40 propolis::hw::chipset::i440fx::Piix3Lpc::attach::{{closure}}::h87615dc9c3f089b1+0x89()
  fffffc7fe07ffc50 propolis::pio::PioBus::handle_in::h3d9ec86e614a6d9d+0x2b3()
  fffffc7fe07ffe20 propolis::vcpu_run_loop::hbd393684c880ff89+0x556()
  fffffc7fe07ffe50 core::ops::function::FnOnce::call_once{{vtable.shim}}::hb787730e3d3e3a32+0x1e()
  fffffc7fe07ffee0 std::sys_common::backtrace::__rust_begin_short_backtrace::ha198b9610f5e8248+0x90()
  fffffc7fe07fff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::h27ac63f13065684d+0x95()
  fffffc7fe07fffb0 std::sys::unix::thread::Thread::new::thread_start::h30db11070d9e87f0+0x27()
  fffffc7fe07fffe0 libc.so.1`_thrp_setup+0x6c(fffffc7feecfa240)
  fffffc7fe07ffff0 libc.so.1`_lwp_start()

Two core files are available on catacomb.eng.oxide.computer in /data/staff/core/folgers/20220829:

core.oxz_propolis-server_d5b7a3eb-c689-4658-b930-6348c88dc0ad.propolis-server.11042
core.oxz_propolis-server_d7685106-8a52-4dc5-8fad-930dd68dbf16.propolis-server.9200

The guest VM also panics, and its panic happens around the same time as the propolis panic, so I'm not 100% sure which one happens first.
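For reference, the first panic comes from `copy_from_slice`, which aborts if the source and destination slices differ in length, so a PIO read handler that fills fewer bytes than the guest's `in` instruction requested would trip exactly this check in `ReadOp::write_bytes`. Below is a minimal, hypothetical sketch of that failure mode (a stand-in for `propolis::common::ReadOp`, not the actual propolis code):

```rust
/// Hypothetical stand-in for propolis::common::ReadOp: a view into the
/// buffer that will satisfy the guest's PIO read.
struct ReadOp<'a> {
    buf: &'a mut [u8],
}

impl<'a> ReadOp<'a> {
    fn write_bytes(&mut self, data: &[u8]) {
        // copy_from_slice panics unless both slices have the same length,
        // producing the message seen in the log:
        // "source slice length (1) does not match destination slice length (2)"
        self.buf.copy_from_slice(data);
    }
}

fn main() {
    let mut guest_buf = [0u8; 2]; // the vcpu issued a 2-byte `in`
    let mut ro = ReadOp { buf: &mut guest_buf };
    ro.write_bytes(&[0xff]); // the device produced only 1 byte -> panic
}
```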

leftwo commented 1 year ago
Sep 02 18:46:54.889 ERRO slog-async: logger dropped messages due to channel overflow, count: 29, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 INFO unhandled PIO, bytes: 1, port: 4, op: in, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 78, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 37, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 35, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 35, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 41, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 37, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 35, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.890 ERRO slog-async: logger dropped messages due to channel overflow, count: 29, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.891 ERRO slog-async: logger dropped messages due to channel overflow, count: 23, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.891 ERRO slog-async: logger dropped messages due to channel overflow, count: 23, sync_task: vcpu-1, component: vmm
thread 'vcpu-1' panicked at 'called `Result::unwrap()` on an `Err` value: Os { code: 14, kind: Uncategorized, message: "Bad address" }', lib/propolis/src/lib.rs:50:42
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Sep 02 18:46:54.891 ERRO slog-async: logger dropped messages due to channel overflow, count: 33, sync_task: vcpu-1, component: vmm
Sep 02 18:46:54.891 ERRO slog-async: logger dropped messages due to channel overflow, count: 21, sync_task: vcpu-1, component: vmm
[ Sep  2 11:46:55 Stopping because all processes in service exited. ]
[ Sep  2 11:46:55 Executing stop method (:kill). ]

This looks like it's in vcpu_run_loop(), from the core:

--------------- thread# 55 / lwp# 55 [vcpu-1] ----------------
 fffffc7fef1582ca _lwp_kill () + a
 fffffc7fef0ec21e raise (6) + 1e
 fffffc7fef0c5e78 abort () + 58
 0000000001c38d39 panic_abort::__rust_start_panic::abort::h51a8d58b18c16359 () + 9
 0000000001c38d19 __rust_start_panic () + 9
 0000000001c2617c rust_panic () + 1c
 0000000001c25ee8 std::panicking::rust_panic_with_hook::heb92177829641f0d () + 1d8
 0000000001c25cf5 std::panicking::begin_panic_handler::{{closure}}::h5c5e5cd8ca29cc02 () + b5
 0000000001c22f67 std::sys_common::backtrace::__rust_end_short_backtrace::hf62993b7ca58fe14 () + 17
 0000000001c25a3a rust_begin_unwind () + 4a
 0000000001c78453 core::panicking::panic_fmt::hf432d58fa631f736 () + 33
 0000000001c78735 core::result::unwrap_failed::h7f371f070e6d974c () + 75
 000000000135d7e3 propolis::vcpu_run_loop::hbd393684c880ff89 () + 743
 000000000134f0de core::ops::function::FnOnce::call_once{{vtable.shim}}::hb787730e3d3e3a32 () + 1e
 00000000012fba20 std::sys_common::backtrace::__rust_begin_short_backtrace::ha198b9610f5e8248 () + 90
 00000000012e5da5 core::ops::function::FnOnce::call_once{{vtable.shim}}::h27ac63f13065684d () + 95
 0000000001c2e6f7 std::sys::unix::thread::Thread::new::thread_start::h30db11070d9e87f0 () + 27
 fffffc7fef150eac _thrp_setup (fffffc7feecfa240) + 6c
 fffffc7fef1511c0 _lwp_start ()
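For reference, errno 14 is EFAULT ("Bad address"), so this looks like an ioctl on the vmm device returning an error that gets `.unwrap()`ed rather than handled. A minimal, hypothetical sketch of how that surfaces as the panic above (not the actual propolis code; the `vm_entry` name is made up):

```rust
use std::io::Error;

/// Hypothetical stand-in for the VM-run ioctl against the bhyve vmm
/// device; assume the kernel rejected a bad pointer with EFAULT.
fn vm_entry() -> Result<(), Error> {
    Err(Error::from_raw_os_error(14)) // EFAULT: "Bad address"
}

fn main() {
    // Panics with: called `Result::unwrap()` on an `Err` value:
    // Os { code: 14, kind: Uncategorized, message: "Bad address" }
    vm_entry().unwrap();
}
```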

To reproduce this panic, I create an instance from a helios image and start it; the guest then panics like this:

Loading unix...                                                                 
Loading /platform/i86pc/amd64/boot_archive...                                   
Loading /platform/i86pc/amd64/boot_archive.hash...                              
Booting...                                                                      
Oxide Helios Version helios-1.0.21092 64-bit                                    
NOTICE: Performing full ZFS device scan!                                        
NOTICE: Cannot read the pool label from '/pseudo/lofi@1:b'                      
NOTICE: spa_import_rootpool: error 5                                            
Cannot mount root on /pseudo/lofi@1:b fstype zfs                                

panic[cpu0]/thread=fffffffffbc49060: vfs_mountroot: cannot mount root           

Warning - stack not written to the dump buffer                                  
fffffffffbc892d0 fffffffffbae8c07 ()                                            
fffffffffbc89310 genunix:main+137 ()                                            
fffffffffbc89320 unix:_locore_start+90 ()                                       

skipping system dump - no dump device configured                                
rebooting...

I just let it keep rebooting, and after some unknown number of reboots propolis hits the panic. It took less than 30 minutes, but I was not paying close attention to the time, so it could have been much less than that.

New core is here:

/data/staff/core/folgers/20220902/core.oxz_propolis-server_1e9c8722-2439-4fc0-984b-c1ef5322e28e.propolis-server.11976

pfmooney commented 1 week ago

Closing this out, since I believe it was addressed in #306