oxidecomputer / management-gateway-service

Crates shared between MGS in omicron and its agent task in hubris
Mozilla Public License 2.0
3 stars 3 forks source link

Fix crash during Tokio runtime teardown #275

Closed mkeeter closed 3 weeks ago

mkeeter commented 3 weeks ago

While testing #274, I saw occasional panics during shutdown:

matt@jeeves ~ () $ env RUST_BACKTRACE=1 pfexec ./faux-mgs --discovery-addr='[fe80::aa40:25ff:fe05:100]:11111' --interface madrid_sw0tp0 monorail unlock --list
Aug 28 15:56:50.742 INFO creating SP handle on interface madrid_sw0tp0, component: faux-mgs
ecdsa-sha2-nistp256 AAAAE2VjZHNhLXNoYTItbmlzdHAyNTYAAAAIbmlzdHAyNTYAAABBBN7s3QT4S6Xmtqgd3z2/8LGOMk601NRGtKAehoDWki7AfVXZGvca33dvsi5so0EkndJhi+rwDNB0KxzeQq6iKY0= PIV_slot_9A /CN=oxide-support-001-auth
done
thread 'tokio-runtime-worker' panicked at gateway-sp-comms/src/shared_socket.rs:362:32:
recv() task died
stack backtrace:
   0: rust_begin_unwind
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/std/src/panicking.rs:597:5
   1: core::panicking::panic_fmt
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:72:14
   2: core::panicking::panic_display
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:168:5
   3: core::panicking::panic_str
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/panicking.rs:152:5
   4: core::option::expect_failed
             at /rustc/a28077b28a02b92985b3a3faecf92813155f1ea1/library/core/src/option.rs:1988:5
   5: <gateway_sp_comms::shared_socket::SingleSpHandle as gateway_sp_comms::single_sp::InnerSocket>::recv::{{closure}}
   6: <tokio::future::poll_fn::PollFn<F> as core::future::future::Future>::poll
   7: gateway_sp_comms::single_sp::Inner<T>::rpc_call::{{closure}}
   8: gateway_sp_comms::single_sp::Inner<T>::discover::{{closure}}
   9: gateway_sp_comms::single_sp::Inner<T>::run::{{closure}}
  10: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  11: tokio::runtime::task::core::Core<T,S>::poll
  12: tokio::runtime::task::harness::Harness<T,S>::poll
  13: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  14: tokio::runtime::scheduler::multi_thread::worker::Context::run
  15: tokio::runtime::context::runtime::enter_runtime
  16: tokio::runtime::scheduler::multi_thread::worker::run
  17: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
  18: tokio::runtime::task::core::Core<T,S>::poll
  19: tokio::runtime::task::harness::Harness<T,S>::poll
  20: tokio::runtime::blocking::pool::Inner::run
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

This is the same issue discussed in my blog post: tasks are destroyed in arbitrary order, so the task owning the mpsc::Sender can be destroyed while the other task is still trying to read from it.

The fix is to make recv() return an Option<SingleSpMessage>, and then do all of the plumbing necessary to make the compiler happy.