oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Ipv4NatGarbageCollector dumped core #5278

Open rcgoodfellow opened 8 months ago

rcgoodfellow commented 8 months ago

After a rack upgrade, a Nexus core file was collected with the following contents.

> ::status
debugging core file of nexus (64-bit) from oxz_nexus_2898657e-4141-4c05-851b-147bffc6bbbd
initial argv: /opt/oxide/omicron-nexus/bin/nexus /var/svc/manifest/site/nexus/config.toml
threading model: native threads
status: process terminated by SIGABRT (Abort), pid=12132 uid=0 code=-1
> $C
fffff5ffd0bfea40 libc.so.1`_lwp_kill+0xa()
fffff5ffd0bfea70 libc.so.1`raise+0x22(6)
fffff5ffd0bfeac0 libc.so.1`abort+0x58()
fffff5ffd0bfead0 0x56dcf89()
fffff5ffd0bfeae0 0x56dcf79()
fffff5ffd0bfeb40 rust_panic+0xd()
fffff5ffd0bfec00 std::panicking::rust_panic_with_hook::h7d19ef586749da2f+0x2ab()
fffff5ffd0bfec40 std::panicking::begin_panic_handler::{{closure}}::h95e3e1ca24551a68+0xa4()
fffff5ffd0bfec50 0x56c26c9()
fffff5ffd0bfec80 0x56c4fd2()
fffff5ffd0bfecc0 0x570be95()
fffff5ffd0bfed40 0x570c525()
fffff5ffd0bff6e0 <omicron_nexus::app::background::nat_cleanup::Ipv4NatGarbageCollector as omicron_nexus::app::background::common::BackgroundTask>::activate::{{closure}}::h2189056f623f6a3e+0x2d19()
fffff5ffd0bff810 omicron_nexus::app::background::common::TaskExec::activate::{{closure}}::ha5c311b342fd4651+0x233()
fffff5ffd0bff8a0 omicron_nexus::app::background::common::TaskExec::run::_$u7b$$u7b$closure$u7d$$u7d$::hc717d6c73ec32d5f +0x467()
fffff5ffd0bffb40 tokio::runtime::task::harness::Harness<T,S>::poll::hedd51b4b5d5db000+0x83()
fffff5ffd0bffb90 tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h169228c0d2ddb6ea+0x146()
fffff5ffd0bffc50 tokio::runtime::context::scoped::Scoped<T>::set::h05e85aeb0fa37f6c+0xabe()
fffff5ffd0bffd10 tokio::runtime::context::runtime::enter_runtime::h3943341dd71d2074+0x193()
fffff5ffd0bffd40 tokio::runtime::scheduler::multi_thread::worker::run::hcde3f3ceaae72a57+0x4b()
fffff5ffd0bffdb0 tokio::runtime::task::core::Core<T,S>::poll::h8f16d80a0bdd7429+0x73()
fffff5ffd0bffe10 tokio::runtime::task::harness::Harness<T,S>::poll::h940736824474ad7c+0x97()
fffff5ffd0bffeb0 tokio::runtime::blocking::pool::Inner::run::hd055f1d45ec7c074+0xe4()
fffff5ffd0bffef0 std::sys_common::backtrace::__rust_begin_short_backtrace::h80cb4c50a4e8edb0+0x3f()
fffff5ffd0bfff60 core::ops::function::FnOnce::call_once{{vtable.shim}}::hf81eed2c2f854ef9+0x75()
fffff5ffd0bfffb0 std::sys::unix::thread::Thread::new::thread_start::h1783cbcbbf061711+0x29()
fffff5ffd0bfffe0 libc.so.1`_thrp_setup+0x77(fffff5ffeddc9a40)
fffff5ffd0bffff0 libc.so.1`_lwp_start()

cc: @internet-diglett

The core is available at

/staff/dogfood/cores-collected-20240316/core.oxz_nexus_2898657e-4141-4c05-851b-147bffc6bbbd.nexus.12132.1710437390

rcgoodfellow commented 8 months ago

I found the logs that appear to be associated with this panic. They show the following.

thread 'tokio-runtime-worker' panicked at nexus/src/app/background/nat_cleanup.rs:91:18:
called `Result::unwrap()` on an `Err` value: ServiceUnavailable { internal_message: "Failed to access DB connection: Timed out in bb8" }

The full logs have been copied to /staff/dogfood/cores-collected-20240316 alongside the core.

internet-diglett commented 7 months ago

Womp. Looks like an unwrap() slipped through. Fortunately, it's an easy fix.
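
For illustration only, here is a minimal sketch of the general shape such a fix could take: match on the `Result` and report the failure as the task's status instead of calling `unwrap()`. This is not the actual Nexus code. The `Error` enum, `nat_current_version`, and `activate` below are hypothetical stand-ins, and the real background task is async and talks to the datastore.

```rust
// Hypothetical stand-in for the error seen in the panic message; the real
// Nexus error type has more variants and fields.
#[derive(Debug)]
enum Error {
    ServiceUnavailable { internal_message: String },
}

// Hypothetical stand-in for a datastore query. When `db_up` is false it
// returns the same kind of error that the unwrap() panicked on.
fn nat_current_version(db_up: bool) -> Result<i64, Error> {
    if db_up {
        Ok(42)
    } else {
        Err(Error::ServiceUnavailable {
            internal_message: "Failed to access DB connection: Timed out in bb8"
                .to_string(),
        })
    }
}

// One activation of the (simplified, synchronous) background task. A failed
// query becomes an error status for this activation rather than a panic, so
// the process keeps running and the task can retry on its next activation.
fn activate(db_up: bool) -> String {
    match nat_current_version(db_up) {
        Ok(version) => format!("ok: current NAT version is {version}"),
        Err(e) => format!("error: failed to check NAT version: {e:?}"),
    }
}
```

Under these assumptions, `activate(false)` returns a string starting with `error:` and the caller carries on, whereas an `unwrap()` on the same `Err` would panic and (as in the core above) abort the process.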