oxidecomputer / omicron

Omicron: Oxide control plane

test failed in CI: test_omdb_success_cases #6505

Open rcgoodfellow opened 1 week ago

rcgoodfellow commented 1 week ago

This test failed on a CI run on pull request 6475:

https://github.com/oxidecomputer/omicron/pull/6475/checks?check_run_id=29546110600

https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J6RJ0W9K2R1TX0DVBZ0RS47V/qhyGpI4O40yzHVoFHWrAhRBFaESiU4fFqaOicq5NLEyLHAz2/01J6RJ164N5KYG7G3SJ5PFFX0H#S5276

Excerpt from the log showing the failure:

        FAIL [  25.576s] omicron-omdb::test_all_output test_omdb_success_cases

--- STDOUT:              omicron-omdb::test_all_output test_omdb_success_cases ---

running 1 test
running commands with args: ["db", "disks", "list"]
running commands with args: ["db", "dns", "show"]
running commands with args: ["db", "dns", "diff", "external", "2"]
running commands with args: ["db", "dns", "names", "external", "2"]
running commands with args: ["db", "instances"]
running commands with args: ["db", "reconfigurator-save", "/var/tmp/omicron_tmp/.tmpVjFflB/reconfigurator-save.out"]
running commands with args: ["db", "sleds"]
running commands with args: ["db", "sleds", "-F", "discretionary"]
running commands with args: ["mgs", "inventory"]
running commands with args: ["nexus", "background-tasks", "doc"]
running commands with args: ["nexus", "background-tasks", "show"]
running commands with args: ["nexus", "background-tasks", "show", "saga_recovery"]
running commands with args: ["nexus", "background-tasks", "show", "blueprint_loader", "blueprint_executor"]
running commands with args: ["nexus", "background-tasks", "show", "dns_internal"]
running commands with args: ["nexus", "background-tasks", "show", "dns_external"]
running commands with args: ["nexus", "background-tasks", "show", "all"]
running commands with args: ["nexus", "sagas", "list"]
running commands with args: ["--destructive", "nexus", "sagas", "demo-create"]
running commands with args: ["nexus", "sagas", "list"]
running commands with args: ["--destructive", "nexus", "background-tasks", "activate", "inventory_collection"]
running commands with args: ["nexus", "blueprints", "list"]
running commands with args: ["nexus", "blueprints", "show", "5103da0a-8625-4be7-b03e-16ff5fde04a9"]
running commands with args: ["nexus", "blueprints", "show", "current-target"]
running commands with args: ["nexus", "blueprints", "diff", "5103da0a-8625-4be7-b03e-16ff5fde04a9", "current-target"]
@@ -55,10 +55,13 @@
 ID NAME STATE PROPOLIS_ID SLED_ID HOST_SERIAL
 ---------------------------------------------
 stderr:
 note: using database URL postgresql://root@[::1]:REDACTED_PORT/omicron?sslmode=disable
 note: database schema version matches expected (<redacted database version>)
+thread 'tokio-runtime-worker' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-bb8-diesel-0.2.1/src/async_traits.rs:97:14:
+called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(36))
+note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
 =============================================
 EXECUTING COMMAND: omdb ["db", "reconfigurator-save", "<TMP_PATH_REDACTED>"]
 termination: Exited(0)
 ---------------------------------------------
 stdout:

test test_omdb_success_cases ... FAILED

failures:

failures:
    test_omdb_success_cases

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 2 filtered out; finished in 25.36s

--- STDERR:              omicron-omdb::test_all_output test_omdb_success_cases ---
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.0.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.0.log"
DB URL: postgresql://root@[::1]:43788/omicron?sslmode=disable
DB address: [::1]:43788
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.2.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.2.log"
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.3.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.3.log"
log file: /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_case.19791.4.log
note: configured to log to "/var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_case.19791.4.log"
thread 'test_omdb_success_cases' panicked at dev-tools/omdb/tests/test_all_output.rs:242:5:
assertion failed: string doesn't match the contents of file: "tests/successes.out" see diffset above
                set EXPECTORATE=overwrite if these changes are intentional
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: assert_contents<&str>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/expectorate-1.1.0/src/lib.rs:64:9
   3: {async_fn#0}
             at ./tests/test_all_output.rs:242:5
   4: {async_block#0}
             at ./tests/test_all_output.rs:116:1
   5: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   8: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
   9: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  12: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  17: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  18: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  21: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  26: test_omdb_success_cases
             at ./tests/test_all_output.rs:116:1
  27: test_all_output::test_omdb_success_cases::{{closure}}
             at ./tests/test_all_output.rs:117:70
  28: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmpDdfI0d"
If you would like to access the database for debugging, run the following:

# Run the database
cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmpDdfI0d/data"
# Access the database. Note the port may change if you run multiple databases.
cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failed to clean up ClickHouse data dir:
- /var/tmp/omicron_tmp/test_all_output-ce8c2ad688e5b1af-test_omdb_success_cases.19791.1-clickhouse-Tjceqh: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpgUhVaY
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpU3R6qP
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpoJXOMm
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpoo1igu
davepacheco commented 1 week ago

Bummer -- and thanks for filing this.

From the output, it looks to me like the test ran the command omdb db instances and that panicked with:

thread 'tokio-runtime-worker' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/async-bb8-diesel-0.2.1/src/async_traits.rs:97:14:
called `Result::unwrap()` on an `Err` value: JoinError::Cancelled(Id(36))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

That's not very much to go on. We don't have more because this was a subprocess -- the test ultimately failed only because the output didn't match what it expected. I haven't totally given up yet but I've put up #6516 so that if we hit this again we'll get more information about a panic from the subprocess.
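
For reference, here is a minimal sketch of that golden-file check (an illustration only, not the actual test harness): the test runs omdb as a subprocess, accumulates its redacted output into one transcript, and hands it to expectorate, so a panic inside the subprocess only surfaces as unexpected stderr lines in the diff.

```rust
use std::process::Command;

// Minimal sketch, not the real dev-tools/omdb test: run the command, capture
// stdout/stderr, and compare the transcript against the checked-in golden
// file. The command invocation and transcript format here are illustrative.
fn run_and_check() {
    let output = Command::new("omdb")
        .args(["db", "instances"])
        .output()
        .expect("failed to run omdb");

    let transcript = format!(
        "stdout:\n{}\nstderr:\n{}\n",
        String::from_utf8_lossy(&output.stdout),
        String::from_utf8_lossy(&output.stderr),
    );

    // Panics with a diff on mismatch; set EXPECTORATE=overwrite to accept
    // intentional changes, as the failure message suggests.
    expectorate::assert_contents("tests/successes.out", &transcript);
}
```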

davepacheco commented 1 week ago

The panic message is coming from here: https://github.com/oxidecomputer/async-bb8-diesel/blob/1850c9d9a9311ff6a60cadee9023e7693eda3304/src/async_traits.rs#L97

But I think that's just propagating a panic that happened in the middle of just about anything that async-bb8-diesel was doing. There are a few unwraps in the omdb db instances command itself: https://github.com/oxidecomputer/omicron/blob/a77c31bf0238c08fc221609e9134c1c519d584f7/dev-tools/omdb/src/bin/omdb/db.rs#L2837-L2920

But if we panicked in those, I don't think it would show up in async-bb8-diesel. I'm trying to figure out what would show up there. We're not using transaction_async in this code so I don't see how we could have entered async-bb8-diesel and then called back out to this code. An example might be if the synchronous load panicked, but that's not our code so that would be surprising.

I'm also going to file an async-bb8-diesel bug because it seems like it could propagate more about the panic error in this situation.
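
For context, the panic message is consistent with the usual spawn_blocking pattern, sketched below (an assumption based on the message, not the actual async-bb8-diesel source): the synchronous diesel call runs on tokio's blocking pool, and the JoinHandle result is unwrapped.

```rust
// Rough sketch of the suspected shape of the code at async_traits.rs:97; the
// function name and signature here are hypothetical.
async fn run_blocking<F, R>(f: F) -> R
where
    F: FnOnce() -> R + Send + 'static,
    R: Send + 'static,
{
    tokio::task::spawn_blocking(f)
        .await
        // This unwrap() is the panic site: the Err is a tokio JoinError.
        // JoinError::Cancelled means the closure never ran at all, as opposed
        // to JoinError::Panic, which would carry the payload of a panic that
        // happened inside the closure.
        .unwrap()
}
```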

davepacheco commented 1 week ago

Actually, I'm not sure this is an async-bb8-diesel bug. Looking more closely at the JoinError, it's saying that the underlying task was cancelled, not that it panicked. How did that happen? Looking at the tokio docs:

When you shut down the executor, it will wait indefinitely for all blocking operations to finish. You can use shutdown_timeout to stop waiting for them after a certain timeout. Be aware that this will still not cancel the tasks — they are simply allowed to keep running after the method returns. It is possible for a blocking task to be cancelled if it has not yet started running, but this is not guaranteed.

One way I could imagine this happening is if the program started an async database operation (like load_async) but then panicked before the corresponding tokio task was started. That might trigger teardown of the executor and we might see this second panic. But then shouldn't we see some information about that other panic?
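
A minimal sketch of that scenario (hypothetical, unrelated to omdb itself): with the blocking pool saturated, a queued blocking task that has not started yet can be cancelled, and awaiting its handle then yields exactly this JoinError::Cancelled.

```rust
use std::time::Duration;

fn main() {
    // Single blocking thread so a second blocking task stays queued.
    let rt = tokio::runtime::Builder::new_current_thread()
        .max_blocking_threads(1)
        .build()
        .unwrap();

    rt.block_on(async {
        // Occupy the only blocking thread for a moment...
        let _busy =
            tokio::task::spawn_blocking(|| std::thread::sleep(Duration::from_millis(200)));

        // ...so this task stays queued and never starts running.
        let queued = tokio::task::spawn_blocking(|| "never runs");

        // A blocking task that has not started yet may be cancelled by abort();
        // awaiting the handle then reports JoinError::Cancelled, which is the
        // error that the unwrap() in async-bb8-diesel turned into a panic.
        queued.abort();
        let err = queued.await.unwrap_err();
        assert!(err.is_cancelled());
    });
}
```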