oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
244 stars 36 forks source link

test failed in CI: `app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones` #6568

Open bnaecker opened 6 days ago

bnaecker commented 6 days ago

This test failed on a CI run on pull request #6551:

https://github.com/oxidecomputer/omicron/pull/6551/checks?check_run_id=30071404933

Log showing the specific test failure:

https://buildomat.eng.oxide.computer/wg/0/details/01J7KRQA0QPH39CY96CQJF6Z23/1FrIoqw9yG0jDA7yi1gTIwhUFrVoqNJvyLQAGB0ClHEq3pm0/01J7KRR5P728MRF0TSY1N93XM3#S4699

Excerpt from the log showing the failure:

--- STDERR:              omicron-nexus app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones ---
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.0.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.0.log"
DB URL: postgresql://root@[::1]:41040/omicron?sslmode=disable
DB address: [::1]:41040
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.2.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.2.log"
log file: /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.3.log
note: configured to log to "/var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.3.log"
thread 'app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones' panicked at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/httptest-0.15.5/src/server.rs:90:13:
received the following unexpected requests:
[
    Request {
        method: GET,
        uri: /inventory,
        version: HTTP/1.1,
        headers: {
            "accept": "application/json",
            "host": "[::1]:50941",
        },
        body: b"",
    },
]
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: verify_and_clear
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/httptest-0.15.5/src/server.rs:90:13
   3: {async_fn#0}
             at ./src/app/background/tasks/blueprint_execution.rs:463:9
   4: {async_block#0}
             at ./src/app/background/tasks/blueprint_execution.rs:267:5
   5: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   6: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   8: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
   9: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  12: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  14: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  15: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  16: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  17: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  18: tokio::runtime::context::set_scheduler
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  19: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  20: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  21: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  22: tokio::runtime::context::runtime::enter_runtime
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  23: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  24: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  25: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  26: test_deploy_omicron_zones
             at ./src/app/background/tasks/blueprint_execution.rs:267:5
  27: omicron_nexus::app::background::tasks::blueprint_execution::test::test_deploy_omicron_zones::{{closure}}
             at ./src/app/background/tasks/blueprint_execution.rs:268:76
  28: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/var/tmp/omicron_tmp/.tmpqQMj6Z"
    If you would like to access the database for debugging, run the following:

    # Run the database
    cargo xtask db-dev run --no-populate --store-dir "/var/tmp/omicron_tmp/.tmpqQMj6Z/data"
    # Access the database. Note the port may change if you run multiple databases.
    cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
failed to clean up ClickHouse data dir:
- /var/tmp/omicron_tmp/omicron_nexus-5f590c67750dc30c-test_deploy_omicron_zones.10132.1-clickhouse-LirrEm: File exists (os error 17)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpz45LCq
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpcak1sV
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpvh7J0T
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /var/tmp/omicron_tmp/.tmpYFkdVu
jgallagher commented 1 day ago

I think I understand this. This test starts some mock sled-agents then inserts records for those mock sled-agents into CRDB. It then proceeds to activate the Nexus RPW it's trying to test a few ways and confirms that task talks to the sled-agents as expected.

However, we're running under a #[nexus_test], so other Nexus RPWs entirely unrelated to blueprint execution are running. Presumably in this flake, the inventory collection RPW saw the CRDB records for our mock sled agents and tried to collect their inventories, causing the mock http server to complain about an unexpected request.

I'm not sure off the top of my head what the best fix for this is, though. Simulated sled agent instead of a mock? Turn off the other RPWs (if that's possible)? Tell the mock sled-agent to ignore unexpected requests?

bnaecker commented 1 day ago

Tell the mock sled-agent to ignore unexpected requests?

I like this approach. It looks like we have complete control over the mock servers, so we maybe we can keep track only of those counts we care about. The assertion can track that one count, rather than all requests.

jgallagher commented 1 day ago

Still happening after the fix in #6590 (https://github.com/oxidecomputer/omicron/pull/6593#issuecomment-2357427950). #6594 should fix this, but I'm worried it may just make the flake less likely. I'd like to leave this open for a while even after #6594 lands in case we see it again (at which point we probably need to rework that test to either use a different http mocking library or do its checks some other way).