oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

test failure: `integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates` #6727

Open hawkw opened 1 month ago

hawkw commented 1 month ago

On branch #6726 commit b3346d917f086c7de6cdebbe723e3017a3783abc, I saw this test failure:

running 1 test
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates has been running for over 60 seconds
test integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ... FAILED

failures:

failures:
    integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 365 filtered out; finished in 164.25s

--- STDERR:              omicron-nexus::test_all integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates ---
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log"
DB URL: postgresql://root@[::1]:38623/omicron?sslmode=disable
DB address: [::1]:38623
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.2.log"
log file: /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log
note: configured to log to "/tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.3.log"
[nexus/tests/integration_tests/instances.rs:1368:23] make_forgotten_instance(&cptestctx, "resurgam",
        InstanceAutoRestartPolicy::BestEffort).await = 26edfe76-56d4-4c9e-9420-9218e91ea268 (instance)
[nexus/tests/integration_tests/instances.rs:1377:5] nexus_test_utils::background::activate_background_task(&cptestctx.internal_client,
        "instance_watcher").await = BackgroundTask {
    current: Idle,
    description: "periodically checks instance states",
    last: Completed(
        LastResultCompleted {
            details: Object {
                "failed_checks": Object {
                    "no_such_instance": Number(1),
                },
                "incomplete_checks": Object {},
                "instance_states": Object {},
                "pruned_instances": Number(0),
                "total_instances": Number(1),
                "update_sagas_queued": Number(1),
            },
            elapsed: Duration {
                nanos: 375909468,
                secs: 1,
            },
            iteration: 2,
            reason: Signaled,
            start_time: 2024-09-30T17:30:10.979761748Z,
        },
    ),
    name: "instance_watcher",
    period: Duration {
        nanos: 0,
        secs: 30,
    },
}
[nexus/tests/integration_tests/instances.rs:1396:5] instance_wait_for_state(client, instance_id, InstanceState::Starting).await = Instance {
    identity: IdentityMetadata {
        id: 26edfe76-56d4-4c9e-9420-9218e91ea268,
        name: Name(
            "resurgam",
        ),
        description: "instance \"resurgam\"",
        time_created: 2024-09-30T17:30:06.993226Z,
        time_modified: 2024-09-30T17:30:06.993226Z,
    },
    project_id: cfb19f9d-8b7f-49d9-a6e5-7fc799bc6be5,
    ncpus: InstanceCpuCount(
        4,
    ),
    memory: ByteCount(
        1073741824,
    ),
    hostname: "the-host",
    runtime: InstanceRuntimeState {
        run_state: Starting,
        time_run_state_updated: 2024-09-30T17:30:12.371281Z,
        time_last_auto_restarted: Some(
            2024-09-30T17:30:12.419981Z,
        ),
    },
    auto_restart_status: InstanceAutoRestartStatus {
        enabled: true,
        cooldown_expiration: Some(
            2024-09-30T18:30:12.419981Z,
        ),
    },
}
[nexus/tests/integration_tests/instances.rs:1402:5] instance_wait_for_vmm_registration(cptestctx, &instance_id).await = ()
[nexus/tests/integration_tests/instances.rs:1404:5] instance_simulate(nexus, &instance_id).await = ()
thread 'integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates' panicked at nexus/tests/integration_tests/instances.rs:5382:19:
instance 26edfe76-56d4-4c9e-9420-9218e91ea268 did not transition to Running after 120s: timed out after 120.235074552s
stack backtrace:
   0: rust_begin_unwind
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
   1: core::panicking::panic_fmt
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
   2: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5382:19
   3: {async_fn#0}
             at ./tests/integration_tests/instances.rs:5333:6
   4: {async_fn#0}
             at ./tests/integration_tests/instances.rs:1407:14
   5: {async_block#0}
             at ./tests/integration_tests/instances.rs:1362:1
   6: poll<&mut dyn core::future::future::Future<Output=()>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   7: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
   8: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:57
   9: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:107:5
  10: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/coop.rs:73:5
  11: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:673:25
  12: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:412:19
  13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:672:36
  14: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:68
  15: tokio::runtime::context::scoped::Scoped<T>::set
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/scoped.rs:40:9
  16: tokio::runtime::context::set_scheduler::{{closure}}
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:26
  17: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
  18: std::thread::local::LocalKey<T>::with
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
  19: tokio::runtime::context::set_scheduler
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context.rs:180:9
  20: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:751:27
  21: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:660:19
  22: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:180:28
  23: tokio::runtime::context::runtime::enter_runtime
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/context/runtime.rs:65:16
  24: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/scheduler/current_thread/mod.rs:168:9
  25: tokio::runtime::runtime::Runtime::block_on_inner
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:361:47
  26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
             at /home/eliza/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.39.3/src/runtime/runtime.rs:335:13
  27: test_instance_failed_by_instance_watcher_automatically_reincarnates
             at ./tests/integration_tests/instances.rs:1362:1
  28: test_all::integration_tests::instances::test_instance_failed_by_instance_watcher_automatically_reincarnates::{{closure}}
             at ./tests/integration_tests/instances.rs:1365:2
  29: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
  30: core::ops::function::FnOnce::call_once
             at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: "/tmp/.tmpeagOTP"
        If you would like to access the database for debugging, run the following:

        # Run the database
        cargo xtask db-dev run --no-populate --store-dir "/tmp/.tmpeagOTP/data"
        # Access the database. Note the port may change if you run multiple databases.
        cockroach sql --host=localhost:32221 --insecure
WARN: dropped ClickHouse process without cleaning it up first (there may still be a child process running (PID 2149884) and a temporary directory leaked, /tmp/test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.1-clickhouse-u4U4l4)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpCeuyEZ
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /tmp/.tmpLrWu8z
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpAEq29n
WARN: dropped MgdInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: mgd temporary directory leaked: /tmp/.tmpbVjJR5

   Canceling due to test failure: 23 tests still running

Reran the test and it appears to pass. I wonder if it just took slightly longer than 2 minutes for the instance to transition back from Starting -> Running? Feels like a pretty long time though...

Log file: test_all-afb586eba2d02dc7-test_instance_failed_by_instance_watcher_automatically_reincarnates.2148721.0.log

NOTE: Consider attaching any log files produced by the test.

hawkw commented 1 month ago

This is my test, so I'm gonna try to figure out what's gone wrong here once once I finish more urgent work.

jgallagher commented 1 month ago

Saw this on https://github.com/oxidecomputer/omicron/runs/31050364405 / https://buildomat.eng.oxide.computer/wg/0/details/01J9A281VZ5AWSENGSBPS2SZBM/YgsiBVpZeZs8xKwEs3NG8vJWRHaj6HRg4P2sOa8fSLBWOjXN/01J9A28XVZQ35MEGH8KPZZ9Q26. At a glance it looks basically the same (other than a new policy field in the logged structs?), so not sure there's any new info there, but thought I'd note it in case that's wrong.

david-crespo commented 1 month ago

Just saw this on #6824.

https://github.com/oxidecomputer/omicron/runs/31513295998

https://buildomat.eng.oxide.computer/wg/0/details/01JA5ZXRB1CWA2BQ4G6XCHX07Q/nEHLkJyxA8Ap0rs1D8ue8tqis5q1jxc2XuUv8R28HeLuJ4dI/01JA5ZY9CP0CVDARCH7TZJAXSV