oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

test failed in CI: integration_tests::metrics::test_metrics #6895

Closed: smklein closed this issue 2 weeks ago

smklein commented 4 weeks ago

This test failed on a CI run on https://github.com/oxidecomputer/omicron/pull/6881

https://github.com/oxidecomputer/omicron/pull/6881/checks?check_run_id=31702768426

Link here to the specific line of output from the buildomat log showing the failure: https://buildomat.eng.oxide.computer/wg/0/details/01JAE89628KSREHPCPZHC8DVS0/TV1XNO9ztAlr3QY3XQh3zGrYKkUsoCJHt0VvRzVuqCqsj1Ab/01JAE89QQBY3BRXDXDZ6N76977#S6309

Excerpt from the log showing the failure:

6325    2024-10-17T22:43:00.424Z    --- STDERR:              omicron-nexus::test_all integration_tests::metrics::test_metrics ---
6326    2024-10-17T22:43:00.424Z    log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.0.log
6327    2024-10-17T22:43:00.424Z    note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.0.log"
6328    2024-10-17T22:43:00.424Z    DB URL: postgresql://root@[::1]:41953/omicron?sslmode=disable
6329    2024-10-17T22:43:00.424Z    DB address: [::1]:41953
6330    2024-10-17T22:43:00.424Z    log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.2.log
6331    2024-10-17T22:43:00.424Z    note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.2.log"
6332    2024-10-17T22:43:00.424Z    log file: /var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.3.log
6333    2024-10-17T22:43:00.424Z    note: configured to log to "/var/tmp/omicron_tmp/test_all-e67ea319cd46d578-test_metrics.149544.3.log"
6334    2024-10-17T22:43:00.424Z    thread 'integration_tests::metrics::test_metrics' panicked at nexus/tests/integration_tests/metrics.rs:804:6:
6335    2024-10-17T22:43:00.424Z    Failed to find producer within time limit: TimedOut(60.187220276s)
6336    2024-10-17T22:43:00.424Z    stack backtrace:
6337    2024-10-17T22:43:00.424Z       0: rust_begin_unwind
6338    2024-10-17T22:43:00.424Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/panicking.rs:652:5
6339    2024-10-17T22:43:00.424Z       1: core::panicking::panic_fmt
6340    2024-10-17T22:43:00.424Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/panicking.rs:72:14
6341    2024-10-17T22:43:00.425Z       2: core::result::unwrap_failed
6342    2024-10-17T22:43:00.425Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1679:5
6343    2024-10-17T22:43:00.425Z       3: core::result::Result<T,E>::expect
6344    2024-10-17T22:43:00.425Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/result.rs:1059:23
6345    2024-10-17T22:43:00.425Z       4: {async_fn#0}
6346    2024-10-17T22:43:00.425Z                 at ./tests/integration_tests/metrics.rs:787:5
6347    2024-10-17T22:43:00.425Z       5: {async_fn#0}<newtype_uuid::TypedUuid<omicron_uuid_kinds::OmicronZoneKind>>
6348    2024-10-17T22:43:00.425Z                 at ./tests/integration_tests/metrics.rs:778:71
6349    2024-10-17T22:43:00.425Z       6: {async_fn#0}
6350    2024-10-17T22:43:00.425Z                 at ./tests/integration_tests/metrics.rs:189:6
6351    2024-10-17T22:43:00.425Z       7: {async_block#0}
6352    2024-10-17T22:43:00.425Z                 at ./tests/integration_tests/metrics.rs:176:1
6353    2024-10-17T22:43:00.425Z       8: poll<&mut dyn core::future::future::Future<Output=()>>
6354    2024-10-17T22:43:00.425Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
6355    2024-10-17T22:43:00.425Z       9: poll<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6356    2024-10-17T22:43:00.425Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/core/src/future/future.rs:123:9
6357    2024-10-17T22:43:00.425Z      10: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6358    2024-10-17T22:43:00.425Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:696:57
6359    2024-10-17T22:43:00.425Z      11: with_budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
6360    2024-10-17T22:43:00.425Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:107:5
6361    2024-10-17T22:43:00.425Z      12: budget<core::task::poll::Poll<()>, tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure#0}::{closure#0}::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>>
6362    2024-10-17T22:43:00.425Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/coop.rs:73:5
6363    2024-10-17T22:43:00.425Z      13: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6364    2024-10-17T22:43:00.425Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:696:25
6365    2024-10-17T22:43:00.425Z      14: tokio::runtime::scheduler::current_thread::Context::enter
6366    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:423:19
6367    2024-10-17T22:43:00.426Z      15: {closure#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>
6368    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:695:36
6369    2024-10-17T22:43:00.426Z      16: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
6370    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:774:68
6371    2024-10-17T22:43:00.426Z      17: tokio::runtime::context::scoped::Scoped<T>::set
6372    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/scoped.rs:40:9
6373    2024-10-17T22:43:00.426Z      18: tokio::runtime::context::set_scheduler::{{closure}}
6374    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:26
6375    2024-10-17T22:43:00.426Z      19: try_with<tokio::runtime::context::Context, tokio::runtime::context::set_scheduler::{closure_env#0}<(alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>), tokio::runtime::scheduler::current_thread::{impl#8}::enter::{closure_env#0}<tokio::runtime::scheduler::current_thread::{impl#8}::block_on::{closure_env#0}<core::pin::Pin<&mut core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>>, core::option::Option<()>>>, (alloc::boxed::Box<tokio::runtime::scheduler::current_thread::Core, alloc::alloc::Global>, core::option::Option<()>)>
6376    2024-10-17T22:43:00.426Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:283:12
6377    2024-10-17T22:43:00.426Z      20: std::thread::local::LocalKey<T>::with
6378    2024-10-17T22:43:00.426Z                 at /rustc/3f5fd8dd41153bc5fdca9427e9e05be2c767ba23/library/std/src/thread/local.rs:260:9
6379    2024-10-17T22:43:00.426Z      21: tokio::runtime::context::set_scheduler
6380    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context.rs:180:9
6381    2024-10-17T22:43:00.426Z      22: tokio::runtime::scheduler::current_thread::CoreGuard::enter
6382    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:774:27
6383    2024-10-17T22:43:00.426Z      23: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
6384    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:683:19
6385    2024-10-17T22:43:00.426Z      24: {closure#0}<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6386    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:191:28
6387    2024-10-17T22:43:00.426Z      25: tokio::runtime::context::runtime::enter_runtime
6388    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/context/runtime.rs:65:16
6389    2024-10-17T22:43:00.426Z      26: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6390    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/scheduler/current_thread/mod.rs:179:9
6391    2024-10-17T22:43:00.426Z      27: tokio::runtime::runtime::Runtime::block_on_inner
6392    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/runtime.rs:361:47
6393    2024-10-17T22:43:00.426Z      28: block_on<core::pin::Pin<&mut dyn core::future::future::Future<Output=()>>>
6394    2024-10-17T22:43:00.426Z                 at /home/build/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.40.0/src/runtime/runtime.rs:335:13
6395    2024-10-17T22:43:00.426Z      29: test_metrics
6396    2024-10-17T22:43:00.427Z                 at ./tests/integration_tests/metrics.rs:176:1
6397    2024-10-17T22:43:00.427Z      30: test_all::integration_tests::metrics::test_metrics::{{closure}}
jgallagher commented 4 weeks ago

I hit this locally and looked into it for a bit. I think this might be a bug / race in Oximeter producer assignment. There are two ways Oximeter decides what its producers are:

  1. It runs a refresh loop where it periodically asks Nexus for its full set of producers. The default interval for this loop is 10 minutes; we don't change that in tests, so it effectively only runs once, around the time Oximeter starts.
  2. When a new producer registers with Nexus, Nexus sends a POST request to Oximeter telling it about the new producer.
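
To make the race surface concrete, here is a minimal sketch of those two paths (the type and method names are invented for illustration; the real logic lives in oximeter/collector/src/agent.rs). The important detail is that the refresh path prunes anything not present in the list it fetched from Nexus:

```rust
use std::collections::HashMap;

type ProducerId = String;

/// Hypothetical, simplified model of oximeter's producer bookkeeping.
#[derive(Default)]
struct ProducerSet {
    /// producer id -> collection address
    producers: HashMap<ProducerId, String>,
}

impl ProducerSet {
    /// Path 2: Nexus POSTs each newly registered producer to oximeter.
    fn register_producer(&mut self, id: ProducerId, address: String) {
        self.producers.insert(id, address);
    }

    /// Path 1: the periodic refresh replaces the set with whatever the
    /// (paginated) list from Nexus contained, pruning everything else.
    /// This prune is the step that can race with `register_producer`.
    fn apply_refresh(&mut self, listed: Vec<(ProducerId, String)>) {
        let keep: HashMap<ProducerId, String> = listed.into_iter().collect();
        // Anything we know about that is not in the fetched list is pruned,
        // even if it registered after Nexus built that list.
        self.producers.retain(|id, _| keep.contains_key(id));
        for (id, addr) in keep {
            self.producers.insert(id, addr);
        }
    }
}
```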

Oximeter logs all of these; snipping out just the relevant bits from a failed test:

13:42:53.584Z INFO test_mgs_metrics (oximeter): refreshing list of producers from Nexus
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    file = oximeter/collector/src/agent.rs:761
13:42:53.622Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = [::1]:44257
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = a6458b7d-87c3-4483-be96-854d814c20de
13:42:53.658Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = [::1]:40449
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = 47991ac2-392b-4188-bf19-96e0f0440d14
13:42:53.678Z INFO test_mgs_metrics (oximeter): refreshed list of producers from Nexus
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    file = oximeter/collector/src/agent.rs:809
    n_current_tasks = 1
    n_pruned_tasks = 1
13:42:54.442Z DEBG test_mgs_metrics (oximeter): registered new metric producer
    address = 127.0.0.1:41885
    collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
    collector_ip = ::1
    producer_id = e559cb0e-f223-4273-a733-d5f31b1be530

So I think what happened was:

  1. Oximeter sent a request to Nexus to list its current producers; this is a series of paginated requests (in practice in this test it's two requests: one that fetches the first page, and a second that gets an empty result, signaling the end of the list).
  2. Nexus POST'd a new producer (a6458b7d-87c3-4483-be96-854d814c20de; this isn't MGS).
  3. Nexus POST'd another new producer (47991ac2-392b-4188-bf19-96e0f0440d14; this one is MGS).
  4. Oximeter finished getting its list of producers. There was only one producer in this list (n_current_tasks = 1), so it pruned the other one it knew about (MGS, because we got unlucky with timing or UUID ordering or something).
  5. Nexus POST'd another new producer (e559cb0e-f223-4273-a733-d5f31b1be530; also not MGS).

At this point Oximeter is aware of 2 producers but has pruned the third, which is the one this test in particular cares about (MGS). It would have corrected itself the next time the refresh loop ran, but that's 10 minutes away, and the test times out after 1 minute.

bnaecker commented 4 weeks ago

Thanks @jgallagher for the analysis; I think that is all consistent with the behavior we see here.

One possible solution is to keep track of the time we start a refresh and the time each producer is stored. When we complete the refresh and update the internal set of producers, we prune producers that are in our map but not in the refreshed list, except those which we inserted after the refresh started. I think that should ensure this exact sequence is handled correctly, but we might need to think more carefully about other possible races.
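
A rough sketch of that timestamp-based guard, with invented names and under the assumption that each entry records when it was stored and the refresh records when it started:

```rust
use std::collections::{HashMap, HashSet};
use std::time::Instant;

type ProducerId = String;

struct ProducerEntry {
    address: String,
    /// When we last inserted or updated this producer.
    inserted_at: Instant,
}

#[derive(Default)]
struct ProducerSet {
    producers: HashMap<ProducerId, ProducerEntry>,
}

impl ProducerSet {
    /// Registration path: remember when the producer was stored.
    fn register_producer(&mut self, id: ProducerId, address: String) {
        self.producers
            .insert(id, ProducerEntry { address, inserted_at: Instant::now() });
    }

    /// Refresh path: prune producers missing from the refreshed list, except
    /// those stored after the refresh started -- they may simply have
    /// registered too late to appear in the list Nexus returned.
    fn apply_refresh(&mut self, refresh_started: Instant, listed: HashSet<ProducerId>) {
        self.producers.retain(|id, entry| {
            listed.contains(id) || entry.inserted_at > refresh_started
        });
    }
}
```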

bnaecker commented 4 weeks ago

@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:

  • a generation number on the whole set of collection tasks
  • a per-producer generation number that records the generation of the set in which the producer was added or updated

As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into ensure_producers() with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.

That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.
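
A sketch of how the generation-based version might look, again with invented names and assuming the set generation is bumped on every registration (the exact bump point is not spelled out above):

```rust
use std::collections::{HashMap, HashSet};

type ProducerId = String;

struct ProducerEntry {
    address: String,
    /// Generation of the task set at which this producer was added/updated.
    generation: u64,
}

#[derive(Default)]
struct ProducerSet {
    /// Generation number of the whole set of collection tasks; assumed here
    /// to advance on every registration.
    generation: u64,
    producers: HashMap<ProducerId, ProducerEntry>,
}

impl ProducerSet {
    /// Registration path: tag the producer with the current set generation.
    fn register_producer(&mut self, id: ProducerId, address: String) {
        self.generation += 1;
        let generation = self.generation;
        self.producers.insert(id, ProducerEntry { address, generation });
    }

    /// Refresh path, called with the generation recorded before the refresh
    /// began. Producers with a later generation were registered while the
    /// refresh was in flight, so they are not pruned even if they are absent
    /// from `listed`.
    fn ensure_producers(&mut self, observed_gen: u64, listed: HashSet<ProducerId>) {
        self.producers.retain(|id, entry| {
            listed.contains(id) || entry.generation > observed_gen
        });
    }
}
```

The `entry.generation > observed_gen` check is what keeps mid-refresh registrations alive, without relying on clock comparisons as the timestamp variant would.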

andrewjstone commented 4 weeks ago

@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:

  • a generation number on the whole set of collection tasks
  • a per-producer generation number that records the generation of the set in which the producer was added or updated

As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into ensure_producers() with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.

That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.

Epoch-based reclamation ;)

hawkw commented 3 weeks ago

I saw a different test (integration_tests::oximeter::test_oximeter_reregistration) fail on #6913: https://buildomat.eng.oxide.computer/wg/0/details/01JAXAGXWPE822857D5D1DP0N4/DMz6rNUPg3AtDtKClnW7rWsxzHSSFztK1IEKxJe1w21To9FJ/01JAXAH9TM5NJ0PDYF6RSJXV5H#S5672

This sure feels related to me, so I thought I'd mention it here instead of making a fresh issue for it.

bnaecker commented 2 weeks ago

Closed by #6926