oxidecomputer / omicron


spurious failures from disk/instance metric test #2994

Closed: davepacheco closed this issue 1 year ago

davepacheco commented 1 year ago

I've seen a couple of apparently-spurious failures of test_instance_metrics and test_disk_metrics. Here's one from CI:

https://github.com/oxidecomputer/omicron/pull/2951/checks?check_run_id=13190550818 (buildomat job id 01GZFQ8S6NGARWQ8Z7HECHBTA2)

5045 | 2023-05-03T03:39:35.710Z | ---- integration_tests::instances::test_instance_metrics stdout ----
5046 | 2023-05-03T03:39:35.716Z | log file: /var/tmp/omicron_tmp/test_all-208dd36b5428cdaf-test_instance_metrics.7237.69.log
5047 | 2023-05-03T03:39:35.722Z | note: configured to log to "/var/tmp/omicron_tmp/test_all-208dd36b5428cdaf-test_instance_metrics.7237.69.log"
5048 | 2023-05-03T03:39:35.727Z | thread 'integration_tests::instances::test_instance_metrics' panicked at 'Expected at least one measurement', nexus/tests/integration_tests/metrics.rs:34:5
5049 | 2023-05-03T03:39:35.733Z | note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
5050 | 2023-05-03T03:39:35.738Z | WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
5051 | 2023-05-03T03:39:35.744Z | WARN: temporary directory leaked: /var/tmp/omicron_tmp/.tmpkj2Pnr
5052 | 2023-05-03T03:39:35.750Z | WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
5053 | 2023-05-03T03:39:35.758Z | WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
5054 | 2023-05-03T03:39:35.769Z | WARN: dendrite temporary directory leaked: /var/tmp/omicron_tmp/.tmpg0eygN
5055 | 2023-05-03T03:39:35.774Z |  
5056 | 2023-05-03T03:39:35.780Z |  
5057 | 2023-05-03T03:39:35.785Z | failures:
5058 | 2023-05-03T03:39:35.790Z | integration_tests::instances::test_instance_metrics
davepacheco commented 1 year ago

This was not easy to reproduce locally and we don't have as much information as would be useful out of the failures. See #2995.

I did catch one with:

while RUST_BACKTRACE=1 cargo test -p omicron-nexus -- integration_tests::instances integration_tests::disks; do :; done
...
failures:

---- integration_tests::disks::test_disk_metrics stdout ----
log file: /dangerzone/omicron_tmp/test_all-842a3bf74e59f078-test_disk_metrics.18714.8.log
note: configured to log to "/dangerzone/omicron_tmp/test_all-842a3bf74e59f078-test_disk_metrics.18714.8.log"
thread 'integration_tests::disks::test_disk_metrics' panicked at 'Expected at least one measurement', nexus/tests/integration_tests/metrics.rs:34:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: test_all::integration_tests::metrics::query_for_latest_metric::{{closure}}
             at ./tests/integration_tests/metrics.rs:34:5
   3: test_all::integration_tests::disks::test_disk_metrics::{{closure}}::test_disk_metrics::{{closure}}
             at ./tests/integration_tests/disks.rs:1419:70
   4: test_all::integration_tests::disks::test_disk_metrics::{{closure}}
             at ./tests/integration_tests/disks.rs:1350:1
   5: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/future/future.rs:125:9
   6: <core::pin::Pin<P> as core::future::future::Future>::poll
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/future/future.rs:125:9
   7: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}}
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:541:57
   8: tokio::runtime::coop::with_budget
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/coop.rs:107:5
   9: tokio::runtime::coop::budget
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/coop.rs:73:5
  10: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:541:25
  11: tokio::runtime::scheduler::current_thread::Context::enter
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:350:19
  12: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:540:36
  13: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}}
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:615:57
  14: tokio::macros::scoped_tls::ScopedKey<T>::set
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/macros/scoped_tls.rs:61:9
  15: tokio::runtime::scheduler::current_thread::CoreGuard::enter
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:615:27
  16: tokio::runtime::scheduler::current_thread::CoreGuard::block_on
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:530:19
  17: tokio::runtime::scheduler::current_thread::CurrentThread::block_on
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/scheduler/current_thread.rs:154:24
  18: tokio::runtime::runtime::Runtime::block_on
             at /home/dap/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.28.0/src/runtime/runtime.rs:302:47
  19: test_all::integration_tests::disks::test_disk_metrics
             at ./tests/integration_tests/disks.rs:1350:1
  20: test_all::integration_tests::disks::test_disk_metrics::{{closure}}
             at ./tests/integration_tests/disks.rs:1350:1
  21: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
  22: core::ops::function::FnOnce::call_once
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
WARN: dropped CockroachInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: temporary directory leaked: /dangerzone/omicron_tmp/.tmpagosuV
WARN: dropped ClickHouseInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dropped DendriteInstance without cleaning it up first (there may still be a child process running and a temporary directory leaked)
WARN: dendrite temporary directory leaked: /dangerzone/omicron_tmp/.tmpaAr1Me

failures:
    integration_tests::disks::test_disk_metrics

test result: FAILED. 45 passed; 1 failed; 0 ignored; 0 measured; 180 filtered out; finished in 67.45s

error: test failed, to rerun pass `-p omicron-nexus --test test_all`
davepacheco commented 1 year ago

I think I've gotten to the bottom of the failure that I reproduced. The short version is that when fetching metrics from Clickhouse, the test uses a 20s window centered on the current time. In my failure, the metric had last been updated more than 10s before we went to fetch it, so we didn't see that measurement. From talking to @smklein, I think this 20s window came from an earlier version where we didn't have a way to force oximeter collection. Now that we can do that (and I believe it's synchronous), the window could be [test_start_time, now]. I also think we no longer need to loop querying until we find a measurement, because we should always find one by the time we go look for it, now that we have a way to wait for oximeter collection to finish.
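Roughly, the proposed query construction would look like the following sketch (the helper name and signature here are illustrative, but the path and query parameters mirror the /v1/system/metrics request visible in the log further down):

use chrono::{DateTime, SecondsFormat, Utc};

// Sketch only: use a window of [test start, now] instead of
// [now - 10s, now + 10s].  The function name and signature are made up
// for illustration.
fn metrics_url(
    metric_name: &str,
    resource_id: &str,
    test_start_time: DateTime<Utc>,
) -> String {
    // Compute "now" at call time so that any retry widens the window
    // instead of reusing a stale end bound.
    let end_time = Utc::now();
    format!(
        "/v1/system/metrics/{}?start_time={}&end_time={}&id={}",
        metric_name,
        test_start_time.to_rfc3339_opts(SecondsFormat::Nanos, true),
        end_time.to_rfc3339_opts(SecondsFormat::Nanos, true),
        resource_id,
    )
}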

I'm not sure whether this is also the cause of the test_instance_metrics failure that I saw locally (I no longer have the data for it), and I have not yet confirmed whether it's the cause of the CI failure linked in the description.


Here's the data and analysis.

The test itself is fairly short: https://github.com/oxidecomputer/omicron/blob/adcc588c92bf422af38981f39f2d6907949e1578/nexus/tests/integration_tests/disks.rs#L1350-L1422

Basically, the test creates a project and a disk and checks the utilization metrics at several points along the way.

In the failure above, we blew an assertion in the very last step: while querying for the latest project-level utilization metric, we found zero data points. Strange! We'd already verified this metric earlier in the test. I went down a bunch of blind alleys but ultimately worked backwards.

First, was the final data in Cockroach correct? I used this early log message:

17:22:03.525Z INFO test_disk_metrics: cockroach command line: cockroach start-single-node --insecure --http-addr=:0 --store=path=/dangerzone/omicron_tmp/.tmpagosuV/data,ballast-size=0 --listen-addr 127.0.0.1:0 --listening-url-file /dangerzone/omicron_tmp/.tmpagosuV/listen-url

to find the cockroach store dir and started up the database so I could inspect what was in CockroachDB:

$ cargo run --bin=omicron-dev -- db-run --store-dir=/dangerzone/omicron_tmp/.tmpagosuV/data --no-populate

and connected:

$ cockroach sql --url postgresql://root@127.0.0.1:32221/defaultdb?sslmode=disable
...
root@127.0.0.1:32221/omicron> select * from virtual_provisioning_collection;
                   id                  |         time_modified         | collection_type | virtual_disk_bytes_provisioned | cpus_provisioned | ram_provisioned
---------------------------------------+-------------------------------+-----------------+--------------------------------+------------------+------------------
  001de000-1334-4000-8000-000000000000 | 2023-05-03 17:22:21.216853+00 | Fleet           |                     1073741824 |                4 |      1073741824
  001de000-5110-4000-8000-000000000000 | 2023-05-03 17:22:21.216853+00 | Silo            |                     1073741824 |                4 |      1073741824
  211ebf0f-ec5f-4e4e-a121-221aed1d8443 | 2023-05-03 17:22:12.581317+00 | Silo            |                              0 |                0 |               0
  c32f65a5-d23a-4750-ad67-53194e24184b | 2023-05-03 17:22:21.216853+00 | Project         |                     1073741824 |                4 |      1073741824
(4 rows)

Time: 1ms total (execution 1ms / network 0ms)

root@127.0.0.1:32221/omicron> select * from virtual_provisioning_resource;
                   id                  |         time_modified         | resource_type | virtual_disk_bytes_provisioned | cpus_provisioned | ram_provisioned
---------------------------------------+-------------------------------+---------------+--------------------------------+------------------+------------------
  aa73f0bb-9ccf-4081-aa99-2174a6c8401b | 2023-05-03 17:22:21.216853+00 | instance      |                              0 |                4 |      1073741824
  ff0c8081-306a-46bd-abe3-5849e837fc6e | 2023-05-03 17:22:15.911197+00 | disk          |                     1073741824 |                0 |               0
(2 rows)

That data looks right. So something went wrong propagating it to Clickhouse or getting it back out of Clickhouse/Nexus.

I wanted to check if the data was in Clickhouse, but there's nothing in the test log about where the Clickhouse database was or how to start Clickhouse, so I gave up. I probably could have tried harder to figure out which directory it was based on mtimes in $TMP. But I thought maybe we could get what we need from the log. Near the very end of the bunyan log, we have:

17:22:28.811Z TRAC 56f49b9f-28fa-4bdf-9c74-afcafdccc8c5 (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (4708044686775847191)  AND timestamp >= '2023-05-03 17:22:18.686647437' AND timestamp < '2023-05-03 17:22:38.686654611' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
    id = 91f5c97f-02fd-48f6-a5dd-a63d848aaa0d
...
17:22:28.821Z INFO 56f49b9f-28fa-4bdf-9c74-afcafdccc8c5 (dropshot_external): request completed
    local_addr = 127.0.0.1:43406
    method = GET
    remote_addr = 127.0.0.1:42196
    req_id = 5b1f398d-0d1c-4057-b965-898b099bf9a2
    response_code = 200
    uri = /v1/system/metrics/virtual_disk_space_provisioned?start_time=2023-05-03T17:22:18.686647437Z&end_time=2023-05-03T17:22:38.686654611Z&id=c32f65a5-d23a-4750-ad67-53194e24184b

This last entry looks like the request that the test made to fetch the project-level space utilization metrics. (I confirmed from elsewhere in the log that c32f65a5-d23a-4750-ad67-53194e24184b is the project id that we created.) I infer that the Clickhouse SELECT log message shortly before that corresponds to this request (though I don't think there's a way to be sure -- it'd be nice if we passed the req_id down to that logger -- will file an issue on that). From that log entry, we see the timeseries_key is 4708044686775847191. From my recollection, this key identifies a unique combination of field key-value pairs for a particular metric -- it essentially identifies a sequence of measurements for the same metric (like "disk space provisioned for project foo"). I grep'd for that log message to find everything we recorded about it, like what field values it represents and what data points we recorded:

dap@ivanova ~ $ grep 4708044686775847191 /dangerzone/omicron_tmp/test_all-842a3bf74e59f078-test_disk_metrics.18714.8.log | bunyan
[2023-05-03T10:22:17.021776545-07:00] TRACE: test_disk_metrics/oximeter/18714 on ivanova: (id=af2164cb-159c-4d59-9f3c-73f0601c3da4, collector_id=39e6175b-4df2-4730-b11d-cbc1e60a2e78)
    executing SQL query: INSERT INTO oximeter.fields_uuid FORMAT JSONEachRow
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":4208040974627652584,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":9495459072424719939,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":12585158022270206444,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":5796947998872893699,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":14301957739806698488,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":5565528523582502321,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":15540283427413080090,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":7043406632879339461,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":12222668309244170383,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":15502487255550903642,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":5670937711533663662,"field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":5670937711533663662,"field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":5670937711533663662,"field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":6618305027551830035,"field_name":"id","field_value":"211ebf0f-ec5f-4e4e-a121-221aed1d8443"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":6618305027551830035,"field_name":"id","field_value":"211ebf0f-ec5f-4e4e-a121-221aed1d8443"}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":6618305027551830035,"field_name":"id","field_value":"211ebf0f-ec5f-4e4e-a121-221aed1d8443"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":4708044686775847191,"field_name":"id","field_value":"c32f65a5-d23a-4750-ad67-53194e24184b"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":4708044686775847191,"field_name":"id","field_value":"c32f65a5-d23a-4750-ad67-53194e24184b"}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":4708044686775847191,"field_name":"id","field_value":"c32f65a5-d23a-4750-ad67-53194e24184b"}

[2023-05-03T10:22:17.02943188-07:00] TRACE: test_disk_metrics/oximeter/18714 on ivanova: (id=af2164cb-159c-4d59-9f3c-73f0601c3da4, collector_id=39e6175b-4df2-4730-b11d-cbc1e60a2e78)
    executing SQL query: INSERT INTO oximeter.measurements_i64 FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:06.740977000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:06.740977000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:06.740977000","datum":0}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:06.764747000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:06.764747000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:06.764747000","datum":0}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":6618305027551830035,"timestamp":"2023-05-03 17:22:12.581317000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":6618305027551830035,"timestamp":"2023-05-03 17:22:12.581317000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":6618305027551830035,"timestamp":"2023-05-03 17:22:12.581317000","datum":0}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:14.358782000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:14.358782000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:14.358782000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:15.911197000","datum":1073741824}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:15.911197000","datum":1073741824}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:15.911197000","datum":1073741824}

[2023-05-03T10:22:20.569426974-07:00] TRACE: 56f49b9f-28fa-4bdf-9c74-afcafdccc8c5/ServerContext/18714 on ivanova: executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (4708044686775847191)  AND timestamp >= '2023-05-03 17:22:10.404920042' AND timestamp < '2023-05-03 17:22:30.404924119' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow; (id=bf31ad5f-6474-4d2d-8aef-8b6c743cfeac)
[2023-05-03T10:22:24.237426025-07:00] TRACE: test_disk_metrics/oximeter/18714 on ivanova: (id=af2164cb-159c-4d59-9f3c-73f0601c3da4, collector_id=39e6175b-4df2-4730-b11d-cbc1e60a2e78)
    executing SQL query: INSERT INTO oximeter.fields_uuid FORMAT JSONEachRow
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":15198642437141513576,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":4208040974627652584,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":9495459072424719939,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":12585158022270206444,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":5796947998872893699,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":14301957739806698488,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":5565528523582502321,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":15540283427413080090,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":10096899594926473859,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":7043406632879339461,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":12222668309244170383,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":15502487255550903642,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"http_service:request_latency_histogram","timeseries_key":6566481830284667455,"field_name":"id","field_value":"56f49b9f-28fa-4bdf-9c74-afcafdccc8c5"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":5670937711533663662,"field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":4708044686775847191,"field_name":"id","field_value":"c32f65a5-d23a-4750-ad67-53194e24184b"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":5670937711533663662,"field_name":"id","field_value":"001de000-1334-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":4708044686775847191,"field_name":"id","field_value":"c32f65a5-d23a-4750-ad67-53194e24184b"}

[2023-05-03T10:22:24.248092845-07:00] TRACE: test_disk_metrics/oximeter/18714 on ivanova: (id=af2164cb-159c-4d59-9f3c-73f0601c3da4, collector_id=39e6175b-4df2-4730-b11d-cbc1e60a2e78)
    executing SQL query: INSERT INTO oximeter.measurements_i64 FORMAT JSONEachRow
    {"timeseries_name":"integration_target:integration_metric","timeseries_key":6257040532384654857,"timestamp":"2023-05-03 17:22:24.189096604","datum":10}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:21.216853000","datum":4}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:21.216853000","datum":4}
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:21.216853000","datum":4}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":5670937711533663662,"timestamp":"2023-05-03 17:22:21.216853000","datum":1073741824}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 17:22:21.216853000","datum":1073741824}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":4708044686775847191,"timestamp":"2023-05-03 17:22:21.216853000","datum":1073741824}

[2023-05-03T10:22:28.811263447-07:00] TRACE: 56f49b9f-28fa-4bdf-9c74-afcafdccc8c5/ServerContext/18714 on ivanova: executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (4708044686775847191)  AND timestamp >= '2023-05-03 17:22:18.686647437' AND timestamp < '2023-05-03 17:22:38.686654611' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow; (id=91f5c97f-02fd-48f6-a5dd-a63d848aaa0d)

Now we can combine these log entries to put together a timeline:

17:22:14.358Z measurement created timeseries key 4708044686775847191 virtual disk space provisioned = 0
17:22:15.911Z measurement created timeseries key 4708044686775847191 virtual disk space provisioned = 1073741824
17:22:17.021Z schema created
17:22:17.029Z insert measurements above into Clickhouse
17:22:28.811Z execute Clickhouse query for these measurements with time filter between 17:22:18.686 and 17:22:38.686

We can see clearly from this timeline that the query we made for measurements specifies a time window that's entirely after the last measurement we actually have for this timeseries: the last measurement is timestamped 17:22:15.911, but the window doesn't start until 17:22:18.686. That's just a result of it taking longer than 10s (about 12.8s here) from the last update of the project-level data point to when we went to query for it, so the query's "now - 10s" lower bound excluded it. That's why I think the solution should be as simple as adjusting the time window.

As for what the window bounds should be: for the start time, it seems like we may as well grab a timestamp at the start of the test and go back that far; I can't think of why that wouldn't be far enough. For the end time, we currently use now+10s, which is a little surprising since we're on a single host. I think this must date from when this propagation was asynchronous and we would retry our request several times. It would be better to use "now" every time we make the request, rather than picking "now+10s" once and then retrying with that timestamp forever. Fortunately, the only place we retry is in query_for_metrics_until_they_exist, and I don't think we need to retry at all any more, since the measurements should exist as soon as we go look for them. (If that's wrong, we can keep retrying, but we should generate a new "now" timestamp on every attempt.)
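If we do keep a retry loop, the shape I have in mind is roughly the following sketch (names are illustrative; fetch stands in for the real query that query_for_metrics_until_they_exist makes):

use chrono::{DateTime, Utc};
use std::future::Future;
use std::time::Duration;

// Sketch only: keep the window's start pinned at the start of the test,
// but recompute the end bound on every attempt instead of picking
// "now + 10s" once before the first try.  `fetch` is a stand-in for the
// real metrics query; the measurement type is simplified to i64.
async fn wait_for_measurements<F, Fut>(
    test_start_time: DateTime<Utc>,
    mut fetch: F,
) -> Vec<i64>
where
    F: FnMut(DateTime<Utc>, DateTime<Utc>) -> Fut,
    Fut: Future<Output = Vec<i64>>,
{
    loop {
        // A fresh "now" on every attempt, so the window keeps growing.
        let end_time = Utc::now();
        let measurements = fetch(test_start_time, end_time).await;
        if !measurements.is_empty() {
            return measurements;
        }
        // Brief pause before retrying; a real version should also bound
        // the number of attempts.
        tokio::time::sleep(Duration::from_millis(500)).await;
    }
}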

davepacheco commented 1 year ago

Here's the raw log file from my repro. 2994-disk-repro.log.zip

davepacheco commented 1 year ago

CI test failure: Okay, I've looked at the CI test failure and I think it's the same pattern of problem. The test_instance_metrics test uses a +/- 30s window. We don't have the panic stack, so I don't actually know which assertion failed, and therefore I can't be sure this is right. But I think we can say that we made a query for metrics expecting to find a particular data point, found none, and that there was a data point shortly before the window we searched.

Here's the CI failure from the description: https://buildomat.eng.oxide.computer/wg/0/details/01GZFQ87KD8GCG95KX2XWYB3A3/AbVQVhAR0tTs3LH7Z9V8r8651oeLGkKpzDWjx1S1AL9Y7oKx/01GZFQ8GM5EJ9BYMA8E06X9PFQ

Once again we panicked expecting "at least one measurement". This was in query_for_latest_metric(): https://github.com/oxidecomputer/omicron/blob/adcc588c92bf422af38981f39f2d6907949e1578/nexus/tests/integration_tests/metrics.rs#L34

This time, the last messages were:

03:31:26.579Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT filter0.timeseries_key as timeseries_key, filter0.field_name, filter0.field_value FROM (SELECT * FROM oximeter.fields_uuid WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND field_name = 'id' AND field_value = '001de000-5110-4000-8000-000000000000') AS filter0 ORDER BY (filter0.timeseries_name, filter0.timeseries_key) FORMAT JSONEachRow;
    id = d2b3d3f4-9358-4cac-83f2-c7fe42c75752
03:31:26.592Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:56.306506695' AND timestamp < '2023-05-03 03:31:56.306513549' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
    id = d2b3d3f4-9358-4cac-83f2-c7fe42c75752
03:31:26.603Z INFO dcc72eaa-dba7-45e1-8d41-7953b908644a (dropshot_external): request completed
    local_addr = 127.0.0.1:49617
    method = GET 
    remote_addr = 127.0.0.1:61415
    req_id = ae27985c-3d73-415e-bae0-c27546c1331b
    response_code = 200 
    uri = /v1/system/metrics/virtual_disk_space_provisioned?start_time=2023-05-03T03:30:56.306506695Z&end_time=2023-05-03T03:31:56.306513549Z&id=001de000-5110-4000-8000-000000000000
03:31:26.604Z INFO test_instance_metrics (external client test context): client received response
    status = 200 

From this, we see the timeseries_key 8221072622894359202 and the timestamp window 2023-05-03 03:30:56.306 to 2023-05-03 03:31:56.306. I grep'd for the timeseries key with grep 8221072622894359202 test_all-208dd36b5428cdaf-test_instance_metrics.7237.69.log | looker and then ran that through grep '^0\|8221072622894359202':

03:31:16.443Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.fields_uuid FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
03:31:16.457Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.measurements_i64 FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:30:53.702962000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:30:53.702962000","datum":0}
    {"timeseries_name":"collection_target:virtual_disk_space_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:30:53.702962000","datum":0}
03:31:16.581Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:46.461851813' AND timestamp < '2023-05-03 03:31:46.461858229' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:16.675Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:cpus_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:46.601328046' AND timestamp < '2023-05-03 03:31:46.601333456' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:16.760Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:ram_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:46.685433331' AND timestamp < '2023-05-03 03:31:46.685437711' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:23.591Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.fields_uuid FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
03:31:23.614Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.measurements_i64 FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:31:17.452606000","datum":4}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:31:17.452606000","datum":1073741824}
03:31:23.708Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:53.619113354' AND timestamp < '2023-05-03 03:31:53.619120831' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:23.799Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:cpus_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:53.719379275' AND timestamp < '2023-05-03 03:31:53.719384208' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:23.898Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:ram_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:53.810497375' AND timestamp < '2023-05-03 03:31:53.810502427' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;
03:31:26.289Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.fields_uuid FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"field_name":"id","field_value":"001de000-5110-4000-8000-000000000000"}
03:31:26.300Z TRAC test_instance_metrics (oximeter): executing SQL query: INSERT INTO oximeter.measurements_i64 FORMAT JSONEachRow
    {"timeseries_name":"collection_target:cpus_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:31:25.977307000","datum":0}
    {"timeseries_name":"collection_target:ram_provisioned","timeseries_key":8221072622894359202,"timestamp":"2023-05-03 03:31:25.977307000","datum":0}
03:31:26.592Z TRAC dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): executing SQL query: SELECT * FROM oximeter.measurements_i64 WHERE timeseries_name = 'collection_target:virtual_disk_space_provisioned' AND timeseries_key IN (8221072622894359202)  AND timestamp >= '2023-05-03 03:30:56.306506695' AND timestamp < '2023-05-03 03:31:56.306513549' ORDER BY (timeseries_name, timeseries_key, timestamp) LIMIT 100 FORMAT JSONEachRow;

There's a bunch of noise, but from the SQL we can see we're looking for virtual_disk_space_provisioned again. The history of that one is:

03:30:53.702Z measurement created: datum = 0
03:31:16.443Z schema created
03:31:16.457Z insert measurement above
03:31:26.592Z query for measurements between 03:30:56.306 and 03:31:56.306 and find none

Further, if we go back to the log messages around 03:30:53.702Z, we find stuff like:

03:30:53.690Z INFO dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): created 1 built-in silos
...
03:30:53.721Z DEBG dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): attempting to create silo users
...
03:30:53.961Z INFO dcc72eaa-dba7-45e1-8d41-7953b908644a (ServerContext): populate complete; activating background tasks

So this initial data point (presumably for the Silo-level usage) was created during Nexus's "populate" step, which happens before this test even starts running (that is, while initializing the ControlPlaneTestContext). To fix the window, I think we need ControlPlaneTestContext to record a timestamp before it does anything else and then use that as the start of our window.
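Roughly, that change would look like the following sketch (the field name and setup-function shape here are illustrative; the real context is built by the async setup code in nexus-test-utils):

use chrono::{DateTime, Utc};

// Sketch only: give the test context a timestamp captured before any
// other setup work (database population, oximeter registration, etc.),
// so metrics tests can use it as the start of their query window.
pub struct ControlPlaneTestContext {
    // Captured before Nexus's "populate" step runs, so it predates every
    // measurement recorded during setup.
    pub start_time: DateTime<Utc>,
    // ... existing fields (clients, server handles, etc.) elided ...
}

// Illustrative setup function; the real one takes more arguments and
// starts CockroachDB, ClickHouse, oximeter, and Nexus.
async fn test_setup() -> ControlPlaneTestContext {
    // Grab the timestamp first, before doing anything else.
    let start_time = Utc::now();
    // ... existing setup elided ...
    ControlPlaneTestContext { start_time }
}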