oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0

Add API for fetching details about an oximeter producer #7139

Open bnaecker opened 4 days ago

bnaecker commented 4 days ago
bnaecker commented 3 days ago

This PR adds a bunch of useful debugging information to the oximeter collector. I wrote this after debugging #7120 and the related problems, and to help validate #7097. It adds an endpoint to the oximeter collector for fetching detailed information about a specific producer, such as the time it was registered or updated; the time of the last successful or failed collection; and the total numbers of successful or failed collections. I've added a basic sanity-check test, and an omdb subcommand for exercising it too. Here's what that looks like.

I started up Omicron on my dev Helios machine, and we can list the producers like so:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter list-producers
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
Collector ID: e6b6ef16-59f0-428c-a4da-e00ba1dc7920

Last refresh: 2024-11-22 01:35:32.348074186 UTC

ID                                   ADDRESS                       INTERVAL
1d9c8c20-2218-444b-b860-144b21e4e991 [fd00:1122:3344:101::1]:44361 30s
2d7c6a47-815c-4aab-9c23-9fe6f3af682c [fd00:1122:3344:101::b]:36292 10s
5bd1f5ec-d084-4ebb-8b55-1eef2d1c1002 [fd00:1122:3344:101::c]:58573 10s
6dc59b22-be1a-4c4a-996a-ac5a9cd90870 [fd00:1122:3344:101::2]:8001  1s
79e9a733-2db0-4719-a841-9639afddece2 [fd00:1122:3344:101::a]:57584 10s
b2135f96-792b-44c9-aa6c-b827fa92b556 [fd00:1122:3344:101::1]:8001  1s
c7223ed4-af03-42f9-ace7-7976b8602b4a [fd00:1122:3344:101::2]:4677  1s
d3ec7c1e-99c5-460b-a536-bd5d5d097f68 [fd00:1122:3344:101::2]:40056 10s
f4117644-8add-4fc6-b865-7aa2d2ffb399 [fd00:1122:3344:101::2]:53096 10s

This tool already existed, but now we can drill down to see what's happening in each producer. Looking at a single producer, we get this:

bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 106064ab-9f51-4ae5-b1b0-481c087b2a0f
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 106064ab-9f51-4ae5-b1b0-481c087b2a0f
         Address: [fd00:1122:3344:101::1]:39136
      Registered: 2024-11-22T02:18:03.612Z
         Updated: 2024-11-22T02:18:03.612Z
        Interval: 30s
 Last collection: 2024-11-22T02:28:03.618Z
    Last success: 2024-11-22T02:28:03.657Z (39.004488ms, 846 samples)
    Last failure: Never
       Successes: 21
        Failures: 0

These all show zero failures because things are working fine on my machine. I wanted to experiment a bit to see what happens when things do start to fail, so I disabled one of the Nexus services, which is producer 40badf8b-9c27-4c5d-a010-81b9bc70d0f8, in the corresponding Nexus zone. Once that's done, we start to see this:

note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
         Address: [fd00:1122:3344:101::a]:34819
      Registered: 2024-11-22T02:18:48.600Z
         Updated: 2024-11-22T02:18:48.600Z
        Interval: 10s
 Last collection: 2024-11-22T02:26:28.604Z
    Last success: 2024-11-22T02:26:18.605Z (1.095562ms, 2 samples)
    Last failure: 2024-11-22T02:26:28.605Z (unreachable)
       Successes: 46
        Failures: 1

So now there are some failures, and the last failure line shows when that happened and why (the server was unreachable). After a few seconds, Nexus comes back up and re-registers itself as a producer, which updates oximeter's information about it. We can see that here:

note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
              ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
         Address: [fd00:1122:3344:101::a]:47557
      Registered: 2024-11-22T02:18:48.600Z
         Updated: 2024-11-22T02:26:33.605Z
        Interval: 10s
 Last collection: 2024-11-22T02:26:43.605Z
    Last success: 2024-11-22T02:26:43.607Z (1.658154ms, 2 samples)
    Last failure: 2024-11-22T02:26:28.605Z (unreachable)
       Successes: 47
        Failures: 1

The number of successes has incremented, and the address has changed. Note that the lines printing the last failure and success are sticky, so the last failure will stick around forever, even if it was a long time ago. I've found that pretty helpful.
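
As a rough sketch of the bookkeeping this behavior implies (not the actual oximeter code; `ProducerDetails`, `Success`, `Failure`, and all method names below are hypothetical, and fields such as the last collection time are omitted), the sticky last-success and last-failure records and the effect of re-registration could look something like this:

```rust
use std::net::SocketAddr;
use std::time::{Duration, SystemTime};

/// Outcome of the most recent successful collection.
#[derive(Debug, Clone, PartialEq)]
struct Success {
    at: SystemTime,
    elapsed: Duration,
    n_samples: u64,
}

/// Outcome of the most recent failed collection.
#[derive(Debug, Clone, PartialEq)]
struct Failure {
    at: SystemTime,
    reason: String, // e.g. "unreachable"
}

/// Hypothetical per-producer bookkeeping; an assumption for illustration,
/// not the type used by the oximeter collector.
#[derive(Debug, Clone)]
struct ProducerDetails {
    address: SocketAddr,
    interval: Duration,
    registered: SystemTime,
    updated: SystemTime,
    last_success: Option<Success>,
    last_failure: Option<Failure>,
    n_successes: u64,
    n_failures: u64,
}

impl ProducerDetails {
    /// First registration: `registered` and `updated` start out equal.
    fn new(address: SocketAddr, interval: Duration, now: SystemTime) -> Self {
        Self {
            address,
            interval,
            registered: now,
            updated: now,
            last_success: None,
            last_failure: None,
            n_successes: 0,
            n_failures: 0,
        }
    }

    /// Record a success. `last_failure` is left untouched, so it stays sticky.
    fn on_success(&mut self, at: SystemTime, elapsed: Duration, n_samples: u64) {
        self.last_success = Some(Success { at, elapsed, n_samples });
        self.n_successes += 1;
    }

    /// Record a failure. `last_success` is left untouched for the same reason.
    fn on_failure(&mut self, at: SystemTime, reason: &str) {
        self.last_failure = Some(Failure { at, reason: reason.to_string() });
        self.n_failures += 1;
    }

    /// Re-registration changes the address, interval, and `updated` time,
    /// but keeps `registered`, the counters, and the sticky records.
    fn on_reregister(&mut self, address: SocketAddr, interval: Duration, now: SystemTime) {
        self.address = address;
        self.interval = interval;
        self.updated = now;
    }
}
```

Under these assumptions, a failure followed by a re-registration and another success leaves the failure count and the last-failure record unchanged, while the address and `updated` time move forward, which matches the output shown above.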

This is all in addition to the timeseries we're already reporting, which show the cumulative number of collections and failures, broken down by the reason for the failure. We can see this failure here:

bnaecker@shale : ~/omicron $ ./target/release/omdb oxql
note: ClickHouse URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
Oximeter Query Language shell

Basic commands:
  \?, \h, help       - Print this help
  \q, quit, exit, ^D - Exit the shell
  \l                 - List timeseries
  \d <timeseries>    - Describe a timeseries
  \ql [<operation>]  - Get OxQL help about an operation

Or try entering an OxQL `get` query
0x〉get oximeter_collector:failed_collections | filter producer_id == "40badf8b-9c27-4c5d-a010-81b9bc70d0f8" | last 1

oximeter_collector:failed_collections

 base_route:
 collector_id: b8043883-0e83-4e39-9057-c1189a1905d2
 collector_ip: fd00:1122:3344:101::d
 collector_port: 12223
 producer_id: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
 producer_ip: fd00:1122:3344:101::a
 producer_port: 34819
 reason: unreachable
   [2024-11-22 02:26:28.605131895, 2024-11-22 02:33:48.609147507]: [1]
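
For readers unfamiliar with cumulative timeseries, here is a minimal sketch of the idea behind a failure count broken down by reason. It is an illustration only: `FailedCollections` and its methods are hypothetical names, not oximeter's API, and the `"other"` reason in the usage note is made up. The only reason that appears in the output above is `unreachable`.

```rust
use std::collections::BTreeMap;

/// Hypothetical cumulative failure counter for one producer, keyed by the
/// failure reason. An assumption for illustration, not oximeter's own type.
#[derive(Debug, Default)]
struct FailedCollections {
    by_reason: BTreeMap<String, u64>,
}

impl FailedCollections {
    /// Increment the running total for `reason` and return the new value.
    /// Totals only ever grow, which is what makes the series cumulative.
    fn record(&mut self, reason: &str) -> u64 {
        let count = self.by_reason.entry(reason.to_string()).or_insert(0);
        *count += 1;
        *count
    }

    /// Current cumulative total for `reason` (zero if never seen).
    fn total(&self, reason: &str) -> u64 {
        self.by_reason.get(reason).copied().unwrap_or(0)
    }
}
```

With this shape, recording one `"unreachable"` failure gives a total of 1 for that reason, and a later failure with a different reason such as `"other"` starts its own independent total. That is consistent with the single `[1]` value in the query result above.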