Open bnaecker opened 4 days ago
This PR adds a bunch of useful debugging information to the oximeter collector. I wrote this after debugging #7120 and the related problems, and to help validate #7097. It adds an endpoint to the oximeter collector to fetch detailed information about a specific producer, such as the time it was registered or updated; the time of the last successful or failed collection; and the total numbers of successful or failed collections. I've added a basic sanity-check test, and an omdb subcommand for exercising it too. Here's what that looks like.
I started up Omicron on my dev Helios machine, and we can list the producers like so:
bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter list-producers
note: Oximeter URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
Collector ID: e6b6ef16-59f0-428c-a4da-e00ba1dc7920
Last refresh: 2024-11-22 01:35:32.348074186 UTC
ID ADDRESS INTERVAL
1d9c8c20-2218-444b-b860-144b21e4e991 [fd00:1122:3344:101::1]:44361 30s
2d7c6a47-815c-4aab-9c23-9fe6f3af682c [fd00:1122:3344:101::b]:36292 10s
5bd1f5ec-d084-4ebb-8b55-1eef2d1c1002 [fd00:1122:3344:101::c]:58573 10s
6dc59b22-be1a-4c4a-996a-ac5a9cd90870 [fd00:1122:3344:101::2]:8001 1s
79e9a733-2db0-4719-a841-9639afddece2 [fd00:1122:3344:101::a]:57584 10s
b2135f96-792b-44c9-aa6c-b827fa92b556 [fd00:1122:3344:101::1]:8001 1s
c7223ed4-af03-42f9-ace7-7976b8602b4a [fd00:1122:3344:101::2]:4677 1s
d3ec7c1e-99c5-460b-a536-bd5d5d097f68 [fd00:1122:3344:101::2]:40056 10s
f4117644-8add-4fc6-b865-7aa2d2ffb399 [fd00:1122:3344:101::2]:53096 10s
This tool already existed, but now we can drill down to see what's happening in each. Just looking at the first one we get this:
bnaecker@shale : ~/omicron $ ./target/release/omdb oximeter producer-details 106064ab-9f51-4ae5-b1b0-481c087b2a0f
note: Oximeter URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
ID: 106064ab-9f51-4ae5-b1b0-481c087b2a0f
Address: [fd00:1122:3344:101::1]:39136
Registered: 2024-11-22T02:18:03.612Z
Updated: 2024-11-22T02:18:03.612Z
Interval: 30s
Last collection: 2024-11-22T02:28:03.618Z
Last success: 2024-11-22T02:28:03.657Z (39.004488ms, 846 samples)
Last failure: Never
Successes: 21
Failures: 0
These all show zero failures because things are working fine on my machine. I wanted to experiment a bit to see what happens when things do start to fail. So I disabled one of the Nexus services, which is producer 40badf8b-9c27-4c5d-a010-81b9bc70d0f8, in the corresponding Nexus zone. When we do that, we start to see this:
note: Oximeter URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
Address: [fd00:1122:3344:101::a]:34819
Registered: 2024-11-22T02:18:48.600Z
Updated: 2024-11-22T02:18:48.600Z
Interval: 10s
Last collection: 2024-11-22T02:26:28.604Z
Last success: 2024-11-22T02:26:18.605Z (1.095562ms, 2 samples)
Last failure: 2024-11-22T02:26:28.605Z (unreachable)
Successes: 46
Failures: 1
So now there are some failures, and the last failure line shows when that happened and why (the server was unreachable). After a few seconds, Nexus comes back up and re-registers itself as a producer, which updates oximeter's information about it. We can see that here:
note: Oximeter URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:101::d]:12223
ID: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
Address: [fd00:1122:3344:101::a]:47557
Registered: 2024-11-22T02:18:48.600Z
Updated: 2024-11-22T02:26:33.605Z
Interval: 10s
Last collection: 2024-11-22T02:26:43.605Z
Last success: 2024-11-22T02:26:43.607Z (1.658154ms, 2 samples)
Last failure: 2024-11-22T02:26:28.605Z (unreachable)
Successes: 47
Failures: 1
The number of successes has incremented, and the address has changed. Note that the lines printing the last failure and success are sticky, so the last failure will stick around forever, even if it was a long time ago. I've found that pretty helpful.
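To make the sticky behavior concrete, here is a minimal sketch of how per-producer details like these could be tracked. It is not the code in this PR: every type, field, and method name below is hypothetical, it uses only the standard library, and the `Other` failure variant is an assumed catch-all (only `unreachable` appears in the output above).

```rust
use std::net::SocketAddr;
use std::time::{Duration, SystemTime};

/// Why a collection failed. Only "unreachable" appears in the output above;
/// `Other` is a hypothetical catch-all for illustration.
#[derive(Clone, Debug, PartialEq)]
pub enum FailureReason {
    Unreachable,
    Other(String),
}

/// Summary of the most recent successful collection (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct SuccessDetails {
    pub time: SystemTime,
    pub duration: Duration,
    pub n_samples: u64,
}

/// Summary of the most recent failed collection (hypothetical type).
#[derive(Clone, Debug, PartialEq)]
pub struct FailureDetails {
    pub time: SystemTime,
    pub reason: FailureReason,
}

/// Per-producer bookkeeping mirroring the fields printed by the subcommand.
#[derive(Clone, Debug)]
pub struct ProducerDetails {
    pub address: SocketAddr,
    pub registered: SystemTime,
    pub updated: SystemTime,
    pub last_success: Option<SuccessDetails>,
    pub last_failure: Option<FailureDetails>,
    pub n_successes: u64,
    pub n_failures: u64,
}

impl ProducerDetails {
    /// First registration: registered and updated times start out equal.
    pub fn new(address: SocketAddr, now: SystemTime) -> Self {
        Self {
            address,
            registered: now,
            updated: now,
            last_success: None,
            last_failure: None,
            n_successes: 0,
            n_failures: 0,
        }
    }

    /// Re-registration: refresh the address and update time, but keep the
    /// original registration time, the counters, and the last results.
    pub fn update(&mut self, address: SocketAddr, now: SystemTime) {
        self.address = address;
        self.updated = now;
    }

    /// Record a success. The last failure is deliberately left untouched,
    /// which is what makes it "sticky".
    pub fn on_success(&mut self, time: SystemTime, duration: Duration, n_samples: u64) {
        self.last_success = Some(SuccessDetails { time, duration, n_samples });
        self.n_successes += 1;
    }

    /// Record a failure. The last success is likewise left untouched.
    pub fn on_failure(&mut self, time: SystemTime, reason: FailureReason) {
        self.last_failure = Some(FailureDetails { time, reason });
        self.n_failures += 1;
    }
}
```

In this sketch, a failure followed by a re-registration and another success leaves `last_failure` set, bumps `n_successes`, and changes `address` and `updated` while `registered` stays put, matching the before and after outputs above.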
This is all in addition to the timeseries we're already reporting showing the cumulative number of collections and failures, broken down by the reason for the failure. We can see this failure here:
bnaecker@shale : ~/omicron $ ./target/release/omdb oxql
note: ClickHouse URL not specified. Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
Oximeter Query Language shell
Basic commands:
\?, \h, help - Print this help
\q, quit, exit, ^D - Exit the shell
\l - List timeseries
\d <timeseries> - Describe a timeseries
\ql [<operation>] - Get OxQL help about an operation
Or try entering an OxQL `get` query
0x〉get oximeter_collector:failed_collections | filter producer_id == "40badf8b-9c27-4c5d-a010-81b9bc70d0f8" | last 1
oximeter_collector:failed_collections
base_route:
collector_id: b8043883-0e83-4e39-9057-c1189a1905d2
collector_ip: fd00:1122:3344:101::d
collector_port: 12223
producer_id: 40badf8b-9c27-4c5d-a010-81b9bc70d0f8
producer_ip: fd00:1122:3344:101::a
producer_port: 34819
reason: unreachable
[2024-11-22 02:26:28.605131895, 2024-11-22 02:33:48.609147507]: [1]
Add producer_details API to oximeter collector, which returns information about registration time, update time, and collection summaries.
Add omdb oximeter producer-details subcommand for printing