oxidecomputer / omicron

Omicron: Oxide control plane

`sled_data_link` no longer returns any metrics #7120

Closed askfongjojo closed 22 hours ago

askfongjojo commented 1 day ago

The issue is seen on rack2 (note: we don't have this problem with switch link metrics). As a sanity check, the same query still returns data on rack3 which is running on v11.

ubuntu@lab:~/oxide.rs/target/debug$ oxide experimental timeseries query --query "{ get sled_data_link:bytes_sent } | filter timestamp > @now() - 5m | first 1" 
{
  "tables": [
    {
      "name": "sled_data_link:bytes_sent",
      "timeseries": {}
    }
  ]
}
ubuntu@lab:~/oxide.rs/target/debug$ oxide version
WARNING: 644 permissions on "/home/ubuntu/.config/oxide/credentials.toml" may allow other users to access your login credentials.
Oxide CLI 0.8.0+20241009.0
Built from commit: c970f5a1bb003b2b7ba8ebbfe8c01f90b9555b60 
Oxide API: 20241204.0.0
bnaecker commented 1 day ago

Well that's not good! It doesn't appear that this is due to moving the endpoint from /v1/timeseries/query to /v1/system/timeseries/query, but maybe @david-crespo can double-check that.

Looking directly at the data in ClickHouse, it appears the last data was from yesterday:

oxz_clickhouse_aa646c82-c6d7-4d0c-8401-150130927759.local :) select timeseries_key, max(timestamp) as t from measurements_cumulativeu64 where timeseries_name = 'sled_data_link:bytes_sent' group by timeseries_key order by t desc limit 10;

SELECT
    timeseries_key,
    max(timestamp) AS t
FROM measurements_cumulativeu64
WHERE timeseries_name = 'sled_data_link:bytes_sent'
GROUP BY timeseries_key
ORDER BY t DESC
LIMIT 10

Query id: c34881a3-dc2c-42c0-80a5-349b57036dec

┌───────timeseries_key─┬─────────────────────────────t─┐
│ 16989108855604198573 │ 2024-11-19 08:15:14.686777292 │
│  2870453436248314684 │ 2024-11-19 08:15:14.686188829 │
│ 12030171303995731862 │ 2024-11-19 08:15:14.685532198 │
│ 15729978772418892426 │ 2024-11-19 08:15:14.684959023 │
│ 14352121081199582386 │ 2024-11-19 08:15:14.684183022 │
│ 11179514372457892773 │ 2024-11-19 08:15:14.683122948 │
│ 16506661459693370751 │ 2024-11-19 08:15:14.681955510 │
│ 11118514652132868344 │ 2024-11-19 08:15:14.681254360 │
│  2016320935908826646 │ 2024-11-19 08:15:14.680637644 │
│ 11600592454884580196 │ 2024-11-19 08:15:14.679946114 │
└──────────────────────┴───────────────────────────────┘

10 rows in set. Elapsed: 0.514 sec. Processed 2.25 million rows, 112.36 MB (4.39 million rows/s., 218.63 MB/s.)
Peak memory usage: 57.06 MiB.

Trawling the ClickHouse log files confirms that some data is still being inserted, which is good.

bnaecker commented 1 day ago

So, it doesn't appear that oximeter is collecting from the sled-agents at all anymore. They show up in its list of assigned producers, but I see no requests to them.

Here are the sled-agents:

root@oxz_switch1:~# omdb db sleds
note: database URL not specified.  Will search DNS.
note: (override with --db-url or OMDB_DB_URL)
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using database URL postgresql://root@[fd00:1122:3344:109::3]:32221,[fd00:1122:3344:105::3]:32221,[fd00:1122:3344:10b::3]:32221,[fd00:1122:3344:107::3]:32221,[fd00:1122:3344:108::3]:32221/omicron?sslmode=disable
note: database schema version matches expected (114.0.0)
note: listing all commissioned sleds (use -F to filter, e.g. -F in-service)
 SERIAL       IP                             ROLE      POLICY             STATE   ID
 BRM42220057  [fd00:1122:3344:104::1]:12345  -         in service         active  0c7011f7-a4bf-4daf-90cc-1c2410103300
 BRM44220011  [fd00:1122:3344:103::1]:12345  -         in service         active  2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa
 BRM42220017  [fd00:1122:3344:109::1]:12345  -         in service         active  5f6720b8-8a31-45f8-8c94-8e699218f28b
 BRM42220051  [fd00:1122:3344:10b::1]:12345  scrimlet  in service         active  71def415-55ad-46b4-ba88-3ca55d7fb287
 BRM44220010  [fd00:1122:3344:101::1]:12345  -         in service         active  7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c
 BRM42220014  [fd00:1122:3344:108::1]:12345  scrimlet  in service         active  87c2c4fc-b0c7-4fef-a305-78f0ed265bbc
 BRM42220031  [fd00:1122:3344:102::1]:12345  -         in service         active  a2adea92-b56e-44fc-8a0d-7d63b5fd3b93
 BRM42220006  [fd00:1122:3344:106::1]:12345  -         in service         active  b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88
 BRM27230045  [fd00:1122:3344:123::1]:12345  -         not provisionable  active  bd96ef7c-4941-4729-b6f7-5f47feecbc4b
 BRM42220009  [fd00:1122:3344:107::1]:12345  -         not provisionable  active  db183874-65b5-4263-a1c1-ddb2737ae0e9
 BRM42220016  [fd00:1122:3344:10a::1]:12345  -         in service         active  dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd
 BRM44220005  [fd00:1122:3344:105::1]:12345  -         in service         active  f15774c1-b8e5-434f-a493-ec43f96cba06

Let's pick the last one, randomly, and see what's going on there. Does oximeter know about it?

root@oxz_switch1:~# omdb oximeter list-producers | grep f15774c1-b8e5-434f-a493-ec43f96cba06
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:10a::3]:12223
f15774c1-b8e5-434f-a493-ec43f96cba06 [fd00:1122:3344:105::1]:40143     30s

Yes, oximeter thinks that sled-agent is in its list of producers, so that's good. But when we go over to oximeter, and look for collections to that address:

root@oxz_oximeter_da510a57:~# grep -c '[fd00:1122:3344:10a::1]:40143' $(svcs -L oximeter)
0

That's not good. Just to confirm, it does look like oximeter is collecting some stuff at the moment:

root@oxz_oximeter_da510a57:~# grep 'fd00:1122:3344:10a::1' $(svcs -L oximeter) | tail -3 | looker
00:10:34.661Z DEBG oximeter (oximeter-agent): collected results from producer
    address = [fd00:1122:3344:10a::1]:8001
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    n_results = 1
    producer_id = 7920845d-61c2-4f6d-b3c2-94583f2fc574
00:10:35.656Z DEBG oximeter (oximeter-agent): collecting from producer
    address = [fd00:1122:3344:10a::1]:8001
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 7920845d-61c2-4f6d-b3c2-94583f2fc574
00:10:35.660Z DEBG oximeter (oximeter-agent): collected results from producer
    address = [fd00:1122:3344:10a::1]:8001
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    n_results = 1
    producer_id = 7920845d-61c2-4f6d-b3c2-94583f2fc574

That's on the same sled, but 8001 is one of the routing daemon's producers (mgd, I think? not sure).

So when was the last time we collected from the sled-agent? There's no easy way to answer this right now (something we should add), other than looking at the log files. The last collections appear to have been around 2024-11-19 08:15:14Z, which is 1732004114 or so in Unix time. Looking at all the log files in the relevant time range, and grepping for messages indicating we collected from the producer, we see these are the only ones where that message shows up:

BRM42220016 # for f in $(/opt/oxide/oxlog/oxlog logs --archived oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b | grep "\.1732.*" | sort -r); do echo "$f"; grep -c 'collecting from producer.*f15774c1-b8e5-434f-a493-ec43f96cba06' "$f"; done
<a bunch of files with '0'>
/pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732013100
0
/pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732008601
2
/pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732004101
6

So the log file ending in 1732008601 seems like the last one we actually hit that producer in. Let's take a look at that, and see what's going on.

bnaecker commented 1 day ago

Not sure if this is to blame yet, but it definitely doesn't seem like a good idea. It appears that we're refreshing our list of producers from Nexus extremely rapidly, or at least that there is no delay between attempts to refresh it:

BRM42220016 # grep "list of producers from Nexus" /pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732008601 | head -10 | looker
08:15:15.408Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:15:15.408Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:15:15.460Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:15:15.460Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:15:15.508Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:15:15.508Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:15:15.555Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:15:15.555Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:15:24.647Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:15:24.647Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772

In those first two log messages, we complete a refresh at 08:15:15.408Z and then immediately start another refresh. We do that several more times in quick succession, and then the refresh started at 08:15:15.555Z takes about 9 seconds to complete:

08:15:15.555Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:15:24.647Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0

But then we go right back to refreshing the list again immediately:

08:15:24.647Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772

I'd like to know what the actual refresh interval is supposed to be. Our default is 15 seconds, but we're definitely not waiting that long between starting them. It's also possible we're hitting the bursty missed-tick behavior of Tokio's interval somehow. That's the default (MissedTickBehavior::Burst), and it would cause us to fire a queued-up burst of refreshes if one of them happened to take a really long time. Let's see if the refresh before this did take a while.
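
To illustrate that behavior, here's a minimal sketch, not the agent code, using a made-up 1-second interval: with the default MissedTickBehavior::Burst, a tick handler that runs long causes the missed ticks to fire back-to-back afterwards, while MissedTickBehavior::Delay would space them out by the full period again.

use std::time::Duration;
use tokio::time::{interval, Instant, MissedTickBehavior};

#[tokio::main]
async fn main() {
    // A 1s interval standing in for the 15s producer-refresh interval.
    let mut ticker = interval(Duration::from_secs(1));
    // Burst is the default: if ticks are missed, the queued ticks all fire
    // immediately to catch up. Swapping in MissedTickBehavior::Delay would
    // instead space subsequent ticks out by the full period again.
    ticker.set_missed_tick_behavior(MissedTickBehavior::Burst);

    let start = Instant::now();
    for i in 0..8 {
        ticker.tick().await;
        println!("tick {i} at {:?}", start.elapsed());
        if i == 1 {
            // Simulate one refresh taking much longer than the interval,
            // like the ~9s refresh seen in the logs above.
            tokio::time::sleep(Duration::from_secs(4)).await;
        }
    }
}

That catch-up burst looks a lot like the back-to-back refreshed/refreshing pattern in the log above.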

bnaecker commented 1 day ago

From the previous log file, we see that refreshes do appear to be every 15s (with some jitter), and are taking less than 100ms to complete in general:

BRM42220016 # grep "list of producers from Nexus" /pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732004101 | tail -10 | looker
08:12:44.816Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:12:59.757Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:12:59.807Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:13:14.758Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:13:14.832Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:13:29.759Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:13:29.812Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:13:44.760Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:13:44.815Z INFO oximeter (oximeter-agent): refreshed list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:820
    n_current_tasks = 80
    n_pruned_tasks = 0
08:13:59.760Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772

That last one, however, doesn't seem to have completed in this file. If we assume that the log message indicating that the refresh completed is the first message in the next file (we should add a sequence / generation number to these), then that happened at 08:15:15.408Z, which is about 1 min 15 sec later. That's 5 refresh intervals or so. Looking deeper into the later log file, we see that the next six refreshes happen more or less immediately:

BRM42220016 # grep "list of producers from Nexus" /pool/ext/9005671f-3d90-4ed1-be15-ad65b9a65bd5/crypt/debug/oxz_oximeter_da510a57-3af1-4d2b-b2ed-2e8849f27d8b/oxide-oximeter:default.log.1732008601 | head -20 | jq '.time + ": " + (.msg | split(" ")[0])'
"2024-11-19T08:15:15.408269739Z: refreshed"
"2024-11-19T08:15:15.408279307Z: refreshing"
"2024-11-19T08:15:15.460953026Z: refreshed"
"2024-11-19T08:15:15.460961242Z: refreshing"
"2024-11-19T08:15:15.508338528Z: refreshed"
"2024-11-19T08:15:15.508347726Z: refreshing"
"2024-11-19T08:15:15.555234473Z: refreshed"
"2024-11-19T08:15:15.555242508Z: refreshing"
"2024-11-19T08:15:24.647758957Z: refreshed"
"2024-11-19T08:15:24.647804213Z: refreshing"
"2024-11-19T08:15:24.719137211Z: refreshed"
"2024-11-19T08:15:29.766054794Z: refreshing"
"2024-11-19T08:15:29.820770064Z: refreshed"
"2024-11-19T08:15:44.76646938Z: refreshing"
"2024-11-19T08:15:44.81941319Z: refreshed"
"2024-11-19T08:15:59.767419469Z: refreshing"
"2024-11-19T08:15:59.812106157Z: refreshed"
"2024-11-19T08:16:14.808467669Z: refreshing"
"2024-11-19T08:16:14.858856618Z: refreshed"
"2024-11-19T08:16:29.769225724Z: refreshing"

It's not until 2024-11-19T08:15:44.76646938Z that there's what appears to be a normal, 15s interval between the completion of a refresh (messages with "refreshed", past tense) and the start of the next ("refreshing", present tense).

bnaecker commented 1 day ago

Ok, I don't have strong evidence for this yet, but I'm pretty sure I made a very silly mistake here. I was struggling to see how a bunch of producer refreshes could stall the collection task itself. The refresh task and the collection tasks don't interact much, but they do communicate: a refresh resets the collection task's timer. Unless I'm missing something, this means any producer with a collection interval longer than the refresh interval will eventually never be collected from, because the refresh keeps resetting its timer before it can fire.
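
To make the failure mode concrete, here's a minimal, self-contained sketch (made-up 1s/3s intervals and names, not the actual agent code) of a select loop in which a periodic refresh unconditionally resets a longer collection timer, so the collection arm never runs:

use std::time::Duration;
use tokio::time::{sleep, Instant};

#[tokio::main]
async fn main() {
    // Hypothetical stand-ins: a 3s "collection interval" and a 1s "refresh interval".
    let collection_interval = Duration::from_secs(3);
    let mut refresh = tokio::time::interval(Duration::from_secs(1));

    // The per-producer collection timer.
    let collection_timer = sleep(collection_interval);
    tokio::pin!(collection_timer);

    let start = Instant::now();
    while start.elapsed() < Duration::from_secs(10) {
        tokio::select! {
            // Never reached: the refresh arm below always resets this timer
            // before its 3s deadline can fire.
            _ = &mut collection_timer => {
                println!("collected at {:?}", start.elapsed());
                collection_timer.as_mut().reset(Instant::now() + collection_interval);
            }
            // Each refresh unconditionally pushes the collection deadline out
            // again, mirroring the unconditional reset described above.
            _ = refresh.tick() => {
                println!("refreshed at {:?}", start.elapsed());
                collection_timer.as_mut().reset(Instant::now() + collection_interval);
            }
        }
    }
}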

This explains why we see some data. All the sled agents have a collection interval of 30s, while everything else is shorter:

root@oxz_switch1:~# omdb oximeter list-producers | grep 30s
note: Oximeter URL not specified.  Will pick one from DNS.
note: using DNS server for subnet fd00:1122:3344::/48
note: (if this is not right, use --dns-server to specify an alternate DNS server)
note: using Oximeter URL http://[fd00:1122:3344:10a::3]:12223
0c7011f7-a4bf-4daf-90cc-1c2410103300 [fd00:1122:3344:104::1]:55337     30s
2707b587-9c7f-4fb0-a7af-37c3b7a9a0fa [fd00:1122:3344:103::1]:38687     30s
5f6720b8-8a31-45f8-8c94-8e699218f28b [fd00:1122:3344:109::1]:44121     30s
71def415-55ad-46b4-ba88-3ca55d7fb287 [fd00:1122:3344:10b::1]:53772     30s
7b862eb6-7f50-4c2f-b9a6-0d12ac913d3c [fd00:1122:3344:101::1]:39946     30s
87c2c4fc-b0c7-4fef-a305-78f0ed265bbc [fd00:1122:3344:108::1]:44143     30s
a2adea92-b56e-44fc-8a0d-7d63b5fd3b93 [fd00:1122:3344:102::1]:52776     30s
b886b58a-1e3f-4be1-b9f2-0c2e66c6bc88 [fd00:1122:3344:106::1]:59179     30s
bd96ef7c-4941-4729-b6f7-5f47feecbc4b [fd00:1122:3344:123::1]:47544     30s
db183874-65b5-4263-a1c1-ddb2737ae0e9 [fd00:1122:3344:107::1]:55268     30s
dd83e75a-1edf-4aa1-89a0-cd8b2091a7cd [fd00:1122:3344:10a::1]:64549     30s
f15774c1-b8e5-434f-a493-ec43f96cba06 [fd00:1122:3344:105::1]:40143     30s

That was dumb.

bnaecker commented 1 day ago

For reference, here's the line that unconditionally resets the producer's collection timer:

https://github.com/oxidecomputer/omicron/blob/0b1d42d59867e9a0108dda6f3206c4753c801842/oximeter/collector/src/agent.rs#L196

That reset should really happen only when required, e.g., when the endpoint information has actually changed (new address != old address, or new interval != old interval).
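
A rough sketch of what that could look like, with hypothetical types and field names rather than the real agent code: compare the stored endpoint info with the update and only touch the timer when something changed.

use std::net::SocketAddr;
use std::time::Duration;

/// Hypothetical stand-in for the producer endpoint info the agent tracks.
#[derive(PartialEq, Eq)]
struct ProducerEndpoint {
    address: SocketAddr,
    interval: Duration,
}

/// Apply an updated endpoint, resetting the collection timer only when the
/// update actually changes the address or interval.
fn apply_update(
    current: &mut ProducerEndpoint,
    new: ProducerEndpoint,
    reset_collection_timer: impl FnOnce(Duration),
) {
    if *current != new {
        // Address or interval changed: restart the timer so the new
        // interval (and target) take effect immediately.
        reset_collection_timer(new.interval);
        *current = new;
    }
    // Otherwise leave the timer alone, so producers whose collection
    // interval is longer than the refresh interval still get collected.
}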