oxidecomputer / omicron

Omicron: Oxide control plane
Mozilla Public License 2.0
252 stars 40 forks source link

`oximeter` is too eager to delete its entire state #7124

Closed bnaecker closed 22 hours ago

bnaecker commented 1 day ago

While debugging #7120, I dug into a bunch of oximeter logs. In one of them, I noticed this sequence:

08:10:29.749Z INFO oximeter (oximeter-agent): refreshing list of producers from Nexus
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    file = oximeter/collector/src/agent.rs:772
08:10:29.749Z DEBG oximeter: client request
    body = None
    method = GET
    uri = http://[fd00:1122:3344:103::3]:12221/metrics/collectors/da510a57-3af1-4d2b-b2ed-2e8849f27d8b/producers?sort_
by=id_ascending
08:10:29.750Z DEBG oximeter: client response
    result = Err(reqwest::Error { kind: Request, url: "http://[fd00:1122:3344:103::3]:12221/metrics/collectors/da510a5
7-3af1-4d2b-b2ed-2e8849f27d8b/producers?sort_by=id_ascending", source: hyper_util::client::legacy::Error(Connect, Conn
ectError("tcp connect error", Os { code: 146, kind: ConnectionRefused, message: "Connection refused" })) })
08:10:29.750Z ERRO oximeter (oximeter-agent): error fetching next assigned producer
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    err = Communication Error: error sending request for url (http://[fd00:1122:3344:103::3]:12221/metrics/collectors/da510a57-3af1-4d2b-b2ed-2e8849f27d8b/producers?sort_by=id_ascending)
    file = oximeter/collector/src/agent.rs:786
08:10:29.750Z DEBG oximeter (oximeter-agent): removed collection task from set
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 00c20c49-e335-48de-997e-4abc3d2ef253
08:10:29.750Z DEBG oximeter (oximeter-agent): shut down collection task
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 00c20c49-e335-48de-997e-4abc3d2ef253
08:10:29.750Z DEBG oximeter (oximeter-agent): removed collection task from set
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 023ccdfc-6e6d-45c5-ac16-0a05e1a97f63
08:10:29.750Z DEBG oximeter (oximeter-agent): shut down collection task
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 023ccdfc-6e6d-45c5-ac16-0a05e1a97f63
08:10:29.750Z DEBG oximeter (oximeter-agent): removed collection task from set
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 055cc99b-88b8-426c-989f-60922ab4d347
08:10:29.750Z DEBG oximeter (oximeter-agent): shut down collection task
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 055cc99b-88b8-426c-989f-60922ab4d347
08:10:29.750Z DEBG oximeter (oximeter-agent): removed collection task from set
    collector_id = da510a57-3af1-4d2b-b2ed-2e8849f27d8b
    collector_ip = fd00:1122:3344:10a::3
    producer_id = 0c7011f7-a4bf-4daf-90cc-1c2410103300
...

oximeter periodically refreshes its list of producers from Nexus, currently every 15s. Here's the code that actually does that:

https://github.com/oxidecomputer/omicron/blob/0b1d42d59867e9a0108dda6f3206c4753c801842/oximeter/collector/src/agent.rs#L775-L816

That request to Nexus can obviously fail, which we try to handle in the first match arm inside that loop. It logs an error, but does not return early from the function if we fail to get a list of producers. I think that was an attempt to avoid wreaking lots of havoc if we can't get the producers (by continuing to try to get the rest of them), which ultimately wreaks a shit-ton of havoc. If we get an ECONNREFUSED from the calle to Nexus, we'll have an empty map in expected_producers, and we'll go on to delete everyone we've ever loved.

We should definitely not do that. I think that any kind of communication-related error here should really cause us to just bail the whole function. We don't know what information we have from Nexus, or whether it was complete, partial, or entirely empty. We should avoid doing anything -- it's much better that we continue to try to collect from the producers we already know about, even if those will always fail. We should only update our list based on what Nexus tells us when we're reasonably confident we have the full list.