While debugging #7120, I dug into a bunch of `oximeter` logs. In one of them, I noticed this sequence:

`oximeter` periodically refreshes its list of producers from Nexus, currently every 15s. Here's the code that actually does that:

https://github.com/oxidecomputer/omicron/blob/0b1d42d59867e9a0108dda6f3206c4753c801842/oximeter/collector/src/agent.rs#L775-L816

That request to Nexus can obviously fail, which we try to handle in the first match arm inside that `loop`. It logs an error, but does not return early from the function if we fail to get a list of producers. I think that was an attempt to avoid wreaking lots of havoc if we can't get the producers (by continuing to try to get the rest of them), which ultimately wreaks a shit-ton of havoc. If we get an `ECONNREFUSED` from the call to Nexus, we'll have an empty map in `expected_producers`, and we'll go on to delete everyone we've ever loved.

We should definitely not do that. I think that any kind of communication-related error here should really cause us to just bail the whole function. We don't know what information we have from Nexus, or whether it was complete, partial, or entirely empty. We should avoid doing anything -- it's much better that we continue to try to collect from the producers we already know about, even if those will always fail. We should only update our list based on what Nexus tells us when we're reasonably confident we have the full list.
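
To make the proposal concrete, here's a rough sketch of the "bail on any error" shape. This is not the actual `agent.rs` code: `Producer`, `NexusError`, `refresh_producers`, and the `fetch_page` closure are all hypothetical stand-ins for the real types and the paginated Nexus client call. The only point is that the existing map is left untouched unless every page was fetched successfully.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the producer info Nexus returns.
#[derive(Clone, Debug, PartialEq)]
pub struct Producer {
    pub id: u32,
    pub address: String,
}

/// Hypothetical error covering any communication failure with Nexus.
#[derive(Debug, PartialEq)]
pub struct NexusError(pub String);

/// Refresh `current` from a paginated listing.
///
/// `fetch_page` is a hypothetical closure standing in for the Nexus client:
/// given a page index it returns `Ok(Some(page))`, `Ok(None)` once the
/// listing is exhausted, or `Err` on a communication failure.
pub fn refresh_producers<F>(
    current: &mut BTreeMap<u32, Producer>,
    mut fetch_page: F,
) -> Result<(), NexusError>
where
    F: FnMut(usize) -> Result<Option<Vec<Producer>>, NexusError>,
{
    // Build the new view off to the side, never in place.
    let mut expected = BTreeMap::new();
    let mut page = 0;

    // `?` returns on the first error, before `current` has been touched,
    // so a partial or empty listing can never prune known producers.
    while let Some(producers) = fetch_page(page)? {
        for p in producers {
            expected.insert(p.id, p);
        }
        page += 1;
    }

    // Only with the complete list in hand do we replace what we know.
    *current = expected;
    Ok(())
}
```

With this shape, an `ECONNREFUSED` on the first page or a failure halfway through the listing both return `Err` and leave the existing producers in place, so we keep trying to collect from them until the next refresh succeeds.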