Closed smklein closed 2 weeks ago
I hit this locally and looked into it for a bit. I think this might be a bug / race in Oximeter producer assignment. There are two ways Oximeter decides what its producers are:
Oximeter logs all of these; snipping out just the relevant bits from a failed test:
13:42:53.584Z INFO test_mgs_metrics (oximeter): refreshing list of producers from Nexus
collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
collector_ip = ::1
file = oximeter/collector/src/agent.rs:761
13:42:53.622Z DEBG test_mgs_metrics (oximeter): registered new metric producer
address = [::1]:44257
collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
collector_ip = ::1
producer_id = a6458b7d-87c3-4483-be96-854d814c20de
13:42:53.658Z DEBG test_mgs_metrics (oximeter): registered new metric producer
address = [::1]:40449
collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
collector_ip = ::1
producer_id = 47991ac2-392b-4188-bf19-96e0f0440d14
13:42:53.678Z INFO test_mgs_metrics (oximeter): refreshed list of producers from Nexus
collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
collector_ip = ::1
file = oximeter/collector/src/agent.rs:809
n_current_tasks = 1
n_pruned_tasks = 1
13:42:54.442Z DEBG test_mgs_metrics (oximeter): registered new metric producer
address = 127.0.0.1:41885
collector_id = 39e6175b-4df2-4730-b11d-cbc1e60a2e78
collector_ip = ::1
producer_id = e559cb0e-f223-4273-a733-d5f31b1be530
So I think what happened was:
a6458b7d-87c3-4483-be96-854d814c20de
; this isn't MGS).47991ac2-392b-4188-bf19-96e0f0440d14
; this one is MGS).n_current_tasks = 1
), so it pruned the other one it new about (MGS, because we got unlucky from timing or UUID ordering or something).e559cb0e-f223-4273-a733-d5f31b1be530
; also not MGS).At this point Oximeter is aware of 2 collectors but pruned the third, and the one this test in particular cares about. It would have corrected itself the next time the refresh loop ran, but that's 10 minutes away, and the test times out after 1.
Thanks @jgallagher for the analysis, I think that is all consistent with the behavior we see here.
One possible solution is to keep track of the time we start a refresh and the time every producer is stored. When we complete the refresh and update the internal set of producers, we prune producers that are in our map but not the refreshed list, except those which we inserted after we started the refresh. I think that should ensure that this exact sequence is handled correctly, but we might need to think more carefully about other possible races.
@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:
As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into ensure_producers()
with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.
That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.
@jgallagher and I talked in chat a bit about this, and it might be clearer to use generation numbers. We would really need two:
- a generation number on the whole set of collection tasks
- a per-producer generation number that records the generation of the set in which the producer was added or updated
As producers are POST'd to oximeter, we assign them the current collection generation. When oximeter starts to refresh its list, it first records the current generation number of the collection task set. It pulls its entire list, and then calls into
ensure_producers()
with that generation number it recorded. As it iterates through the list, it mostly operates in the same way it does today, except that it does not prune any producers with a later generation number than was passed into the call.That guarantees that producers which Nexus sent us between when we recorded the generation and started our own refresh are not pruned. Those would have a later generation number.
Epoch based reclamation ;)
I saw a different test (integration_tests::oximeter::test_oximeter_reregistration
) fail on #6913: https://buildomat.eng.oxide.computer/wg/0/details/01JAXAGXWPE822857D5D1DP0N4/DMz6rNUPg3AtDtKClnW7rWsxzHSSFztK1IEKxJe1w21To9FJ/01JAXAH9TM5NJ0PDYF6RSJXV5H#S5672
This sure feels related to me, so I thought I'd mention it here instead of making fresh issue for it.
Closed by #6926
This test failed on a CI run on https://github.com/oxidecomputer/omicron/pull/6881
https://github.com/oxidecomputer/omicron/pull/6881/checks?check_run_id=31702768426
Link here to the specific line of output from the buildomat log showing the failure: https://buildomat.eng.oxide.computer/wg/0/details/01JAE89628KSREHPCPZHC8DVS0/TV1XNO9ztAlr3QY3XQh3zGrYKkUsoCJHt0VvRzVuqCqsj1Ab/01JAE89QQBY3BRXDXDZ6N76977#S6309
Excerpt from the log showing the failure: