Open seizethedave opened 5 months ago
The code that does the down-scaling is deterministic, so the only thing I can think of is that loading the runtime config may go wrong. Since we store the config first (so it can be served at the URL) and only then notify the modules that use it, I wonder if there's a short window where the updated config is visible on the HTTP interface but not yet active? https://github.com/grafana/dskit/blob/a1bba1277f06b41ff14cf54e5d4d31aebc92493b/runtimeconfig/manager.go#L209
Reproduced with this diff:
```diff
diff --git a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
index 84b69de76..e43c136a5 100644
--- a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
+++ b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
@@ -206,6 +206,7 @@ func (om *Manager) loadConfig() error {
 	}
 	om.configLoadSuccess.Set(1)
+	time.Sleep(1 * time.Second) // give listeners time to register
 	om.setConfig(cfg)
 	om.callListeners(cfg)
```
> Since we store the config first (so it can be served at the URL) and only then notify the modules that use it, I wonder if there's a short window where the updated config is visible on the HTTP interface but not yet active?
~I think you are onto it. I see two possibilities:~
These would cause inconsistent updates among listeners, but I don't think they would cause the tests to fail, because I don't think the listeners are in the "update path" of runtime config values like `native_histograms_ingestion_enabled`.
(Waking up on this Monday morning 😸.)
But the race with the one-second sleep you highlighted would also cause it.
I guess your one-second sleep just takes longer than the tests are willing to wait for the new config to become available. One second, while an eternity, could itself also be a source of flakes. I'll open a PR to crank the wait time up to something like ten seconds.
Actually, I don't think that was the problem. Here's the run that failed: https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591
`test.Poll` will `t.Fatal` if the condition isn't met after 1 second, and that didn't happen on that run.
**Describe the bug**
This test failed for me in GitHub CI and succeeded on rerun. I was working on something unrelated in a branch whose parent commit on `main` was 5e93125923e94718d9ec1673fae0d3d6a65e0686.