grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Test flake: TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling #8070

Open · seizethedave opened 5 months ago

seizethedave commented 5 months ago

Describe the bug

This test failed for me in GitHub CI and succeeded on rerun. I was working on something unrelated in a branch whose parent commit on main was 5e93125923e94718d9ec1673fae0d3d6a65e0686.

19:19:22 Started distributor
=== RUN   TestDistributor/caching_unmarshal_data_disabled/wrong_labels_order
=== RUN   TestDistributor/caching_unmarshal_data_disabled/dropped_labels
19:19:23 querier: ts=2024-05-06T19:19:23.811704501Z caller=loader.go:120 level=warn msg="bucket index not found" user=e2e-user
=== RUN   TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling
    distributor_test.go:361: 
            Error Trace:    /home/runner/work/mimir/mimir/integration/distributor_test.go:361
            Error:          Not equal: 
                            expected: "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[[171](https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591?pr=8068#step:15:172)5019559]"
                            actual  : "histogram_down_scaling_series =>\nCount: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]"

                            Diff:
                            --- Expected
                            +++ Actual
                            @@ -1,2 +1,2 @@
                             histogram_down_scaling_series =>
                            -Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-4):2 [-4,-1):2 [-1,-0.25):1 [-0.001,0.001]:2 (0.25,1]:1 (1,4]:2 (4,16]:2] @[1715019559]
                            +Count: 12.000000, Sum: 18.400000, Buckets: [[-16,-8):1 [-8,-4):1 [-2,-1):2 [-1,-0.5):1 [-0.001,0.001]:2 (0.5,1]:1 (1,2]:2 (4,8]:1 (8,16]:1] @[1715019559]
            Test:           TestDistributor/caching_unmarshal_data_disabled/reduce_native_histogram_buckets_via_down_scaling

krajorama commented 5 months ago

The code that does the down-scaling is deterministic, so the only thing I can think of is that loading the runtime config may go wrong. I wonder whether the fact that we store the config first (for serving it over the URL) and only then notify the modules that use it means there's a short window in which the updated config is visible on the HTTP interface but not yet active? https://github.com/grafana/dskit/blob/a1bba1277f06b41ff14cf54e5d4d31aebc92493b/runtimeconfig/manager.go#L209
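
To make that window concrete, here's a minimal sketch of the suspected ordering (toy types and field names, not the actual dskit code): the new value is stored where the HTTP handler reads it before the listener goroutine has consumed it, so a test polling the HTTP endpoint can see the new config while the module it cares about is still running with the old one.

```go
// Toy illustration of the suspected window; types and fields are made up.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type config struct{ maxBuckets int }

// manager mimics the runtime config manager: it stores the config for HTTP
// readers first, then notifies a listener over a channel.
type manager struct {
	mu       sync.Mutex
	current  config
	listener chan config
}

func (m *manager) loadConfig(cfg config) {
	m.mu.Lock()
	m.current = cfg // step 1: immediately visible on the HTTP interface
	m.mu.Unlock()
	m.listener <- cfg // step 2: the consuming module applies it later
}

func (m *manager) httpView() config {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.current
}

func main() {
	m := &manager{listener: make(chan config, 1)}
	var applied atomic.Int64 // what the "distributor" is actually using

	go func() {
		for cfg := range m.listener {
			time.Sleep(100 * time.Millisecond) // listener is busy or scheduled late
			applied.Store(int64(cfg.maxBuckets))
		}
	}()

	m.loadConfig(config{maxBuckets: 8})
	fmt.Println("HTTP view maxBuckets:", m.httpView().maxBuckets) // 8
	fmt.Println("applied maxBuckets:  ", applied.Load())          // most likely still 0
}
```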

krajorama commented 5 months ago

Reproduced with this diff, which delays applying the newly loaded config by one second:

diff --git a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
index 84b69de76..e43c136a5 100644
--- a/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
+++ b/vendor/github.com/grafana/dskit/runtimeconfig/manager.go
@@ -206,6 +206,7 @@ func (om *Manager) loadConfig() error {
        }
        om.configLoadSuccess.Set(1)

+       time.Sleep(1 * time.Second) // give listeners time to register
        om.setConfig(cfg)
        om.callListeners(cfg)

seizethedave commented 5 months ago

> I wonder whether the fact that we store the config first (for serving it over the URL) and only then notify the modules that use it means there's a short window in which the updated config is visible on the HTTP interface but not yet active?

~I think you are onto it. I see two possibilities:~

  1. ~The single goroutine per module isn't done processing the updated config.~
  2. ~(a further symptom of number 1) If the reader side is behind on processing updates, the write side will discard an update.~

Either of these would cause inconsistent updates among listeners, but I don't think they would cause the tests to fail, because I don't think the listeners are in the "update path" of runtime config values like native_histograms_ingestion_enabled.
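
For reference on possibility 2, here's a simplified sketch of the non-blocking notification pattern being described (not the exact dskit code): if a listener's buffered channel is still full because it hasn't caught up, the send falls through to the default branch and that listener simply misses the update.

```go
// Simplified sketch: a non-blocking send drops the update for listeners
// whose channel buffer is still full with an earlier, unprocessed update.
func notifyListeners(listeners []chan interface{}, newCfg interface{}) {
	for _, ch := range listeners {
		select {
		case ch <- newCfg:
			// Listener will eventually apply newCfg.
		default:
			// Listener is behind; this update is silently discarded for it,
			// so different listeners can end up with different views.
		}
	}
}
```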

seizethedave commented 5 months ago

{waking up on this Monday morning 😸.}

But the race you highlighted with the one-second sleep would also cause it.

seizethedave commented 5 months ago

I guess your one-second sleep just takes longer than the tests care to wait for the new config to become available. One second, while an eternity, could still be a source of flakes. I'll open a PR to crank the wait time up to something like ten seconds.

seizethedave commented 5 months ago

Actually, I don't think that was the problem. Here's the run that failed: https://github.com/grafana/mimir/actions/runs/8974460472/job/24647018591

test.Poll will call t.Fatal if the condition isn't met within the 1 second, and that didn't happen in that run.
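
For anyone following along, a rough sketch of the polling semantics in question (assuming the Poll helper from github.com/grafana/dskit/test; the helper and condition below are illustrative, not the actual check in the integration test): the condition is re-evaluated until it matches or the deadline passes, and the test fails otherwise. A manager stuck behind a full one-second sleep would therefore have produced a Fatal here, which is not what the failing run shows.

```go
package integration

import (
	"testing"
	"time"

	"github.com/grafana/dskit/test"
)

// waitForMaxBuckets is a hypothetical helper: it keeps re-reading the limit
// (e.g. from the runtime config HTTP endpoint) until it equals want, and
// fails the test via t.Fatal if that doesn't happen within one second.
func waitForMaxBuckets(t *testing.T, readLimit func() int, want int) {
	test.Poll(t, 1*time.Second, want, func() interface{} {
		return readLimit()
	})
}
```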