grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Temporarily partial query results on latest data points when increasing shuffle-sharding size #1765

Open pracucci opened 2 years ago

pracucci commented 2 years ago

Describe the bug

I've investigated some query result failures reported by the mimir-continuous-test tool and found an edge case where Mimir can temporarily return partial query results on the latest data points when a tenant's ingester shard size is increased.

Scenario:

Actual outcome:

Investigation

The issue is caused by the fact that a runtime config change is not applied atomically across replicas (there's the ConfigMap update delay plus Mimir's periodic polling of the runtime config). If the change reaches some distributors before it reaches all queriers, the most recent data points are written to new ingesters (because the shard size increased), but queriers that haven't picked up the change yet don't query those ingesters. This causes partial query results on the most recent data points while the shard size increase is rolling out.
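
To make the race concrete, here is a minimal Python sketch. It is not Mimir's actual shuffle-sharding code: the `shard` function, the ingester names, and the hash-ranking scheme are all assumptions made for illustration, and replication is ignored. It only shows that when a distributor already uses the larger shard size while a querier still uses the smaller one, some ingesters receive writes but are never queried.

```python
import hashlib

# Hypothetical ring of 12 ingesters (names are illustrative only).
INGESTERS = [f"ingester-{i}" for i in range(12)]


def shard(tenant: str, ingesters: list[str], size: int) -> list[str]:
    """Pick `size` ingesters for a tenant by ranking them on a per-tenant hash.

    Assumption: a growing shard size only ever adds ingesters to the set,
    so a smaller shard is always a subset of a larger one.
    """
    ranked = sorted(
        ingesters,
        key=lambda name: hashlib.sha256(f"{tenant}/{name}".encode()).hexdigest(),
    )
    return ranked[:size]


def unqueried_ingesters(tenant: str, write_size: int, read_size: int) -> list[str]:
    """Ingesters that receive writes but are not covered by the read shard."""
    written = set(shard(tenant, INGESTERS, write_size))
    queried = set(shard(tenant, INGESTERS, read_size))
    return sorted(written - queried)


# Distributor already sees shard size 6, querier still sees 3:
# three ingesters hold fresh samples that the querier never asks for.
missed = unqueried_ingesters("tenant-a", write_size=6, read_size=3)
print(len(missed))
```

In the opposite order (querier at 6, distributor still at 3) the difference is empty, which is why the ordering of the rollout matters.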

krajorama commented 2 years ago

I'm guessing we need to make this two-phase? First tell queriers to start querying the new ingesters, and then start writing to those ingesters?
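
A rough sketch of what a two-phase rollout could look like, in Python. Everything here is hypothetical: `next_rollout_step` and the idea of observing each replica's currently loaded shard size are assumptions for illustration, not an existing Mimir mechanism.

```python
def next_rollout_step(
    querier_sizes: list[int],
    distributor_sizes: list[int],
    target: int,
) -> str:
    """Decide which phase of a shard size increase to run next.

    `querier_sizes` and `distributor_sizes` are the shard sizes each replica
    has currently loaded (hypothetical inputs; how they would be collected
    is out of scope for this sketch).
    """
    # Phase 1: every querier must read from the larger shard first.
    if any(size < target for size in querier_sizes):
        return "raise-queriers"
    # Phase 2: only then may distributors start writing to the new ingesters.
    if any(size < target for size in distributor_sizes):
        return "raise-distributors"
    return "done"


print(next_rollout_step([3, 6], [3, 3], target=6))  # raise-queriers
print(next_rollout_step([6, 6], [3, 3], target=6))  # raise-distributors
```

The point of the ordering is that the read shard is always a superset of the write shard while the change propagates.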

krajorama commented 2 years ago

Or possibly one phase but include a warmup time checkpoint, before which distributors should use the old ring?
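
The warmup idea could be sketched like this in Python. The `ShardSizeChange` type and both helper functions are hypothetical, not part of Mimir's runtime config. The assumption is that the config change carries a timestamp before which distributors keep using the old shard size, while queriers switch to the larger size as soon as they load the change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardSizeChange:
    """Hypothetical runtime config entry for a shard size change."""

    old_size: int
    new_size: int
    effective_at: float  # Unix seconds; end of the warmup window


def write_shard_size(change: ShardSizeChange, now: float) -> int:
    """Distributors keep the old size until the warmup checkpoint passes."""
    return change.new_size if now >= change.effective_at else change.old_size


def read_shard_size(change: ShardSizeChange, now: float) -> int:
    """Queriers use the larger size as soon as they have loaded the change."""
    return max(change.old_size, change.new_size)


change = ShardSizeChange(old_size=3, new_size=6, effective_at=1_000.0)
print(write_shard_size(change, now=900.0), read_shard_size(change, now=900.0))    # 3 6
print(write_shard_size(change, now=1_000.0), read_shard_size(change, now=1_000.0))  # 6 6
```

For this to help, the warmup window would need to be longer than the worst-case propagation delay (ConfigMap update delay plus the runtime config polling period), so that every querier has loaded the change before any distributor starts writing to the new ingesters.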