Temporarily partial query results on latest data points when increasing shuffle-sharding size

pracucci commented 2 years ago

Describe the bug

I've investigated some query result failures reported by the mimir-continuous-test tool and found an edge case where Mimir could temporarily return partial query results on the latest data points when a tenant's ingester shard size is increased.

Scenario:

Mimir deployed with jsonnet (runtime config deployed as ConfigMap)
Ingesters shard size for a tenant is increased
Run a query while the shard size increase change is rolling out

Actual outcome:

The query result may contain partial series for the most recent data points
Since the issue should happen only on the most recent data points, the partial query result should never be cached in the results cache (we cache only samples with a timestamp older than 10m)

Investigation

The issue is due to the fact that applying a change to the runtime config is not an atomic operation across multiple replicas (there's the ConfigMap update delay plus Mimir's periodic polling of the runtime config). If the change is applied to some distributors before it has reached all queriers, the most recent data points are written to new ingesters (because we're increasing the shard size) but the queriers still on the old config are not querying them yet (because they're not yet aware that the shard size was increased). This causes partial query results on the most recent data points while the shard size increase is rolling out.
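
To make the race concrete, here is a minimal sketch in Go. It is not Mimir's code: `shard` and `missedIngesters` are hypothetical helpers, and the shard is modelled as a prefix of a fixed per-tenant ordering of ingesters, which is a simplification of real shuffle sharding. The only property it relies on is that the old shard is a subset of the new, larger one.

```go
package main

import "fmt"

// shard returns the tenant's shard: the first `size` ingesters of a fixed
// per-tenant ordering. ASSUMPTION: this stands in for shuffle sharding and
// only preserves the "old shard is a subset of the new shard" property.
func shard(ordering []string, size int) []string {
	if size > len(ordering) {
		size = len(ordering)
	}
	return ordering[:size]
}

// missedIngesters returns the ingesters that receive writes but are not
// part of the set the querier reads from.
func missedIngesters(written, queried []string) []string {
	seen := make(map[string]bool, len(queried))
	for _, id := range queried {
		seen[id] = true
	}
	var missed []string
	for _, id := range written {
		if !seen[id] {
			missed = append(missed, id)
		}
	}
	return missed
}

func main() {
	ordering := []string{"ingester-0", "ingester-1", "ingester-2", "ingester-3", "ingester-4"}

	// The distributor has already loaded the new runtime config (size 5),
	// while the querier is still on the old one (size 3).
	written := shard(ordering, 5)
	queried := shard(ordering, 3)

	// Recent samples held only by these ingesters are invisible to the querier.
	fmt.Println("written to but not queried:", missedIngesters(written, queried))
}
```

Once the querier also picks up the new shard size, the two sets match again and the missing samples become visible, which is why the effect is temporary.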

krajorama commented 2 years ago

I'm guessing that we need to make this two-phase? First tell queriers to start querying the new ingesters, and then start writing to those ingesters?

krajorama commented 2 years ago

Or possibly one phase but include a warmup time checkpoint, before which distributors should use the old ring?

grafana / mimir

Temporarily partial query results on latest data points when increasing shuffle-sharding size #1765
