Partial query results cached in the results cache after restart of ingesters hash ring backend

pracucci commented 2 years ago

Describe the bug

I've investigated some query results failures reported by mimir-continuous-test tool and I've found an edge case where Mimir could return and cache partial query results right after a restart of the ingesters hash ring backend (Consul or etcd). This bug shouldn't happen when using memberlist.

Scenario:

Ingesters hash ring backed by Consul
Consul is running with no persistent storage
Consul is restarted and its storage is empty right after startup
Run a range query right after Consul is restarted

Actual outcome:

They query result may contain partial series for any data point fetched from ingesters only
The partial query result could be cached in the results cache too, for any sample with timestamp older than 10m

Investigation

Ingesters periodically heartbeat the ring. After Consul restart, during the periodic heartbeat, each ingester find it's not registered anymore in the ring it registers itself. Each ingester register itself at a slightly different time (happening within the configured heartbeat period). Any query running after Consul restart and before all ingesters have registered themselves back to the ring will return partial query results because queriers will only query ingesters that registered to the ring until then.

A similar issue has been discussed in https://github.com/grafana/mimir/issues/344. Fix for https://github.com/grafana/mimir/issues/344 could also fix this issue.

pracucci commented 2 years ago

Any query running after Consul restart and before all ingesters have registered themselves back to the ring will return partial query results because queriers will only query ingesters that registered to the ring until then.

I just investigate another case where Consul was restarted, and there's more.

When ingesters progressively add themself back to the ring, tenants for which shuffle sharding is enabled will get resharded across ingesters. Queriers will lookup only the current shards for a given tenant, but series have been written to other ingesters too until all ingesters re-registered to the ring.

Queriers will lookup only the current shards for a given tenant

This happens because when ingesters re-register themselves to the ring they use the original "registration timestamp" instead of updating it to "now", as if they were just added to the cluster for the first time: https://github.com/grafana/dskit/blob/8f57b58632653c7536a333faa84fe6e9c856d3ee/ring/lifecycler.go#L736-L740

An option to prevent this specific issue, could be updating the "registration timestamp" to "now" when an instance is added back to the ring.

With the proposed change, it would act as if the re-registered ingesters just joined the cluster. Basically, after Consul restart, queriers will query all ingesters for every tenant for the next -querier.shuffle-sharding-ingesters-lookback-period after Consul restart. It's higher load on ingesters, but should guarantee query correctness in the described case.

pracucci commented 2 years ago

When ingesters progressively add themself back to the ring, tenants for which shuffle sharding is enabled will get resharded across ingesters. Queriers will lookup only the current shards for a given tenant, but series have been written to other ingesters too until all ingesters re-registered to the ring.

The case described in this comment should be fixed by https://github.com/grafana/mimir/pull/1829.

The case described in the issue description is still valid.

grafana / mimir

Partial query results cached in the results cache after restart of ingesters hash ring backend #1766

Describe the bug

Investigation