Open pracucci opened 2 years ago
Any query running after Consul restart and before all ingesters have registered themselves back to the ring will return partial query results because queriers will only query ingesters that registered to the ring until then.
I just investigate another case where Consul was restarted, and there's more.
When ingesters progressively add themself back to the ring, tenants for which shuffle sharding is enabled will get resharded across ingesters. Queriers will lookup only the current shards for a given tenant, but series have been written to other ingesters too until all ingesters re-registered to the ring.
Queriers will lookup only the current shards for a given tenant
This happens because when ingesters re-register themselves to the ring they use the original "registration timestamp" instead of updating it to "now", as if they were just added to the cluster for the first time: https://github.com/grafana/dskit/blob/8f57b58632653c7536a333faa84fe6e9c856d3ee/ring/lifecycler.go#L736-L740
An option to prevent this specific issue, could be updating the "registration timestamp" to "now" when an instance is added back to the ring.
With the proposed change, it would act as if the re-registered ingesters just joined the cluster. Basically, after Consul restart, queriers will query all ingesters for every tenant for the next -querier.shuffle-sharding-ingesters-lookback-period
after Consul restart. It's higher load on ingesters, but should guarantee query correctness in the described case.
When ingesters progressively add themself back to the ring, tenants for which shuffle sharding is enabled will get resharded across ingesters. Queriers will lookup only the current shards for a given tenant, but series have been written to other ingesters too until all ingesters re-registered to the ring.
The case described in this comment should be fixed by https://github.com/grafana/mimir/pull/1829.
The case described in the issue description is still valid.
Describe the bug
I've investigated some query results failures reported by
mimir-continuous-test
tool and I've found an edge case where Mimir could return and cache partial query results right after a restart of the ingesters hash ring backend (Consul or etcd). This bug shouldn't happen when using memberlist.Scenario:
Actual outcome:
Investigation
Ingesters periodically heartbeat the ring. After Consul restart, during the periodic heartbeat, each ingester find it's not registered anymore in the ring it registers itself. Each ingester register itself at a slightly different time (happening within the configured heartbeat period). Any query running after Consul restart and before all ingesters have registered themselves back to the ring will return partial query results because queriers will only query ingesters that registered to the ring until then.
A similar issue has been discussed in https://github.com/grafana/mimir/issues/344. Fix for https://github.com/grafana/mimir/issues/344 could also fix this issue.