grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

[mimir-distributed] Unexpected Shutdown Causes Mimir Distributed System in HA Configuration to Malfunction #7554

Open alita1991 opened 6 months ago

alita1991 commented 6 months ago

Describe the bug

I'm evaluating mimir-distributed in high availability mode to determine how reliable it is when one of the nodes goes offline. After a series of node shutdown and startup operations, I found that the solution stopped working properly.

To Reproduce

Steps to reproduce the behavior:

  1. Deploy mimir-distributed with an HA configuration on a 3-node Kubernetes cluster (current Helm chart version = 5.1.4); a rough deployment sketch follows this list
  2. Bring down one of the nodes and wait for the pods to be rescheduled
  3. After some time, check whether any metrics are still being ingested
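
For reference, the deployment and ingestion check looked roughly like the following. This is a sketch, not the exact setup: the values file was not included in this report, so ha-values.yaml is a placeholder, and the release name mimir, the test namespace, the mimir-nginx gateway service, and the tenant ID are assumptions based on the pod names shown further down.

# Deploy the mimir-distributed chart in HA mode (ha-values.yaml is a placeholder for the real HA values)
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install mimir grafana/mimir-distributed \
  --version 5.1.4 \
  --namespace test --create-namespace \
  -f ha-values.yaml

# After bringing a node down, check whether samples are still being ingested.
# Gateway service name and tenant ID depend on your configuration.
kubectl -n test port-forward svc/mimir-nginx 8080:80 &
curl -H 'X-Scope-OrgID: anonymous' \
  'http://localhost:8080/prometheus/api/v1/query?query=up'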

Expected behavior

Mimir-distributed should continue to work when one of the nodes is down.

Environment

Additional Context

I managed to reproduce this issue after a few shutdown attempts. Upon investigation, I observed that after a node shutdown, mimir-querier fails to update its list of endpoints to connect to. It persistently attempts to connect to the old endpoints unless the pod is restarted. Notably, an endpoint is only skipped if a "no route to host" error is received; otherwise, even with at least one healthy endpoint available, the querier still tries to reach an endpoint where it only gets a timeout.
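
One way to see the mismatch (the label, service, and deployment names below are assumptions based on the default mimir-distributed naming for a release called mimir) is to compare what Kubernetes currently reports for the query-scheduler with the addresses the querier keeps dialing in its logs:

# Pods and endpoints Kubernetes currently knows about for the query-scheduler
kubectl -n test get pods -l app.kubernetes.io/component=query-scheduler -o wide
kubectl -n test get endpoints mimir-query-scheduler-headless

# Addresses the querier is still trying to reach
kubectl -n test logs deploy/mimir-querier | grep scheduler_processor | tail -n 20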

Logs from mimir-querier-69b6848c4f-dsqb8:

ts=2024-03-06T14:22:26.150122014Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.1.12:9095: connect: no route to host\"" addr=10.0.1.12:9095 (does not exist anymore after node stop/start, but still present in the list)
--------------------------------
ts=2024-03-06T17:06:33.404993635Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.2.47:9095: i/o timeout\"" addr=10.0.2.47:9095 (node is down, kubernetes will show the pod as terminating)
--------------------------------
10.0.0.54 is never accessed
test   mimir-query-scheduler-6467c94cc6-fb2n9   1/1   Running       0   149m    10.0.1.58   node-3   <none>   <none>
test   mimir-query-scheduler-6467c94cc6-pwlr5   1/1   Terminating   0   176m    10.0.2.47   node-2   <none>   <none>
test   mimir-query-scheduler-6467c94cc6-smnqv   1/1   Running       0   3h15m   10.0.0.54   node-1   <none>   <none>

It is important to note that a similar issue was found with loki-distributed, where the only workarounds were to restart the querier pod, which is not acceptable in a production environment, or to bring the stopped node back up.
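
For completeness, the workaround used so far is simply forcing the querier to re-resolve its endpoints by restarting it (the deployment name is assumed from the querier pod name above):

# Restart the querier so it picks up the current scheduler endpoints
kubectl -n test rollout restart deployment/mimir-querier
kubectl -n test rollout status deployment/mimir-querier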

dimitarvdimitrov commented 5 months ago

When you say malfunction, what exactly do you mean? Do you mean the error logs from the querier pod, or did you observe failing queries or other timeouts?