I'm evaluating mimir-distributed in high availability mode to determine its reliability when one of the nodes is offline. After several node shutdown and startup cycles, I found that the deployment stopped functioning properly.
To Reproduce
Steps to reproduce the behavior:
Deploy mimir-distributed using the HA config in a 3-node Kubernetes cluster (current Helm chart version = 5.1.4)
Bring down one of the nodes and wait for pods to reschedule
After some time, check if any metrics are ingested
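One way to perform the last check is to run an instant query against Mimir's Prometheus-compatible API. The following is a minimal sketch, not part of the original report: the base URL, the tenant name, and the `count(up)` query are assumptions and depend on how the gateway is exposed and on which series are being remote-written.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, tenant, query):
    """Build an instant-query request for the Prometheus-compatible API."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url.rstrip('/')}/prometheus/api/v1/query?{params}"
    # X-Scope-OrgID selects the tenant when multi-tenancy is enabled.
    return urllib.request.Request(url, headers={"X-Scope-OrgID": tenant})


def has_samples(payload):
    """Return True if a successful instant-query response holds at least one series."""
    if payload.get("status") != "success":
        return False
    return len(payload.get("data", {}).get("result", [])) > 0


def metrics_ingested(base_url, tenant, query="count(up)", timeout=10):
    """Send the query and report whether any series came back.

    base_url and tenant are hypothetical values supplied by the caller.
    """
    req = build_query_request(base_url, tenant, query)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return has_samples(json.load(resp))
```

For example, after port-forwarding the gateway locally, `metrics_ingested("http://localhost:8080", "demo")` would return `False` once no matching series has received samples within the query lookback window. Both the address and the tenant name here are placeholders.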
Expected behavior
Mimir-distributed should still work if one of the nodes is down
Environment
Infrastructure: Kubernetes
Deployment tool: Helm / ArgoCD
Additional Context
I managed to replicate this issue after a few shutdown attempts. Upon investigation, I observed that after a node shutdown, mimir-querier fails to update its list of query-scheduler endpoints. It keeps trying to connect to the old endpoints until the pod is restarted. Notably, an endpoint is only skipped if a "no route to host" error is received; otherwise, even with at least one healthy endpoint available, the querier still tries to reach an endpoint that returns an "i/o timeout" error.
mimir-querier-69b6848c4f-dsqb8
ts=2024-03-06T14:22:26.150122014Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.1.12:9095: connect: no route to host\"" addr=10.0.1.12:9095 (does not exist anymore after node stop/start, but still present in the list)
--------------------------------
ts=2024-03-06T17:06:33.404993635Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.2.47:9095: i/o timeout\"" addr=10.0.2.47:9095 (node is down, kubernetes will show the pod as terminating)
--------------------------------
10.0.0.54 is never accessed
test mimir-query-scheduler-6467c94cc6-fb2n9 1/1 Running 0 149m 10.0.1.58 node-3 <none> <none>
test mimir-query-scheduler-6467c94cc6-pwlr5 1/1 Terminating 0 176m 10.0.2.47 node-2 <none> <none>
test mimir-query-scheduler-6467c94cc6-smnqv 1/1 Running 0 3h15m 10.0.0.54 node-1 <none> <none>
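The comparison above between the addresses the querier dials and the pods that are actually running can be scripted. The sketch below is only an illustration: it assumes the warning lines keep the `addr=IP:PORT` format shown in the excerpts, and the helper names are hypothetical. Fed only the two excerpted lines, it would also list 10.0.1.58 as not dialed, so a meaningful result needs the full querier log.

```python
import re

# addr=IP:PORT field at the end of the querier warning line.
ADDR_RE = re.compile(r"addr=(\d+\.\d+\.\d+\.\d+):(\d+)")
# Reason that follows "dial tcp IP:PORT: ", up to the escaped closing quote.
DIAL_ERR_RE = re.compile(r'dial tcp [\d.]+:\d+: (?:connect: )?([^\\"]+)')

# The two warning lines quoted above, without the added annotations.
SAMPLE_LOGS = [
    r'ts=2024-03-06T14:22:26.150122014Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.1.12:9095: connect: no route to host\"" addr=10.0.1.12:9095',
    r'ts=2024-03-06T17:06:33.404993635Z caller=scheduler_processor.go:117 level=warn msg="error contacting scheduler" err="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.2.47:9095: i/o timeout\"" addr=10.0.2.47:9095',
]

# IPs of the query-scheduler pods listed as Running above.
RUNNING_IPS = {"10.0.1.58", "10.0.0.54"}


def dialed_endpoints(log_lines):
    """Map each dialed scheduler IP to the set of dial errors logged for it."""
    seen = {}
    for line in log_lines:
        addr = ADDR_RE.search(line)
        if not addr:
            continue
        err = DIAL_ERR_RE.search(line)
        reason = err.group(1).strip() if err else "unknown"
        seen.setdefault(addr.group(1), set()).add(reason)
    return seen


def compare(dialed, running_ips):
    """Return (IPs dialed but not running, running IPs with no logged dial error)."""
    stale = sorted(ip for ip in dialed if ip not in running_ips)
    never_dialed = sorted(ip for ip in running_ips if ip not in dialed)
    return stale, never_dialed


if __name__ == "__main__":
    stale, never_dialed = compare(dialed_endpoints(SAMPLE_LOGS), RUNNING_IPS)
    print("stale endpoints still dialed:", stale)
    print("running endpoints without dial errors in the log:", never_dialed)
```

Because only warning lines are parsed, an address missing from the output means no dial error was logged for it, not necessarily that the querier never connected to it.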
It is important to note that a similar issue was found on loki-distributed, where the workaround was to restart the querier pod, which should not be required in a production environment. Another workaround was to start the node that had been stopped.
When you say malfunction, what exactly do you mean? Do you mean the error logs of the querier pod? Or did you observe failing queries or other timeouts?