grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Short-lived ingester instances will cause other ingesters to fail to start #13262

Open zhangpeijin-milo opened 3 months ago

zhangpeijin-milo commented 3 months ago

Describe the bug
One of our ingester instances cannot start. Its logs show the following:

msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 10.0.40.84:9095 past heartbeat timeout"

"10.0.40.84:9095" was a short-lived ingester instance, and it looks like it was never removed from the ring. In the image below it appears as a very short line around "14:53".

[image: img_v3_02c0_b7d5f557-76d7-488a-b895-6b67cd945d1g]

The last log lines of ingester "10.0.40.84:9095" are as follows:

caller=memberlist_client.go:899 msg="skipped broadcasting CAS update because memberlist KV is shutting down" key=collectors/ring
caller=module_service.go:114 msg="module stopped" module=ring
caller=lifecycler.go:416 msg="auto-joining cluster after timeout" ring=ingester
caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=ingester

To Reproduce
Steps to reproduce the behavior:

  1. Start Loki with multiple ingesters.
  2. Add a new ingester, then delete it shortly afterwards.

Expected behavior
When an ingester is removed, the ring should update its records so that other nodes do not fail to start.

Environment:
