grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Ingester should not go ready if it can't resolve memberlist #1511

Open bboreham opened 2 years ago

bboreham commented 2 years ago

Describe the bug

All my ingesters are reporting "ready" even though none of them could resolve the memberlist endpoints, so the ring is empty.

To Reproduce

  1. Have Mimir configured to discover gossip members via DNS
  2. Remove the DNS name, e.g. by kubectl delete service xyz

Expected behavior

The ingester should not go 'Ready' while it cannot resolve any memberlist endpoint.

Environment

Additional Context

level=info ts=2022-03-18T17:18:19.743897039Z caller=main.go:193 msg="Starting application" version="(version=2.0.0-rc.2, branch=HEAD, revision=c0c349e)"
level=info ts=2022-03-18T17:18:19.744555168Z caller=server.go:285 http=[::]:8080 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2022-03-18T17:18:19.748669166Z caller=ingester.go:321 msg="TSDB idle compaction timeout set" timeout=1h13m17.07645809s
level=info ts=2022-03-18T17:18:19.749009942Z caller=memberlist_client.go:394 msg="Using memberlist cluster node name" name=cortex-ingester-0-98ef3ae8
level=info ts=2022-03-18T17:18:19.749256586Z caller=module_service.go:64 msg=initialising module=sanity-check
level=info ts=2022-03-18T17:18:19.749383072Z caller=sanity_check.go:33 msg="Checking directories read/write access"
level=info ts=2022-03-18T17:18:19.754607183Z caller=sanity_check.go:38 msg="Directories read/write access successfully checked"
level=info ts=2022-03-18T17:18:19.754677074Z caller=sanity_check.go:40 msg="Checking object storage config"
ts=2022-03-18T17:18:19.801313392Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve cortex-gossip-ring: lookup cortex-gossip-ring on 10.8.0.10:53: no such host"
level=info ts=2022-03-18T17:18:19.863035637Z caller=sanity_check.go:45 msg="Object storage config successfully checked"
level=info ts=2022-03-18T17:18:19.863264854Z caller=module_service.go:64 msg=initialising module=server
level=info ts=2022-03-18T17:18:19.863613824Z caller=module_service.go:64 msg=initialising module=memberlist-kv
level=info ts=2022-03-18T17:18:19.863845808Z caller=module_service.go:64 msg=initialising module=runtime-config
level=info ts=2022-03-18T17:18:19.8642321Z caller=module_service.go:64 msg=initialising module=ingester-service
level=info ts=2022-03-18T17:18:19.864358947Z caller=ingester.go:1586 msg="opening existing TSDBs"
[...]
ts=2022-03-18T17:18:21.409580175Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve cortex-gossip-ring: lookup cortex-gossip-ring on 10.8.0.10:53: no such host"
level=info ts=2022-03-18T17:18:22.160518178Z caller=head.go:536 org_id=fake msg="Replaying on-disk memory mappable chunks if any"
level=info ts=2022-03-18T17:18:22.160736101Z caller=head.go:570 org_id=fake msg="On-disk memory mappable chunks replay completed" duration=155.729µs
level=info ts=2022-03-18T17:18:22.160805672Z caller=head.go:576 org_id=fake msg="Replaying WAL, this may take a while"
ts=2022-03-18T17:18:24.917933807Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve cortex-gossip-ring: lookup cortex-gossip-ring on 10.8.0.10:53: no such host"
ts=2022-03-18T17:18:32.693461311Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve cortex-gossip-ring: lookup cortex-gossip-ring on 10.8.0.10:53: no such host"
[...]
level=info ts=2022-03-18T17:18:37.237678568Z caller=ingester.go:1679 msg="successfully opened existing TSDBs"
level=info ts=2022-03-18T17:18:37.237778779Z caller=lifecycler.go:546 msg="not loading tokens from file, tokens file path is empty"
level=info ts=2022-03-18T17:18:37.237929464Z caller=lifecycler.go:575 msg="instance not found in ring, adding with no tokens" ring=ingester
level=info ts=2022-03-18T17:18:37.238128977Z caller=mimir.go:459 msg="Application started"
level=info ts=2022-03-18T17:18:37.238354661Z caller=lifecycler.go:422 msg="auto-joining cluster after timeout" ring=ingester
ts=2022-03-18T17:18:48.318970445Z caller=memberlist_logger.go:74 level=warn msg="Failed to resolve cortex-gossip-ring: lookup cortex-gossip-ring on 10.8.0.10:53: no such host"

pstibrany commented 2 years ago

This can be achieved with -memberlist.abort-if-join-fails=true (which is the default value).