grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

autoforget_unhealthy isn't working as expected for Ingesters. #6407

Open sharathfeb12 opened 2 years ago

sharathfeb12 commented 2 years ago

Describe the bug
I have enabled autoforget_unhealthy for ingesters. When the ingester pod starts, it logs that autoforget is enabled:

level=info ts=2022-06-16T02:27:15.182820969Z caller=ingester.go:308 msg="autoforget is enabled and will remove unhealthy instances from the ring after 1m0s with no heartbeat"

It then complains that an existing instance in the ring has a problem and asks me to clean it up manually via the /ring endpoint:

level=warn ts=2022-06-16T02:27:45.421965683Z caller=lifecycler.go:245 msg="found an existing instance(s) with a problem in the ring, this instance cannot become ready until this problem is resolved. The /ring http endpoint on the distributor (or single binary) provides visibility into the ring." ring=ingester err="instance 172.20.15.152:9095 past heartbeat timeout"

To Reproduce
Steps to reproduce the behavior: restart the ingesters after setting the autoforget_unhealthy flag to true.

Expected behavior
Expected the unhealthy ingesters to be removed from the ring automatically.

Environment:

Config:

apiVersion: v1
data:
  config.yaml: |-
    "auth_enabled": false
    "compactor":
      "compaction_interval": "10m"
      "shared_store": "gcs"
      "working_directory": "/data/loki/compactor"
      "retention_enabled": true
    "distributor":
      "ring":
        "kvstore":
          "store": "memberlist"
    "frontend":
      "compress_responses": false
      "max_outstanding_per_tenant": 2048
      "tail_proxy_url": "http://querier.logs.svc.cluster.local:3100"
    "frontend_worker":
      "frontend_address": "queryfrontend.logs.svc.cluster.local:9095"
      "grpc_client_config":
        "max_send_msg_size": 1104857600
      "parallelism": 256
    "ingester":
      "chunk_block_size": 262144
      "chunk_target_size": 1536000
      "chunk_encoding": "snappy"
      "chunk_idle_period": "30m"
      "autoforget_unhealthy": true
      "lifecycler":
        "heartbeat_period": "1m"
        "interface_names":
        - "eth0"
        "num_tokens": 512
        "ring":
          "kvstore":
            "store": "memberlist"
          "heartbeat_timeout": "1m"
          "replication_factor": 1
      "max_transfer_retries": 0
      "wal":
        "enabled": true
        "dir": "data"
    "ingester_client":
      "grpc_client_config":
        "max_recv_msg_size": 1104857600
        "max_send_msg_size": 1104857600
        "backoff_on_ratelimits": true
        "backoff_config":
          "min_period": "1s"
          "max_period": "32s"
          "max_retries": 10
      "remote_timeout": "10s"
    "limits_config":
      "enforce_metric_name": false
      "ingestion_burst_size_mb": 512
      "ingestion_rate_mb": 256
    "memberlist":
      "bind_port": 7946
      "join_members":
      - "gossip-ring.logs.svc.cluster.local:7946"
      "max_join_backoff": "1m"
      "max_join_retries": 10
      "min_join_backoff": "1s"
    "querier":
      "engine":
        "timeout": "15m"
      "extra_query_delay": "0s"
    "query_range":
      "align_queries_with_step": true
      "cache_results": true
      "max_retries": 0
      "parallelise_shardable_queries": true
      "results_cache":
        "cache":
          "memcached":
            "expiration": "10800s"
            "batch_size": 1024
            "parallelism": 300
          "memcached_client":
            "host": "memcached-frontend.logs.svc"
            "service": "memcached"
    "schema_config":
      "configs":
      - "from": "2020-10-01"
        "index":
          "period": "24h"
          "prefix": "loki_index_"
        "object_store": "gcs"
        "schema": "v11"
        "store": "boltdb-shipper"
    "server":
      "graceful_shutdown_timeout": "5s"
      "grpc_server_max_concurrent_streams": 1000
      "grpc_server_max_recv_msg_size": 1104857600
      "grpc_server_max_send_msg_size": 1104857600
      "http_listen_port": 3100
      "http_server_idle_timeout": "3m"
      "http_server_write_timeout": "1m"
      "http_server_read_timeout": "15m"
    "storage_config":
      "boltdb_shipper":
        "active_index_directory": "/data/loki/index"
        "cache_location": "/data/loki/index_cache"
        "cache_ttl": "24h"
        "query_ready_num_days": 5
        "resync_interval": "5m"
        "shared_store": "gcs"
        "index_gateway_client":
          "server_address": "dns:///indexgateway:9095"
      "gcs":
        "bucket_name": cdl-logs
      "index_queries_cache_config":
        "memcached":
          "expiration": "43200s"
          "batch_size": 3096
          "parallelism": 256
        "memcached_client":
          "host": "memcached-index-queries.logs.svc"
          "service": "memcached"
    "chunk_store_config":
      "chunk_cache_config":
        "memcached":
          "expiration": "3600s"
          "batch_size": 3096
          "parallelism": 256
        "memcached_client":
          "host": "memcached-chunks.cdl-logs.svc"
          "service": "memcached"
  overrides.yaml: '{}'
kind: ConfigMap
metadata:
  annotations:
    meta.helm.sh/release-name: loki
  labels:
    app.kubernetes.io/instance: loki
    app.kubernetes.io/managed-by: Helm
  name: loki
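
For reference, these are the ring-related settings from the ConfigMap above that interact with autoforget (a condensed read-only excerpt, not additional configuration). The startup log's "after 1m0s with no heartbeat" matches the heartbeat_timeout below:

ingester:
  autoforget_unhealthy: true      # removal of unhealthy instances is enabled
  lifecycler:
    heartbeat_period: 1m          # how often this ingester heartbeats the ring
    ring:
      heartbeat_timeout: 1m       # an instance is considered unhealthy past this
      replication_factor: 1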
ahmedzidan commented 1 year ago

I have the same issue: when an ingester scales down, it doesn't leave the ring and is marked as unhealthy in the ring.
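
One knob worth checking on scale-down is whether the ingester is allowed to unregister from the ring during a graceful shutdown. A minimal sketch, assuming your Loki version exposes the lifecycler's unregister_on_shutdown setting (verify the key against your version's docs):

ingester:
  lifecycler:
    # If this is false, a scaled-down ingester deliberately stays in the ring
    # and will show up as unhealthy once its heartbeat times out. The default
    # is true, which removes the instance on a clean shutdown; a SIGKILL or
    # crash still leaves the entry behind.
    unregister_on_shutdown: true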

kirankh7 commented 1 year ago

+1, we are also facing the same issue:

level=warn ts=2023-09-25T17:13:55.410815291Z caller=logging.go:86 traceID=abc orgID=fake msg="POST /loki/api/v1/push (500) 204.393µs Response: \"at least 2 live replicas required, could only find 1 - unhealthy instances: 1.2.3.4:9095\\n\" ws: false; Content-Length: 6243; Content-Type: application/x-protobuf; User-Agent: promtail/2.8.4; X-B3-Parentspanid: addd; X-B3-Sampled: 0; X-B3-Spanid: 8cf855c24430fce7; X-B3-Traceid: sss; X-Envoy-Attempt-Count: 1; X-Envoy-External-Address: 136.147.62.8; X-Forwarded-Client-Cert: Hash=abc;Cert=\"-----BEGIN%20CERTIFICATE-----
ec-appsoss commented 1 year ago

+1 on this. I have also made the following changes. For context, we are using Loki OSS, running 3 replicas with a replication factor of 2.

memberlist.rejoin_interval: 30s
wal.enabled: false
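
Expanded into config-file form, those two changes would look roughly like this (a sketch, assuming the dotted keys map to the memberlist section and the ingester WAL block):

memberlist:
  rejoin_interval: 30s   # periodically re-join the memberlist cluster
ingester:
  wal:
    enabled: false       # WAL disabled, as noted above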
chakri553 commented 4 months ago

+1 on this. I am using loki-distributed. Is there a fix for this?

level=warn ts=2024-06-25T08:28:15.471738932Z caller=logging.go:123 traceID=49e7eefb1868e86a orgID=fake msg="POST /loki/api/v1/push (500) 401.433µs Response: \"at least 1 live replicas required, could only find 0 - unhealthy instances: 172.39.3.250:9095\\n\" ws: false; Accept: */*; Connection: close; Content-Length: 311; Content-Type: application/json; User-Agent: curl/7.81.0; "