grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Tempo ingesters register to Loki's ring #10172

Open jlynch93 opened 1 year ago

jlynch93 commented 1 year ago

Describe the bug Tempo ingesters registered to Loki's ingester ring, which caused Loki to go down and stop returning logs.

To Reproduce Steps to reproduce the behavior: unsure how to reproduce this issue, as it has never happened in our current deployment before.

Expected behavior Loki ingesters should register to Loki's ring, and Tempo ingesters should register to Tempo's ring.

Environment: The current deployment uses the tempo-distributed Helm chart on EKS. The Loki config is attached below:

```yaml
auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      ttl: 24h
common:
  compactor_address: http://loki-loki-distributed-compactor:3100
compactor:
  compaction_interval: 10m
  deletion_mode: filter-and-delete
  retention_delete_delay: 10m
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: s3
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 0
  scheduler_address: loki-loki-distributed-query-scheduler:9095
  tail_proxy_url: http://loki-loki-distributed-querier:3100
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 1048576000
    max_send_msg_size: 1677721600
  match_max_concurrent: false
  parallelism: 500
  scheduler_address: loki-loki-distributed-query-scheduler:9095
ingester:
  autoforget_unhealthy: true
  chunk_encoding: snappy
  chunk_idle_period: 30m
  chunk_target_size: 262144
  lifecycler:
    ring:
      heartbeat_timeout: 0
      kvstore:
        store: memberlist
      replication_factor: 1
  max_chunk_age: 24h
  max_transfer_retries: 0
  query_store_max_look_back_period: 0
  sync_min_utilization: 0.5
  wal:
    dir: /var/loki/wal
ingester_client:
  pool_config:
    remote_timeout: 10s
  remote_timeout: 60s
limits_config:
  cardinality_limit: 1000000
  ingestion_burst_size_mb: 20000
  ingestion_rate_mb: 1000
  max_cache_freshness_per_query: 10m
  max_concurrent_tail_requests: 200
  max_entries_limit_per_query: 5000000
  max_global_streams_per_user: 0
  max_query_length: 0
  max_query_series: 500000
  max_streams_per_user: 0
  per_stream_rate_limit: 1000MB
  per_stream_rate_limit_burst: 20000MB
  reject_old_samples: false
  reject_old_samples_max_age: 168h
  retention_period: 365d
  split_queries_by_interval: 24h
memberlist:
  join_members:
  - loki-loki-distributed-memberlist.grafana-loki.svc.cluster.local
  randomize_node_name: false
querier:
  engine:
    max_look_back_period: 60m
    timeout: 60m
  max_concurrent: 500000
  query_ingester_only: false
  query_store_only: false
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      default_validity: 24h
      embedded_cache:
        enabled: true
        ttl: 1h
      enable_fifocache: true
      fifocache:
        max_size_bytes: 10GB
        max_size_items: 0
        validity: 24h
query_scheduler:
  max_outstanding_requests_per_tenant: 500
  scheduler_ring:
    heartbeat_period: 0
    heartbeat_timeout: 0
    kvstore:
      store: memberlist
  use_scheduler_ring: false
ruler:
  alertmanager_url: http://prometheus-kube-prometheus-alertmanager.prometheus.svc:9093
  enable_api: true
  external_url: http://prometheus-kube-prometheus-alertmanager.prometheus.svc:9093
  ring:
    heartbeat_period: 0
    heartbeat_timeout: 0
    kvstore:
      store: inmemory
  rule_path: /opt/loki/ruler/scratch
  storage:
    local:
      directory: /opt/loki/ruler/rules
    type: local
schema_config:
  configs:
  - from: "2020-09-07"
    index:
      period: 24h
      prefix: loki_index_
    object_store: aws
    schema: v11
    store: boltdb-shipper
server:
  grpc_server_max_concurrent_streams: 0
  grpc_server_max_recv_msg_size: 419430400
  grpc_server_max_send_msg_size: 419430400
  http_listen_port: 3100
  http_server_idle_timeout: 15m
  http_server_read_timeout: 15m
  http_server_write_timeout: 15m
  log_level: info
storage_config:
  aws:
    backoff_config:
      max_period: 15s
      max_retries: 15
      min_period: 100ms
    bucketnames: HIDDEN
    http_config:
      idle_conn_timeout: 20m
    region: us-east-1
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    query_ready_num_days: 1
    resync_interval: 5m
    shared_store: s3
  index_queries_cache_config: null
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
```

Loki gateway nginx config

```nginx
worker_processes  5;  ## Default: 1
error_log  /dev/stderr;
pid        /tmp/nginx.pid;
worker_rlimit_nofile 8192;
events {
  worker_connections  4096;  ## Default: 1024
}
http {
  proxy_read_timeout 90000s;
  proxy_connect_timeout 90000s;
  proxy_send_timeout 90000s;
  fastcgi_read_timeout 90000s;
  client_body_temp_path /tmp/client_temp;
  proxy_temp_path       /tmp/proxy_temp_path;
  fastcgi_temp_path     /tmp/fastcgi_temp;
  uwsgi_temp_path       /tmp/uwsgi_temp;
  scgi_temp_path        /tmp/scgi_temp;
  default_type application/octet-stream;
  log_format   main '$remote_addr - $remote_user [$time_local]  $status '
        '"$request" $body_bytes_sent "$http_referer" '
        '"$http_user_agent" "$http_x_forwarded_for"';
  access_log   /dev/stderr  main;
  sendfile     on;
  tcp_nopush   on;
  resolver kube-dns.kube-system.svc.cluster.local;
  server {
    listen             8080;
    location = / {
      return 200 'OK';
      auth_basic off;
    }
    location = /api/prom/push {
      proxy_pass       http://loki-loki-distributed-distributor.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /api/prom/tail {
      proxy_pass       http://loki-loki-distributed-querier.grafana-loki.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    # Ruler
    location ~ /prometheus/api/v1/alerts.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /prometheus/api/v1/rules.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/rules.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/alerts.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass       http://loki-loki-distributed-querier.grafana-loki.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.grafana-loki.svc.cluster.local:3100$request_uri;
    }
  }
}
```

Screenshots, Promtail config, or terminal output The only log line that pointed us to the issue was:

```
level=warn ts=2023-08-04T14:32:18.386282517Z caller=logging.go:86 traceID=54e1a62fbdffbc09 orgID=fake msg="POST /loki/api/v1/push (500) 4.35479ms Response: \"rpc error: code = Unimplemented desc = unknown service logproto.Pusher\\n\" ws: false; Connection: close; Content-Length: 177219; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "
```

jlynch93 commented 1 year ago

I created the same issue in the Tempo repo: https://github.com/grafana/tempo/issues/2766. You can find the Tempo configs there as well!

pawankkamboj commented 1 year ago

We have also faced this issue, three times so far over the last year.

mzupan commented 1 year ago

I've dealt with this randomly. What I found worked was a fully qualified hostname for the join address plus a cluster_label, like this:

    memberlistConfig:
      cluster_label: loki-dev
      join_members:
        - loki-memberlist.loki-dev.svc.cluster.local:7946

On the Mimir side you can do the same thing; pretty sure you can with Tempo as well:

    memberlist:
      cluster_label: mimir
      join_members:
        - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}
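
For reference, a minimal sketch of how the same fix maps onto the Loki config file posted above (assuming the deployed Loki version supports `cluster_label` in its `memberlist` block; the value is arbitrary, it only has to differ from whatever label the Tempo ring uses):

```yaml
memberlist:
  # assumed label; any value that differs from the Tempo ring's label works
  cluster_label: loki
  join_members:
  - loki-loki-distributed-memberlist.grafana-loki.svc.cluster.local
  randomize_node_name: false
```

With distinct labels on each stack, memberlist rejects gossip packets carrying a different (or missing) label, so Tempo ingesters can no longer join Loki's ring even if the two gossip pools can reach each other.
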
MikaelElkiaer commented 10 months ago

> I've dealt with this randomly. What I found worked was a fully qualified hostname for the join address plus a cluster_label, like this:
>
>     memberlistConfig:
>       cluster_label: loki-dev
>       join_members:
>         - loki-memberlist.loki-dev.svc.cluster.local:7946
>
> On the Mimir side you can do the same thing; pretty sure you can with Tempo as well:
>
>     memberlist:
>       cluster_label: mimir
>       join_members:
>         - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}

Wow, this seems to do the trick, thanks!

But what a mess; how can this gotcha not be clearly documented somewhere? Before finding this comment I saw https://github.com/grafana/loki/issues/10537, which was not very helpful. https://github.com/grafana/mimir/issues/2865 is what pointed me in the right direction to finding this issue.

Edit: Spoke too soon, the problem persists...

Edit 2: Not sure whether it is fixed or not. At least I have not seen the error for a day now; it seemed to take the greater part of the weekend to stabilize.
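
If the ring keeps flapping for a while after the label is introduced, one plausible explanation (an assumption based on how memberlist cluster labels work, not something confirmed in this thread) is that members already running with the new label and members still running without it refuse each other's gossip until the rollout completes. A sketch of a staged rollout, assuming `cluster_label_verification_disabled` is available in the deployed Loki version:

```yaml
memberlist:
  # 1) roll out with verification disabled first (no label set yet)
  # 2) roll out the label while verification is still disabled
  cluster_label: loki
  cluster_label_verification_disabled: true
  # 3) roll out once more with cluster_label_verification_disabled back to false (the default)
```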