grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki

Error 500 "empty ring" when querying Loki #7614

Open Amfiass opened 1 year ago

Amfiass commented 1 year ago

Hello!

I ran into a problem while setting up a Loki cluster. The cluster consists of 5 nodes: 1) an Nginx proxy, 2) two Loki readers, 3) two Loki writers.

All nodes use the same configuration file, differing only in the target setting (a sketch of the corresponding write-node variant follows the file):

auth_enabled: false
target: "read"

memberlist:
  node_name: "loki-reader2.example.com"
  join_members:
    - "loki-reader1.example.com"
    - "loki-reader2.example.com"
    - "loki-writer1.example.com"
    - "loki-writer2.example.com"
  dead_node_reclaim_time: "30s"
  gossip_to_dead_nodes_time: "15s"
  left_ingesters_timeout: "30s"
  bind_addr: 
    - "0.0.0.0"
  bind_port: 7946

server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_read_timeout: "1h"
  http_server_write_timeout: "5m"
  http_server_idle_timeout: "5m"
  grpc_server_max_recv_msg_size: 41943040
  grpc_server_max_send_msg_size: 41943040
  log_level: debug

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: memberlist

querier:
  max_concurrent: 10
  query_ingesters_within: 5h
  query_timeout: 1h
  engine:
    timeout: 5m

schema_config:
  configs:
    - from: 2022-10-01
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

compactor:
  retention_enabled: true
  retention_delete_delay: "1h"
  delete_request_cancel_period: "1h"
  working_directory: /loki/compactor
  shared_store: s3
  retention_delete_worker_count: 10
  compactor_ring:
    kvstore:
      store: memberlist

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index-cache
    shared_store: s3
  aws:
    bucketnames: loki
    endpoint: https://s3.example.com
    access_key_id: -----------------
    secret_access_key: ----------------------
    insecure: true
    http_config:
      idle_conn_timeout: 5m
      response_header_timeout: 0s
      insecure_skip_verify: true
    s3forcepathstyle: true

limits_config:
  retention_period: "4320h"
  per_stream_rate_limit: "500MB"
  per_stream_rate_limit_burst: "1000MB"
  max_concurrent_tail_requests: 10000
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 100
  max_entries_limit_per_query: 100000

ingester:
  lifecycler:
    join_after: "60s"
    observe_period: "5s"
    ring:
      replication_factor: 1
      kvstore:
        store: memberlist
    final_sleep: "0s"
  max_chunk_age: "240h"
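
The write nodes use the same file, with only the per-node keys changed — roughly like this (assuming each node also sets its own node_name):

target: "write"

memberlist:
  node_name: "loki-writer1.example.com"   # each node's own hostname
  # join_members, ports and all other settings stay identical to the reader config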

All health checks pass and the service starts. According to the logs, Loki was able to locate all members of the ring. But when I try to connect the cluster to Grafana, I get a 500 error:

Loki: Internal Server Error. 500. empty ring

In the service logs I see the following:

ts=2022-11-07T13:12:13.926196155Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Ingester.TotalReached=0 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.TotalChunksRef=0 Ingester.TotalChunksDownloaded=0 Ingester.ChunksDownloadTime=0s Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Querier.TotalChunksRef=0 Querier.TotalChunksDownloaded=0 Querier.ChunksDownloadTime=0s Querier.HeadChunkBytes="0 B" Querier.HeadChunkLines=0 Querier.DecompressedBytes="0 B" Querier.DecompressedLines=0 Querier.CompressedBytes="0 B" Querier.TotalDuplicates=0
ts=2022-11-07T13:12:13.927323183Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Summary.BytesProcessedPerSecond="0 B" Summary.LinesProcessedPerSecond=0 Summary.TotalBytesProcessed="0 B" Summary.TotalLinesProcessed=0 Summary.ExecTime=35.563µs Summary.QueueTime=0s
ts=2022-11-07T13:12:13.927996495Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=35.563µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=debug ts=2022-11-07T13:12:13.931994093Z caller=grpc_logging.go:46 method=/frontendv2pb.FrontendForQuerier/QueryResult duration=105.52µs msg="gRPC (success)"
level=error ts=2022-11-07T13:12:13.932669943Z caller=retry.go:73 org_id=fake msg="error processing request" try=0 err="rpc error: code = Code(500) desc = empty ring\n"
ts=2022-11-07T13:12:13.934128211Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Ingester.TotalReached=0 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.TotalChunksRef=0 Ingester.TotalChunksDownloaded=0 Ingester.ChunksDownloadTime=0s Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Querier.TotalChunksRef=0 Querier.TotalChunksDownloaded=0 Querier.ChunksDownloadTime=0s Querier.HeadChunkBytes="0 B" Querier.HeadChunkLines=0 Querier.DecompressedBytes="0 B" Querier.DecompressedLines=0 Querier.CompressedBytes="0 B" Querier.TotalDuplicates=0
ts=2022-11-07T13:12:13.934462835Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Summary.BytesProcessedPerSecond="0 B" Summary.LinesProcessedPerSecond=0 Summary.TotalBytesProcessed="0 B" Summary.TotalLinesProcessed=0 Summary.ExecTime=13.737µs Summary.QueueTime=0s
ts=2022-11-07T13:12:13.935224122Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=13.737µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=debug ts=2022-11-07T13:12:13.936167945Z caller=grpc_logging.go:46 duration=42.182µs method=/frontendv2pb.FrontendForQuerier/QueryResult msg="gRPC (success)"
level=error ts=2022-11-07T13:12:13.936833055Z caller=retry.go:73 org_id=fake msg="error processing request" try=1 err="rpc error: code = Code(500) desc = empty ring\n"
ts=2022-11-07T13:12:13.938052577Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Ingester.TotalReached=0 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.TotalChunksRef=0 Ingester.TotalChunksDownloaded=0 Ingester.ChunksDownloadTime=0s Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Querier.TotalChunksRef=0 Querier.TotalChunksDownloaded=0 Querier.ChunksDownloadTime=0s Querier.HeadChunkBytes="0 B" Querier.HeadChunkLines=0 Querier.DecompressedBytes="0 B" Querier.DecompressedLines=0 Querier.CompressedBytes="0 B" Querier.TotalDuplicates=0
ts=2022-11-07T13:12:13.938284912Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Summary.BytesProcessedPerSecond="0 B" Summary.LinesProcessedPerSecond=0 Summary.TotalBytesProcessed="0 B" Summary.TotalLinesProcessed=0 Summary.ExecTime=45.012µs Summary.QueueTime=0s
ts=2022-11-07T13:12:13.938793954Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=45.012µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=debug ts=2022-11-07T13:12:13.939859299Z caller=grpc_logging.go:46 duration=44.651µs method=/frontendv2pb.FrontendForQuerier/QueryResult msg="gRPC (success)"
level=error ts=2022-11-07T13:12:13.940325981Z caller=retry.go:73 org_id=fake msg="error processing request" try=2 err="rpc error: code = Code(500) desc = empty ring\n"
ts=2022-11-07T13:12:13.941421855Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Ingester.TotalReached=0 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.TotalChunksRef=0 Ingester.TotalChunksDownloaded=0 Ingester.ChunksDownloadTime=0s Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Querier.TotalChunksRef=0 Querier.TotalChunksDownloaded=0 Querier.ChunksDownloadTime=0s Querier.HeadChunkBytes="0 B" Querier.HeadChunkLines=0 Querier.DecompressedBytes="0 B" Querier.DecompressedLines=0 Querier.CompressedBytes="0 B" Querier.TotalDuplicates=0
ts=2022-11-07T13:12:13.941910948Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Summary.BytesProcessedPerSecond="0 B" Summary.LinesProcessedPerSecond=0 Summary.TotalBytesProcessed="0 B" Summary.TotalLinesProcessed=0 Summary.ExecTime=7.473µs Summary.QueueTime=0s
ts=2022-11-07T13:12:13.942403329Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=7.473µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=debug ts=2022-11-07T13:12:13.943278467Z caller=grpc_logging.go:46 method=/frontendv2pb.FrontendForQuerier/QueryResult duration=28.428µs msg="gRPC (success)"
level=error ts=2022-11-07T13:12:13.943797077Z caller=retry.go:73 org_id=fake msg="error processing request" try=3 err="rpc error: code = Code(500) desc = empty ring\n"
ts=2022-11-07T13:12:13.944847003Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Ingester.TotalReached=0 Ingester.TotalChunksMatched=0 Ingester.TotalBatches=0 Ingester.TotalLinesSent=0 Ingester.TotalChunksRef=0 Ingester.TotalChunksDownloaded=0 Ingester.ChunksDownloadTime=0s Ingester.HeadChunkBytes="0 B" Ingester.HeadChunkLines=0 Ingester.DecompressedBytes="0 B" Ingester.DecompressedLines=0 Ingester.CompressedBytes="0 B" Ingester.TotalDuplicates=0 Querier.TotalChunksRef=0 Querier.TotalChunksDownloaded=0 Querier.ChunksDownloadTime=0s Querier.HeadChunkBytes="0 B" Querier.HeadChunkLines=0 Querier.DecompressedBytes="0 B" Querier.DecompressedLines=0 Querier.CompressedBytes="0 B" Querier.TotalDuplicates=0
ts=2022-11-07T13:12:13.945103343Z caller=spanlogger.go:80 user=fake method=query.Label level=debug Summary.BytesProcessedPerSecond="0 B" Summary.LinesProcessedPerSecond=0 Summary.TotalBytesProcessed="0 B" Summary.TotalLinesProcessed=0 Summary.ExecTime=8.999µs Summary.QueueTime=0s
ts=2022-11-07T13:12:13.945650128Z caller=spanlogger.go:80 user=fake method=query.Label level=info org_id=fake latency=fast query_type=labels length=1h0m0s duration=8.999µs status=500 label= throughput=0B total_bytes=0B total_entries=0
level=debug ts=2022-11-07T13:12:13.946855527Z caller=grpc_logging.go:46 duration=50.382µs method=/frontendv2pb.FrontendForQuerier/QueryResult msg="gRPC (success)"
level=error ts=2022-11-07T13:12:13.947119279Z caller=retry.go:73 org_id=fake msg="error processing request" try=4 err="rpc error: code = Code(500) desc = empty ring\n"
level=error ts=2022-11-07T13:12:13.947827961Z caller=stats.go:57 component=frontend msg="failed to record query metrics" err="expected one of the *LokiRequest, *LokiInstantRequest, *LokiSeriesRequest, *LokiLabelNamesRequest, got "

At the same time, a single-server Loki installation works correctly: logs are written and read there without issues.

Please tell me where to dig; I cannot find the problem.

I also tried connecting a distributor to the Loki cluster, but logs are not ingested either; the push requests get a 404 error without any explanation:

level=debug ts=2022-11-07T11:35:11.566374631Z caller=logging.go:76 traceID=7dd739ced5088cbe orgID=fake msg="POST /loki/api/v1/push (404) 87.265µs"
level=debug ts=2022-11-07T11:35:11.653902837Z caller=logging.go:76 traceID=766c347a31371e33 orgID=fake msg="POST /loki/api/v1/push (404) 76.161µs"
level=debug ts=2022-11-07T11:35:11.906294651Z caller=logging.go:76 traceID=58fe0b96a10b3d6b orgID=fake msg="POST /loki/api/v1/push (404) 177.982µs"
level=debug ts=2022-11-07T11:35:11.922494924Z caller=logging.go:76 traceID=0f26b052f9416264 orgID=fake msg="POST /loki/api/v1/push (404) 88.356µs"

At the same time, the S3 storage is reachable: another Loki instance, running on a separate server, writes to it in parallel.

andrewkwok2022 commented 1 year ago

Hello @Amfiass, I have a similar issue (https://community.grafana.com/t/post-loki-api-v1-push-500-response-empty-ring-and-context-canceled/75544) and still have no idea how to fix it. Please let me know if you find a solution.

Schmoho commented 1 year ago

I am having the same issue. Any hint or help would be appreciated.

Amfiass commented 1 year ago

@Schmoho @andrewkwok2022 The problem is probably in the key-value store. A possible solution is to switch the ring kvstore to Consul or etcd. We ended up solving it by deploying the Loki cluster in Kubernetes, since ring membership there is established through the Kubernetes API.
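
A minimal sketch of what that kvstore change could look like (the Consul/etcd addresses below are placeholders, not taken from this thread):

# Ring kvstore alternatives to memberlist — pick one; addresses are assumed.
common:
  ring:
    kvstore:
      store: consul                         # or: etcd
      consul:
        host: "consul.example.com:8500"     # placeholder Consul address
      # etcd:
      #   endpoints:
      #     - "etcd.example.com:2379"       # placeholder etcd endpoint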

Schmoho commented 1 year ago

I'm using the inmemory KV store, and any kind of cloud storage is not an option here.

javier-torres commented 1 year ago

Hello @Schmoho @andrewkwok2022

The official documentation says:

In this mode the component microservices of Loki are bundled into two targets: -target=read and -target=write. The BoltDB compactor service will run as part of the read target.

You only need to deploy the compactor section on the read node. The config file must be different for -target=read and -target=write.

I had the same problem; after changing this, it works fine.

This is my basic config for the two targets:

### -target=read

# Custom Config
auth_enabled: false

server:
  http_listen_port: 3100

memberlist:
  join_members:
    - grafana-loki-read.poc.internal:7946
    - grafana-loki-write.poc.internal:7946

compactor:
  working_directory: /loki/compactor
  shared_store: s3

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: memberlist

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: s3
  aws:
    s3: s3://eu-west-1
    bucketnames: bucket-name

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

### -target=write

# Custom Config
auth_enabled: false

server:
  http_listen_port: 3100

memberlist:
  join_members:
    - grafana-loki-read.poc.internal:7946
    - grafana-loki-write.poc.internal:7946

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: memberlist

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: s3
  aws:
    s3: s3://eu-west-1
    bucketnames: bucket-name

schema_config:
  configs:
    - from: 2020-07-01
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

sazima commented 1 year ago

same issue

remram44 commented 1 year ago

I also have this issue. I am running Loki as a single container with common: ring: kvstore: store: inmemory.

remram44 commented 1 year ago

Is there documentation for how this "ring" thing is configured? What is the difference between memberlist and inmemory?

A search for inmemory through the entire documentation has like 2 hits. Judging by this issue, this seems like a critical component which could probably be described somewhere?
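
As far as I understand it (treat this as a sketch, not documentation): store: inmemory keeps the ring inside the single Loki process, so it only works for a monolithic single-instance deployment, while store: memberlist gossips ring state between the processes listed under join_members. Roughly:

# Option A — single monolithic process: the ring lives in process memory.
common:
  ring:
    kvstore:
      store: inmemory

# Option B — multiple processes: ring state is gossiped between the members.
memberlist:
  join_members:
    - loki-read.example.com:7946    # placeholder hostnames
    - loki-write.example.com:7946
common:
  ring:
    kvstore:
      store: memberlist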

HuyDo-95g commented 10 months ago

I also have the same issue.

sojjan1337 commented 9 months ago

Running loki-distributed. Promtail says: caller=client.go:419 component=client host=loki.test.com msg="error sending batch, will retry" status=500 tenant= error="server returned HTTP status 500 Internal Server Error (500): empty ring"

This error message shows up as soon as Loki goes down. Why can't Promtail reconnect correctly when Loki comes back up? If I reinstall Promtail after Loki is up, I have no issues. Or maybe it isn't actually a problem, because I can retrieve logs as usual.
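
Promtail does retry failed pushes with a backoff policy; a minimal sketch of the client-side knobs, with illustrative values (the URL is taken from the log line above, everything else is an assumption):

# Promtail client config (scrape_configs omitted) — illustrative values only.
clients:
  - url: http://loki.test.com/loki/api/v1/push
    backoff_config:
      min_period: 500ms   # initial wait between retries
      max_period: 5m      # upper bound on the retry interval
      max_retries: 20     # a batch is dropped after this many failed attempts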

Vinaum8 commented 4 months ago

same error here

LucaIcaro commented 3 days ago

Same error with a monolithic deployment and this config:

# authentication is handled by reverse proxy
auth_enabled: false

server:
  http_listen_port: 3101 # using non default port and leaving the default to nginx
  grpc_listen_port: 9096
  log_level: warn
  grpc_server_max_concurrent_streams: 1000

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki/data
  storage:
    filesystem:
      chunks_directory: /loki/data/chunks
      rules_directory: /loki/data/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  tsdb_shipper:
    active_index_directory: /loki/data/index
    cache_location: /loki/data/index_cache
  aws:
    s3: s3://my-s3-bucket
    s3forcepathstyle: true
    region: eu-west-1

pattern_ingester:
  enabled: true
  metric_aggregation:
    enabled: true
    loki_address: localhost:3101

ruler:
  alertmanager_url: http://localhost:9093

frontend:
  encoding: protobuf

Using this loki:

loki, version 3.2.0 (branch: k218, revision: 659f5421)
  build user:       root@003ce357cdf4
  build date:       2024-09-18T16:21:52Z
  go version:       go1.22.6
  platform:         linux/amd64
  tags:             netgo

EDIT: solved by changing the S3 configuration to this:

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: aws            # changed from "s3" to "aws"
      schema: v13
      index:
        prefix: index_
        period: 24h
...
storage_config:
  tsdb_shipper:
    active_index_directory: /loki/data/index
    cache_location: /loki/data/index_cache
  aws:
    s3: s3://eu-west-1             # changed: region here, bucket name below
    bucketnames: mybucket

I followed the configuration examples (https://grafana.com/docs/loki/latest/configure/examples/configuration-examples/#2-s3-cluster-exampleyaml) to write the first configuration, and that turned out to be the mistake.