grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki v3.2.0 error: "caller=series_index_store.go:629 org_id=fake traceID=5d9f58c4925be2d0 msg="error querying storage" err="database not open"" #14491

Open · PRAD-KANAKAPURA opened this issue 2 weeks ago

PRAD-KANAKAPURA commented 2 weeks ago

Describe the bug
We recently upgraded Loki from v2.9.2 to v3.2.0, both running in standalone mode inside Docker. Since this migration, we have encountered intermittent issues where Grafana is unable to reach Loki. The problem started only after moving to v3.2.0 and is inconsistent: Grafana successfully retrieves data for a while, but the issue reappears later.

Details:

I am still getting used to the tool and am not an expert, so kindly excuse me if the question is silly.

level=error ts=2024-10-15T14:42:02.837049822Z  caller=series_index_store.go:629 org_id=fake traceID=5d9f58c4925be2d0 msg="error querying storage" err="database not open"
level=error ts=2024-10-15T15:00:54.492457768Z caller=retry.go:95 org_id=fake traceID=2f6d3472812a83a2 msg="error processing request" try=2 query="last_over_time({filename=\"/var/log/failed_routes.log\"}\n|~ \"ROUTES\"\n| logfmt\n| unwrap ROUTES\n[10800s])" query_hash=1519236867 start=2024-10-15T15:00:50Z end=2024-10-15T15:00:50Z start_delta=4.492452468s end_delta=4.492452648s length=0s retry_in=4.214616769s err="rpc error: code = Code(500) desc = database not open"
level=warn ts=2024-10-15T15:01:07.522196405Z caller=logging.go:126 trace_id=2f6d3472812a83a2 orgID=fake msg="GET /loki/api/v1/query?direction=backward&query=last_over_time%28%7Bfilename%3D%22%2Fvar%2Flog%2Ffailed_routes.log%22%7D%0A%7C~+%22ROUTES%22%0A%7C+logfmt%0A%7C+unwrap+ROUTES%0A%5B10800s%5D%29&time=1729004450000000000 (500) 17.519488651s Response: \"database not open\" ws: false; Accept-Encoding: gzip; Connection: close; Fromalert: true; User-Agent: Grafana/10.4.0; X-Loki-Response-Encoding-Flags: categorize-labels; "

The following are the modifications we have made for the new schema.

To Reproduce
Steps to reproduce the behavior:

  1. The following is the loki-config.yml that we used for both the test and prod environments (a sketch of how it is mounted in Docker follows after the config).
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

    - from: 2024-10-21
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_tsdb_
        period: 24h
      chunks:
        prefix: "chunk_"
        period: 24h 

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/chunks/index
    cache_location: /tmp/loki/boltdb-cache

  tsdb_shipper:
    active_index_directory: /tmp/loki/chunks/index_tsdb
    cache_location: /tmp/loki/tsdb-cache
  filesystem:
    directory: /tmp/loki/chunks

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 10
  max_query_series: 5000
  retention_period: 2160h
  allow_structured_metadata: false

ruler:
  alertmanager_url: http://localhost:9093

query_scheduler:
  max_outstanding_requests_per_tenant: 4096

compactor:
  working_directory: /tmp/loki/retention
  retention_enabled: true
  retention_delete_delay: 48h
  delete_request_store: filesystem

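For context, a minimal sketch of how a config like this is typically mounted when running Loki standalone in Docker. The compose file name, host paths, and the named volume are assumptions for illustration; only the grafana/loki:3.2.0 image version, the ports (3100/9096), and the /tmp/loki path_prefix come from this issue.

# docker-compose.yml (illustrative sketch, not the reporter's actual setup)
services:
  loki:
    image: grafana/loki:3.2.0                       # version this issue was reported against
    command: -config.file=/etc/loki/loki-config.yml
    ports:
      - "3100:3100"                                 # HTTP API queried by Grafana
      - "9096:9096"                                 # gRPC
    volumes:
      - ./loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/tmp/loki                         # persist /tmp/loki (path_prefix) across restarts
volumes:
  loki-data:
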
Expected behavior
Grafana should be able to consistently retrieve data from Loki without intermittent connection failures, even after migrating to v3.2.0.

Environment:

(environment details provided as an attached screenshot)

JihadMotii-REISys commented 3 days ago

Hey @PRAD-KANAKAPURA, have you found the root cause of this issue?

PRAD-KANAKAPURA commented 2 days ago

> Hey @PRAD-KANAKAPURA, have you found the root cause of this issue?

Hi, yes. We suspected a migration issue from having two different schemas in place (boltdb and tsdb). Once the new tsdb schema period, dated from a future date, came into effect, it started to work well; it is all good now and there are no errors.
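
For anyone following along, the relevant part is the schema_config: per Loki's schema migration guidance, the new tsdb period's from date must still lie in the future (at UTC midnight) when the config is rolled out, so the older boltdb-shipper period keeps serving existing data until the switch-over. A trimmed, illustrative fragment (dates are examples only):

schema_config:
  configs:
    - from: 2020-10-24        # existing data keeps being read from boltdb-shipper / v11
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
    - from: 2024-10-21        # must still be a future date when this config is deployed
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_tsdb_
        period: 24h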

JStickler commented 2 days ago

@PRAD-KANAKAPURA It sounds like migrating to TSDB solved the problem. Can we close this issue?