grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

v3.0.0: loki backend SIGSEGV if index_gateway.mode: ring #12270

Open awoimbee opened 6 months ago

awoimbee commented 6 months ago

Describe the bug
Running version grafana/loki:main-0bf894b, loki-backend (replicas: 1) crashes:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x288 pc=0x223f470]

goroutine 1 [running]:
github.com/grafana/loki/pkg/loki.(*Loki).updateConfigForShipperStore(0xc000638be0?)
    /src/loki/pkg/loki/modules.go:709 +0xb0
github.com/grafana/loki/pkg/loki.(*Loki).initBloomStore(0xc000d3c000)
    /src/loki/pkg/loki/modules.go:663 +0x68
github.com/grafana/dskit/modules.(*Manager).initModule(0xc000c86720, {0x7ffe92a04bb1, 0x7}, 0x0?, 0x42?)
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1f7
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x0?, {0xc000ce2990, 0x1, 0x40d39a?})
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xd8
github.com/grafana/loki/pkg/loki.(*Loki).Run(0xc000d3c000, {0x0?, {0x4?, 0x3?, 0x4751b00?}})
    /src/loki/pkg/loki/loki.go:431 +0x9d

Workaround: edit the ConfigMap and change index_gateway.mode from ring to simple. Note that I use tsdb; whether or not a boltdb config is present in storage_config makes no difference.
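
In config terms, the workaround is just the following (simple is the documented alternative to ring; everything else stays unchanged):

  index_gateway:
    # "ring" triggers the SIGSEGV on the backend target; "simple" avoids it
    mode: simple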

Environment:

awoimbee commented 5 months ago

Closing, since there have been a few releases since then; if it still happens I'll reopen.

Nissou31 commented 5 months ago

Happened for me today while deploying simple scalable Loki 3.0.0; only the backend pod is affected.

alexandergoncharovaspecta commented 5 months ago

Same problem; the only difference is that I have 3 pods: 2 are OK and 1 is in CrashLoopBackOff.

kubectl logs -n observability loki-backend-1 -c loki

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x288 pc=0x22f02b0]

goroutine 1 [running]:
github.com/grafana/loki/v3/pkg/loki.(*Loki).updateConfigForShipperStore(0xc0006d5ea0?)
    /src/loki/pkg/loki/modules.go:755 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomStore(0xc000cab500)
    /src/loki/pkg/loki/modules.go:715 +0x68
github.com/grafana/dskit/modules.(*Manager).initModule(0xc0004f2f90, {0x7fffb01fda84, 0x7}, 0x1?, 0xc00096e1e0?)
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1f7
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x0?, {0xc00097ca80, 0x1, 0xc0005a9b30?})
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xd8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc000cab500, {0x0?, {0x4?, 0x3?, 0x4912940?}})
    /src/loki/pkg/loki/loki.go:453 +0x9d
main.main()
    /src/loki/cmd/loki/main.go:122 +0x113b

chaudum commented 5 months ago

@alexandergoncharovaspecta Can you provide your config?

I am able to reproduce the bug on the release-3.0.x branch using

$ ./cmd/loki/loki -target=backend -index-gateway.mode=ring
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x288 pc=0x22efff0]

goroutine 1 [running]:
github.com/grafana/loki/v3/pkg/loki.(*Loki).updateConfigForShipperStore(0xc0008b8960?)
    /home/christian/sandbox/grafana/loki/pkg/loki/modules.go:755 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomStore(0xc0007c9500)
    /home/christian/sandbox/grafana/loki/pkg/loki/modules.go:715 +0x68
github.com/grafana/dskit/modules.(*Manager).initModule(0xc00063c780, {0x7fffab192a32, 0x7}, 0x1?, 0xc000eb8d20?)
    /home/christian/sandbox/grafana/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1f7
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x0?, {0xc000a0dc20, 0x1, 0xc000eb8bd0?})
    /home/christian/sandbox/grafana/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xd8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc0007c9500, {0x0?, {0x4?, 0x3?, 0x493d3e0?}})
    /home/christian/sandbox/grafana/loki/pkg/loki/loki.go:453 +0x9d
main.main()
    /home/christian/sandbox/grafana/loki/cmd/loki/main.go:122 +0x113b
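
Consistent with the workaround reported earlier in this thread, the same invocation with the index gateway in simple mode should start the backend target without panicking (a sketch, not output captured from this reproduction):

$ ./cmd/loki/loki -target=backend -index-gateway.mode=simple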
alexandergoncharovaspecta commented 5 months ago

@alexandergoncharovaspecta Can you provide your config?

Yes

Source: loki/templates/config.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: loki
  namespace: observability
  labels:
    helm.sh/chart: loki-6.3.4
    app.kubernetes.io/name: loki
    app.kubernetes.io/instance: loki
    app.kubernetes.io/version: "3.0.0"
    app.kubernetes.io/managed-by: Helm
data:
  config.yaml: |

auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    background:
      writeback_buffer: 500000
      writeback_goroutines: 1
      writeback_size_limit: 500MB
    default_validity: 0s
    memcached:
      batch_size: 4
      parallelism: 5
    memcached_client:
      addresses: dnssrvnoa+_memcached-client._tcp.loki-chunks-cache.observability.svc
      consistent_hash: true
      max_idle_conns: 72
      timeout: 2000ms
common:
  compactor_address: 'http://loki-backend:3100'
  path_prefix: /var/loki
  replication_factor: 3
  storage:
    azure:
      account_key: ${LOKI_AZURE_ACCOUNT_KEY}
      account_name: ${LOKI_AZURE_ACCOUNT_NAME}
      container_name: chunks
      use_federated_token: false
      use_managed_identity: false
frontend:
  scheduler_address: ""
  tail_proxy_url: http://loki-querier.observability.svc.cluster.local:3100
frontend_worker:
  scheduler_address: ""
index_gateway:
  mode: ring
limits_config:
  allow_structured_metadata: false
  max_cache_freshness_per_query: 10m
  max_query_parallelism: 32
  max_query_series: 100000
  query_timeout: 300s
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 720h
  split_queries_by_interval: 15m
  tsdb_max_query_parallelism: 512
  volume_enabled: true
memberlist:
  join_members:
  - loki-memberlist
pattern_ingester:
  enabled: false
querier:
  max_concurrent: 16
query_range:
  align_queries_with_step: true
  cache_results: true
  results_cache:
    cache:
      background:
        writeback_buffer: 500000
        writeback_goroutines: 1
        writeback_size_limit: 500MB
      default_validity: 12h
      memcached_client:
        addresses: dnssrvnoa+_memcached-client._tcp.loki-results-cache.observability.svc
        consistent_hash: true
        timeout: 500ms
        update_interval: 1m
query_scheduler:
  max_outstanding_requests_per_tenant: 32768
ruler:
  storage:
    azure:
      account_key: ${LOKI_AZURE_ACCOUNT_KEY}
      account_name: ${LOKI_AZURE_ACCOUNT_NAME}
      container_name: ruler
      use_federated_token: false
      use_managed_identity: false
    type: azure
runtime_config:
  file: /etc/loki/runtime-config/runtime-config.yaml
schema_config:
  configs:
  - from: "2024-02-29"
    index:
      period: 24h
      prefix: loki_index_
    object_store: azure
    schema: v13
    store: tsdb
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_read_timeout: 600s
  http_server_write_timeout: 600s
storage_config:
  boltdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.observability.svc.cluster.local:9095
  hedging:
    at: 250ms
    max_per_second: 20
    up_to: 3
  tsdb_shipper:
    index_gateway_client:
      server_address: dns+loki-backend-headless.observability.svc.cluster.local:9095
tracing:
  enabled: false
sslny57 commented 4 months ago

I am experiencing the same issue:

kubectl logs loki-backend-1 -c loki

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x288 pc=0x22f02b0]

goroutine 1 [running]:
github.com/grafana/loki/v3/pkg/loki.(*Loki).updateConfigForShipperStore(0xc0009e0f00?)
        /src/loki/pkg/loki/modules.go:755 +0xb0
github.com/grafana/loki/v3/pkg/loki.(*Loki).initBloomStore(0xc00178c000)
        /src/loki/pkg/loki/modules.go:715 +0x68
github.com/grafana/dskit/modules.(*Manager).initModule(0xc000010ea0, {0x7ffde2dd827d, 0x7}, 0x1?, 0xc0017800c0?)
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:136 +0x1f7
github.com/grafana/dskit/modules.(*Manager).InitModuleServices(0x0?, {0xc000b8bef0, 0x1, 0xc000b3fa40?})
        /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108 +0xd8
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run(0xc00178c000, {0x0?, {0x4?, 0x3?, 0x4912940?}})
        /src/loki/pkg/loki/loki.go:453 +0x9d
main.main()
        /src/loki/cmd/loki/main.go:122 +0x113b
sslny57 commented 4 months ago

What is the fix for this issue?

sslny57 commented 4 months ago

I see that changing index_gateway.mode from ring to simple was the fix, but now I am stuck with another error in the gateway pod: https://github.com/grafana/loki/issues/12912

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  7m18s                  default-scheduler  Successfully assigned vector/my-loki-gateway-66f8b59d65-jx7lw to ip-10-0-3-21.eu-west-2.compute.internal
  Normal   Pulled     7m18s                  kubelet            Container image "docker.io/nginxinc/nginx-unprivileged:1.24-alpine" already present on machine
  Normal   Created    7m18s                  kubelet            Created container nginx
  Normal   Started    7m18s                  kubelet            Started container nginx
  Warning  Unhealthy  2m8s (x33 over 6m58s)  kubelet            Readiness probe errored: strconv.Atoi: parsing "http": invalid syntax
sslny57 commented 4 months ago

Fixed this by making a change to the Helm values:

https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml#L337-L345

  readinessProbe:
    httpGet:
      path: /
      port: http-metrics
    initialDelaySeconds: 15
    initialDelaySeconds: 15
    timeoutSeconds: 1
sslny57 commented 4 months ago

The pod is coming up, but Loki is not working as expected:

Status: 500. Message: Get "http://loki-gateway.vector.svc.cluster.local/loki/api/v1/query_range?direction=backward&end=1716517388130000000&query=sum+by%28MAC%29+%28count_over_time%28%7BSTATUS%3D%22errObj.error.status%22%7D%5B15s%5D%29%29&start=1716495780000000000&step=15000ms": dial tcp: lookup loki-gateway.vector.svc.cluster.local: no such host

In 5.47.2 this used to work:

  readinessProbe:
    httpGet:
      path: /
      port: http
    initialDelaySeconds: 15
    timeoutSeconds: 1

When used, I am getting the same error:

   TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  5m7s                  default-scheduler  Successfully assigned vector/my-loki-gateway-66f8b59d65-drkgc to ip-10-0-1-223.eu-west-2.compute.internal
  Normal   Pulled     5m7s                  kubelet            Container image "docker.io/nginxinc/nginx-unprivileged:1.24-alpine" already present on machine
  Normal   Created    5m7s                  kubelet            Created container nginx
  Normal   Started    5m7s                  kubelet            Started container nginx
  Warning  Unhealthy  97s (x22 over 4m47s)  kubelet            Readiness probe errored: strconv.Atoi: parsing "http": invalid syntax
sslny57 commented 4 months ago

I had to use the service IP in the Vector endpoint, along with the previous fix:

  readinessProbe:
    httpGet:
      path: /
      port: http-metrics
    initialDelaySeconds: 15
    initialDelaySeconds: 15
    timeoutSeconds: 1

sinks:
    loki:
      type: "loki"
      inputs:
      - "lambda_source"
      # endpoint: "http://loki-gateway.vector.svc.cluster.local"
      endpoint: "http://10.160.197.234"
      path: "/loki/api/v1/push"
      encoding:
        codec: "json"
      tenant_id: "lokiprod"
      healthcheck:
        enabled: true
      labels:

Now it's working as expected.
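
For reference, hard-coding the ClusterIP as above can be avoided once DNS resolves; the IP used in the endpoint can be looked up with the following (assuming the gateway Service is named loki-gateway in the vector namespace, as the earlier error message suggests):

kubectl get svc loki-gateway -n vector -o jsonpath='{.spec.clusterIP}'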

abh commented 3 months ago

I ran into this crash too when upgrading from v2.9.x to v3.0.0. Changing the mode from ring to simple fixed the crash (but I'm still working through other problems).

acar-ctpe commented 3 months ago

I'm hitting the same problem.