grafana / loki


loki Pod (SingleBinary mode) crashes on start-up with an error "init compactor: failed to init delete store: failed to get s3 object: WebIdentityErr: failed to retrieve credentials" #13780

Open yskopets opened 3 months ago

yskopets commented 3 months ago

Describe the bug

level=info ts=2024-08-06T07:33:51.431999021Z caller=main.go:126 msg="Starting Loki" version="(version=3.1.0, branch=HEAD, revision=935aee77ed)"
level=info ts=2024-08-06T07:33:51.432052642Z caller=main.go:127 msg="Loading configuration file" filename=/etc/loki/config/config.yaml
level=info ts=2024-08-06T07:33:51.434841438Z caller=server.go:352 msg="server listening on addresses" http=[::]:3100 grpc=[::]:9095
level=info ts=2024-08-06T07:33:51.436482651Z caller=memberlist_client.go:435 msg="Using memberlist cluster label and node name" cluster_label= node=loki-0-a47f41d5
level=info ts=2024-08-06T07:33:51.439567902Z caller=memberlist_client.go:541 msg="memberlist fast-join starting" nodes_found=1 to_join=4
level=info ts=2024-08-06T07:33:51.442016002Z caller=memberlist_client.go:561 msg="memberlist fast-join finished" joined_nodes=1 elapsed_time=2.451066ms
level=info ts=2024-08-06T07:33:51.442042483Z caller=memberlist_client.go:573 phase=startup msg="joining memberlist cluster" join_members=loki-memberlist
level=info ts=2024-08-06T07:33:51.445214559Z caller=memberlist_client.go:580 phase=startup msg="joining memberlist cluster succeeded" reached_nodes=1 elapsed_time=3.163961ms
init compactor: failed to init delete store: failed to get s3 object: WebIdentityErr: failed to retrieve credentials
caused by: SerializationError: failed to unmarshal error message
    status code: 405, request id: 
caused by: UnmarshalError: failed to unmarshal error message
    00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version="1|
00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0" encoding="UT|
00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8"?>.<Error><C|
00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|
00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|
00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|
00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |
00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |
00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|
00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|
000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|
000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|
000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|
000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 4e 4a 43 57  |><RequestId>NJCW|
000000e0  44 36 43 36 39 46 33 4d  53 37 58 4d 3c 2f 52 65  |D6C69F3MS7XM</Re|
000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|
00000100  4f 4a 4c 70 2b 37 56 43  36 66 36 45 78 54 41 66  |OJLp+7VC6f6ExTAf|
00000110  61 48 77 74 51 6b 2f 34  67 64 30 79 50 6b 4f 59  |aHwtQk/4gd0yPkOY|
00000120  50 78 79 73 64 50 2b 5a  51 38 62 7a 6c 62 55 51  |PxysdP+ZQ8bzlbUQ|
00000130  58 4e 42 54 53 32 35 2f  47 6b 38 37 34 58 4e 37  |XNBTS25/Gk874XN7|
00000140  31 35 7a 72 47 39 59 50  4d 4f 45 3d 3c 2f 48 6f  |15zrG9YPMOE=</Ho|
00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|

caused by: unknown error response tag, {{ Error} []}
error initialising module: compactor
github.com/grafana/dskit/modules.(*Manager).initModule
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
github.com/grafana/dskit/modules.(*Manager).InitModuleServices
    /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
github.com/grafana/loki/v3/pkg/loki.(*Loki).Run
    /src/loki/pkg/loki/loki.go:458
main.main
    /src/loki/cmd/loki/main.go:129
runtime.main
    /usr/local/go/src/runtime/proc.go:271
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1695
level=error ts=2024-08-06T07:33:53.670573677Z caller=log.go:216 msg="error running loki" err="init compactor: failed to init delete store: failed to get s3 object: WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version=\"1|\n00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0\" encoding=\"UT|\n00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8\"?>.<Error><C|\n00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|\n00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|\n00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|\n00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |\n00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |\n00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|\n00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|\n000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|\n000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|\n000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|\n000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 4e 4a 43 57  |><RequestId>NJCW|\n000000e0  44 36 43 36 39 46 33 4d  53 37 58 4d 3c 2f 52 65  |D6C69F3MS7XM</Re|\n000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|\n00000100  4f 4a 4c 70 2b 37 56 43  36 66 36 45 78 54 41 66  |OJLp+7VC6f6ExTAf|\n00000110  61 48 77 74 51 6b 2f 34  67 64 30 79 50 6b 4f 59  |aHwtQk/4gd0yPkOY|\n00000120  50 78 79 73 64 50 2b 5a  51 38 62 7a 6c 62 55 51  |PxysdP+ZQ8bzlbUQ|\n00000130  58 4e 42 54 53 32 35 2f  47 6b 38 37 34 58 4e 37  |XNBTS25/Gk874XN7|\n00000140  31 35 7a 72 47 39 59 50  4d 4f 45 3d 3c 2f 48 6f  |15zrG9YPMOE=</Ho|\n00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|\n\ncaused by: unknown error response tag, {{ Error} []}\nerror initialising module: compactor\ngithub.com/grafana/dskit/modules.(*Manager).initModule\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138\ngithub.com/grafana/dskit/modules.(*Manager).InitModuleServices\n\t/src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108\ngithub.com/grafana/loki/v3/pkg/loki.(*Loki).Run\n\t/src/loki/pkg/loki/loki.go:458\nmain.main\n\t/src/loki/cmd/loki/main.go:129\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:271\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1695"

This is the entire Pod log. Loki container fails to start.
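
For readability, here are the fields of the XML error body embedded in the hex dump, transcribed from the dump's ASCII column. They are rendered as YAML only for legibility; the actual response is XML. The response rejects a POST request with HTTP 405. One possible reading, an inference that the log does not confirm, is that the credentials request reached an endpoint that does not accept it.

```yaml
# Transcribed from the ASCII column of the hex dump in the Pod log.
# The real payload is XML (<Error>...</Error>); YAML is used here only as a readable view.
Error:
  Code: MethodNotAllowed
  Message: The specified method is not allowed against this resource.
  Method: POST
  ResourceType: SERVICE
  RequestId: NJCWD6C69F3MS7XM
  HostId: OJLp+7VC6f6ExTAfaHwtQk/4gd0yPkOYPxysdP+ZQ8bzlbUQXNBTS25/Gk874XN715zrG9YPMOE=
```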

To Reproduce

Steps to reproduce the behavior:

  1. Deployed Loki v3.1.0 (Helm Chart, SingleBinary mode) on AWS EKS
  2. Configured AWS S3 storage and a retention period, using the following Helm values:

# see https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/
# see https://github.com/grafana/loki/blob/main/production/helm/loki/values.yaml

# based on https://github.com/grafana/loki/blob/v3.1.0/production/helm/loki/single-binary-values.yaml
---
loki:
  # Notice that the name `auth_enabled` is misleading.
  #
  # Loki DOES NOT support authentication - see https://grafana.com/docs/loki/latest/operations/authentication/.
  #
  # The actual effect of the `auth_enabled` field is documented in https://grafana.com/docs/loki/latest/operations/multi-tenancy/#multi-tenancy:
  #
  #   Loki defaults to running in multi-tenant mode. Multi-tenant mode is set in the configuration with `auth_enabled: true`.
  #   When configured with `auth_enabled: false`, Loki uses a single tenant. The `X-Scope-OrgID` header is not required in Loki API requests.
  #   The single tenant ID will be the string `fake`.
  #
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  storage:
    type: s3
    # see https://grafana.com/docs/loki/latest/setup/install/helm/install-monolithic/
    bucketNames:
      chunks: "${chucks_bucket_name}"
      ruler: "${ruler_bucket_name}"
      admin: "${admin_bucket_name}"
    s3:
      endpoint: "s3.${region}.amazonaws.com"
      region: "${region}"
  schemaConfig:
    configs:
      - from: 2024-04-01
        store: tsdb
        object_store: s3
        schema: v13
        index:
          prefix: loki_index_
          period: 24h
  ingester:
    chunk_encoding: snappy
  pattern_ingester:
    # see https://grafana.com/docs/grafana/latest/explore/simplified-exploration/logs/access/#install-in-loki
    enabled: true
  tracing:
    enabled: true
  querier:
    # Default is 4, if you have enough memory and CPU you can increase, reduce if OOMing
    max_concurrent: 2
  # Retention Configuration - see https://grafana.com/docs/loki/latest/operations/storage/retention/
  compactor:
    # Activate custom (per-stream,per-tenant) retention.
    retention_enabled: true
    # Store used for managing delete requests.
    delete_request_store: s3
  limits_config:
    # Retention period to apply to stored data, only applies if retention_enabled is
    # true in the compactor config.
    retention_period: 744h # 31 days
  # By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
  # analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
  #
  # Statistics help us better understand how Loki is used, and they show us performance
  # levels for most users. This helps us prioritize features and documentation.
  # For more information on what's sent, look at
  # https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
  # Refer to the buildReport method to see what goes into a report.
  #
  # To disable reporting, set the following:
  analytics:
    reporting_enabled: false

memberlist:
  service:
    # This setting is required in the SingleBinary mode
    # see https://github.com/grafana/loki/issues/7907#issuecomment-1445336799
    publishNotReadyAddresses: true

# The Loki canary pushes logs to and queries from this loki installation to test
# that it's working correctly
lokiCanary:
  enabled: true
  # Additional annotations for the `loki-canary` Daemonset
  annotations:
    # add annotations recognized by the auto-discovery mechanism of the monitoring platform
    prometheus.io/scrape: "true"
    prometheus.io/port: "3500"
    prometheus.io/path: /metrics
    k8s.grafana.com/job: "${loki_namespace}/loki-canary"

# SingleBinary: Loki is deployed as a single binary, useful for small installs typically without HA, up to a few tens of GB/day.
deploymentMode: SingleBinary
# Configuration for the single binary node(s)
singleBinary:
  replicas: 1
  resources:
    limits:
      cpu: 1
      memory: 2Gi
    requests:
      cpu: 0.5
      memory: 1Gi
  extraEnv:
    # Keep a little bit lower than memory limits
    - name: GOMEMLIMIT
      value: 1750MiB
  # Annotations for single binary pods
  podAnnotations:
    # add annotations recognized by the auto-discovery mechanism of the monitoring platform
    prometheus.io/scrape: "true"
    prometheus.io/port: "3100"
    prometheus.io/path: /metrics
    k8s.grafana.com/job: "${loki_namespace}/loki"

# memcached based results-cache
resultsCache:
  enabled: false

# memcached based chunks-cache
chunksCache:
  enabled: false

# By default this chart will deploy a Nginx container to act as a gateway which handles routing of traffic
# and can also do auth.
gateway:
  enabled: false

# Zero out replica counts of other deployment modes
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0

ingester:
  replicas: 0
querier:
  replicas: 0
queryFrontend:
  replicas: 0
queryScheduler:
  replicas: 0
distributor:
  replicas: 0
compactor:
  replicas: 0
indexGateway:
  replicas: 0
bloomCompactor:
  replicas: 0
bloomGateway:
  replicas: 0
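
The values above do not show how the Pod obtains AWS credentials. The `WebIdentityErr` in the log suggests a web identity token, which on EKS is commonly provided through IAM Roles for Service Accounts. This is an assumption about the setup, not something stated in the report. A minimal sketch of what that part of the values might look like, with a placeholder role ARN:

```yaml
# Sketch only: assumes IAM Roles for Service Accounts (IRSA) is used on the EKS cluster.
serviceAccount:
  create: true
  annotations:
    # Hypothetical placeholder; replace with the IAM role that grants access to the buckets.
    eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/<loki-s3-role>"
```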

Expected behavior

The error should clearly state which part of the config causes it.

yskopets commented 3 months ago

Apparently, the issue was caused by an explicit value for the endpoint field:

loki:
  storage:
    type: s3
    s3:
      endpoint: s3.us-east-2.amazonaws.com
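
A minimal sketch of the storage block with the explicit endpoint dropped, which is the change implied by the finding above. It reuses the placeholders from the original values. That the AWS SDK then resolves the S3 endpoint from `region` alone is an assumption, not something confirmed in this thread.

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: "${chucks_bucket_name}"
      ruler: "${ruler_bucket_name}"
      admin: "${admin_bucket_name}"
    s3:
      # `endpoint` is intentionally left unset here (assumption: the AWS SDK
      # derives the regional S3 endpoint from `region` on its own).
      region: "${region}"
```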

Well, it's confusing.

I took the configuration snippet

endpoint: s3.us-east-2.amazonaws.com

from the Mimir documentation: https://grafana.com/docs/helm-charts/mimir-distributed/latest/run-production-environment-with-helm/#configure-mimir-to-use-object-storage

Do Loki and Mimir need different values for the endpoint field?

Is it not a good idea, after all, to set an explicit value for the endpoint field?

JStickler commented 1 week ago

I took configuration snippet endpoint: s3.us-east-2.amazonaws.com from Mimir documentation

Are you actually setting up your AWS resources in the United States? Your GitHub profile says you're in the Netherlands, so I would have thought your AWS resources would be located somewhere in Europe. The value for endpoint should be YOUR endpoint, which is not necessarily going to match the example in the docs.