grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Grafana/Loki - Write to S3 Bucket unsuccessful - failed to flush chunks: store put chunk: WebIdentityErr: failed to retrieve credentials caused by: SerializationError #10657

Open BJWRD opened 1 year ago

BJWRD commented 1 year ago

Hello,

I'm attempting to deploy the Grafana/Loki Helm Chart on an EKS Cluster.

Once deployed all pods are successfully running other than the three loki-write pods -

loki-write-0                                        0/1     Running   0             98m
loki-write-1                                        0/1     Running   0             98m
loki-write-2                                        0/1     Running   0             98m

When viewing the logs, the most notable error received is the one below

failed to flush chunks: store put chunk: WebIdentityErr: failed to retrieve credentials caused by: SerializationError

I've seen multiple previously raised issues about writing Loki logs to S3 buckets, and I've deployed multiple iterations of the Helm Chart values in an attempt to get this working, to no avail.

Is there anyone who can help identify the source of my issue? See further information below -

Helm Chart Values -

values:
  serviceAccount:
    create: true
    name: loki-sa
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::********:role/**********"
  loki:
    auth_enabled: false
    storage:
      type: s3
      s3:
        endpoint: https://eu-west-2.s3.amazonaws.com
        region: eu-west-2a
        s3ForcePathStyle: false
        insecure: false
      bucketNames:
        chunks: loki-logs/chunks
        ruler: loki-logs/ruler
        admin: loki-logs/admin
  schema_config:
    configs:
      - from: 2023-04-13
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 1h
    aws:
      region: eu-west-2
      bucketnames: loki-logs

IAM Role Policy

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::loki-logs"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": "s3:ListObject",
            "Resource": "arn:aws:s3:::loki-logs"
        }
    ]
}

IAM Role Trust Relationship to EKS Cluster

{
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": {
                  "Service": "s3.amazonaws.com"
              },
              "Action": "sts:AssumeRole"
          },
          {
              "Effect": "Allow",
              "Principal": {
                  "Federated": "arn:aws:iam::************:oidc-provider/oidc.eks.eu-west-2.amazonaws.com/id/******************************************"
              },
              "Action": "sts:AssumeRoleWithWebIdentity",
              "Condition": {
                  "StringEquals": {
                      "oidc.eks.eu-west-2.amazonaws.com/id/******************************************:aud": "sts.amazonaws.com",
                      "oidc.eks.eu-west-2.amazonaws.com/id/******************************************:sub": "system:serviceaccount:loki:loki-sa"
                  }
              }
          }
      ]
}

bdols commented 1 year ago

I'm also seeing this after adding an endpoint to the storage_config. From /etc/loki/config/config.yaml:

  aws:
    bucketnames: <unique>-loki-logs
    endpoint: https://bucket.vpce-<id_str>.s3.us-east-1.vpce.amazonaws.com
    s3: s3://us-east-1
    sse_encryption: true
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 168h
    index_gateway_client:
      log_gateway_requests: true
      server_address: dns:///loki-index-gateway:9095
    shared_store: s3
  filesystem:
    directory: /var/loki/chunks

d-m commented 1 year ago

I'm also getting the same S3 error in the logs of the loki-write pods, along with the bytes of an error message that reads:

<Error>
  <Code>MethodNotAllowed</Code>
  <Message>
    The specified method is not allowed against this resource.
  </Message>
  <Method>POST</Method>
  <ResourceType>SERVICE</ResourceType>
  <RequestId>...</RequestId>
  <HostId>...</HostId>
</Error>

d-m commented 1 year ago

I found a similar error at https://gitlab.com/gitlab-org/charts/gitlab/-/issues/3148, where the recommended solution was to remove the endpoint configuration.

I tried removing the endpoint here as well, and that resolved the problem with Loki for me.
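
For reference, a minimal sketch of the chart values from the original post with the endpoint removed. It is an illustration, not a confirmed fix. It assumes the AWS SDK resolves the S3 endpoint from the region once no explicit endpoint is set. It also uses the region name eu-west-2 where the original post has eu-west-2a (an availability zone name rather than a region), and the bare bucket name loki-logs for every entry is likewise an assumption.

```yaml
serviceAccount:
  create: true
  name: loki-sa
  annotations:
    # IRSA role annotation, unchanged from the original post (redacted there).
    eks.amazonaws.com/role-arn: "arn:aws:iam::********:role/**********"
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      # No `endpoint` key: per the comment above, removing it resolved the error.
      # Assumption: a region name (eu-west-2), not an availability zone (eu-west-2a).
      region: eu-west-2
      s3ForcePathStyle: false
      insecure: false
    bucketNames:
      # Assumption: plain bucket names without a path suffix.
      chunks: loki-logs
      ruler: loki-logs
      admin: loki-logs
```

The MethodNotAllowed response to a POST quoted earlier would be consistent with the credential request being sent to the configured S3 endpoint, although nothing in this thread confirms that.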

dmitry-mightydevops commented 11 months ago

On my side the problem was in the grafana/loki Helm chart values: loki.storage.s3.region was missing. After adding it, the error was gone.

- name: loki.storage.s3.region
  value: {{ .Values.region }}
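
In plain values.yaml form, the name/value entry above would presumably look like the sketch below. The key path loki.storage.s3.region comes from the comment itself; the region value is a placeholder assumption and should be replaced with the region of your own bucket.

```yaml
loki:
  storage:
    s3:
      # Assumption: example value only; use the AWS region that hosts your bucket.
      region: eu-west-2
```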

periklis commented 11 months ago

Refer to this already reported workaround: https://github.com/grafana/loki/issues/5437#issuecomment-1158862015

kaiyuanlim commented 3 months ago

Hi, I don't think the workaround is valid because some of us want to use the endpoint.

In my case, I want to use a vpc endpoint so that the IPs are somewhat more static and it is easier to maintain the IP whitelisting and netpols in our cluster.

nissimuseri commented 3 months ago

Hi, I'm getting the same issue on my setup, even with the workaround suggested here. The same error:

failed to flush chunks: store put chunk: WebIdentityErr: failed to retrieve credentials caused by: SerializationError

occurs when I'm using a ServiceAccount, but when using access keys it doesn't happen. It seems to be a bug in this chart when trying to deploy an app version above 3.x.

hkhairy commented 3 months ago

So, I removed the service account annotation that should've allowed the pod to assume the IAM role, and instead had a role on the EC2 instance itself, and it worked. I'm using Loki 2.9.4

This is weird, as I'm using service account annotations for IAM Roles for Service Accounts (IRSA) with Mimir and Tempo, and it works fine there.

I don't know whether IRSA is fixed in Loki 3 or above.

sepulworld commented 2 months ago

Same issue using a service account and IRSA role, with the latest Helm chart and Loki 3.x.

IAM role, policy and trust policy are good

s3 bucket policy is good

k8s service account is good

Yet the indexer is not using the web identity credentials properly.

I switched to IAM keys, and it works fine when defining the full S3 path.