grafana / helm-charts

[loki-distributed] Error Ingester read-only file system to S3 with TSDB in EKS #3133

Open northonheld opened 6 months ago

northonheld commented 6 months ago

Hello, even though I followed the official documentation on configuring S3, the ingester is unable to communicate with the S3 bucket.

Chart version loki-distributed: 0.79.0

I took these values from the linked documentation:

  schemaConfig:
    configs:
      - from: "2020-07-01"
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h

  storageConfig:
    tsdb_shipper:
      active_index_directory: /loki/index
      cache_location: /loki/index_cache
      cache_ttl: 24h
    aws:
      s3: s3://us-east-1
      bucketnames: xxxx-infra-loki

serviceAccount:
  create: true
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::1XXXXXXXXX0:role/sa-bucket-loki-infra"
  automountServiceAccountToken: true

ingester:
  persistence:
    enabled: false

I created the objects in AWS using Crossplane, based on the Terraform module officially provided at this link

---
apiVersion: iam.aws.crossplane.io/v1beta1
kind: Policy
metadata:
  name: sa-bucket-loki-infra-policy
  namespace: loki
spec:
  deletionPolicy: Delete
  forProvider:
    name: sa-bucket-loki-infra-policy
    description: Allow access to S3 loki
    document: >
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Effect": "Allow",
            "Action": [
              "s3:*"
            ],
            "Resource": [
              "arn:aws:s3:::xxxx-infra-loki",
              "arn:aws:s3:::xxxx-infra-loki/*"
            ]
          }
        ]
      }
  providerConfigRef:
    name: crossplane-cred-infra
---
apiVersion: iam.aws.crossplane.io/v1beta1
kind: Role
metadata:
  name: sa-bucket-loki-infra-role
  namespace: loki
spec:
  forProvider:
    assumeRolePolicyDocument: |
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Action": "sts:AssumeRoleWithWebIdentity",
                  "Principal": {
                      "Federated": "arn:aws:iam::1XXXXXXXXXX0:oidc-provider/oidc.eks.us-east-1.amazonaws.com/id/4XXXXXXXXXXXXXXXXXXXXXXXXXXXXX8"
                  },
                  "Condition": {
                      "StringEquals": {
                          "oidc.eks.us-east-1.amazonaws.com/id/4XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX8:aud": [
                              "sts.amazonaws.com"
                          ],
                          "oidc.eks.us-east-1.amazonaws.com/id/4XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX8:sub": [
                              "system:serviceaccount:loki:loki-eks-infra-loki-distributed"
                          ]
                      }
                  }
              }
          ]
      }
    tags:
      - key: loki
        value: infra
  providerConfigRef:
    name: crossplane-cred-infra
---
apiVersion: iam.aws.crossplane.io/v1beta1
kind: RolePolicyAttachment
metadata:
  name: sa-bucket-loki-infra-att
  namespace: loki
spec:
  forProvider:
    policyArnRef:
      name: sa-bucket-loki-infra-policy
    roleNameRef:
      name: sa-bucket-loki-infra-role
  providerConfigRef:
    name: crossplane-cred-infra
---
apiVersion: s3.aws.crossplane.io/v1beta1
kind: Bucket
metadata:
  name: xxxxx-infra-loki
  namespace: loki
spec:
  deletionPolicy: Orphan
  forProvider:
    locationConstraint: us-east-1
    objectOwnership: BucketOwnerEnforced
    paymentConfiguration:
      payer: BucketOwner
    publicAccessBlockConfiguration:
      blockPublicAcls: true
      blockPublicPolicy: true
      ignorePublicAcls: true
      restrictPublicBuckets: true
    tagging:
      tagSet:
        - key: xxxx
          value: xxx
  providerConfigRef:
    name: crossplane-cred-infra

I think I created it correctly. Trust relationships from the role: trsut-sts

Policy Attachment: role-policy

I checked the service account in EKS and its annotation is correct: kubectl get sa/loki-eks-infra-loki-distributed -n loki -o yaml

apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::1XXXXXXXXX0:role/sa-bucket-loki-infra-role
  labels:
    app.kubernetes.io/instance: loki-eks-infra
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: loki-distributed
    app.kubernetes.io/version: 2.9.6
    argocd.argoproj.io/instance: loki-eks-infra
    helm.sh/chart: loki-distributed-0.79.0
  name: loki-eks-infra-loki-distributed
  namespace: loki

And the ingester StatefulSet: kubectl get sts/loki-eks-infra-loki-distributed-ingester -n loki -o yaml

      serviceAccount: loki-eks-infra-loki-distributed
      serviceAccountName: loki-eks-infra-loki-distributed

But I'm still getting this error in Ingester:

loki-eks-infra-loki-distributed-ingester-0 ingester mkdir /loki/index: read-only file system
loki-eks-infra-loki-distributed-ingester-0 ingester error initialising module: store
loki-eks-infra-loki-distributed-ingester-0 ingester github.com/grafana/dskit/modules.(*Manager).initModule
loki-eks-infra-loki-distributed-ingester-0 ingester     /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:138
loki-eks-infra-loki-distributed-ingester-0 ingester github.com/grafana/dskit/modules.(*Manager).InitModuleServices
loki-eks-infra-loki-distributed-ingester-0 ingester     /src/loki/vendor/github.com/grafana/dskit/modules/modules.go:108
loki-eks-infra-loki-distributed-ingester-0 ingester github.com/grafana/loki/pkg/loki.(*Loki).Run
loki-eks-infra-loki-distributed-ingester-0 ingester     /src/loki/pkg/loki/loki.go:461
loki-eks-infra-loki-distributed-ingester-0 ingester main.main
loki-eks-infra-loki-distributed-ingester-0 ingester     /src/loki/cmd/loki/main.go:110
loki-eks-infra-loki-distributed-ingester-0 ingester runtime.main
loki-eks-infra-loki-distributed-ingester-0 ingester     /usr/local/go/src/runtime/proc.go:267
loki-eks-infra-loki-distributed-ingester-0 ingester runtime.goexit
loki-eks-infra-loki-distributed-ingester-0 ingester     /usr/local/go/src/runtime/asm_amd64.s:1650

I can't figure out where my mistake lies. I followed all the guidelines in the official documentation. Why doesn't it work? Did I forget to enable something?

This is the current view of my pods

NAME                                                              READY   STATUS             RESTARTS          AGE
loki-eks-infra-loki-distributed-distributor-6955bf69b7-4p76p      1/1     Running            0                 66m
loki-eks-infra-loki-distributed-distributor-6955bf69b7-n6zz9      1/1     Running            0                 3h8m
loki-eks-infra-loki-distributed-index-gateway-0                   0/1     CrashLoopBackOff   506 (64s ago)     42h
loki-eks-infra-loki-distributed-ingester-0                        0/1     CrashLoopBackOff   21 (3m27s ago)    86m
loki-eks-infra-loki-distributed-ingester-1                        0/1     CrashLoopBackOff   21 (3m21s ago)    86m
loki-eks-infra-loki-distributed-ingester-2                        0/1     CrashLoopBackOff   21 (3m33s ago)    86m
loki-eks-infra-loki-distributed-querier-769d9f7c75-dfr2v          1/1     Running            0                 66m
loki-eks-infra-loki-distributed-querier-769d9f7c75-s8zzj          1/1     Running            0                 3h9m
loki-eks-infra-loki-distributed-query-frontend-6c8f5f5f54-cbh46   0/1     Running            9 (3m13s ago)     51m
loki-eks-infra-loki-distributed-query-frontend-c576b64-gfn4p      0/1     Running            481 (5m16s ago)   42h

wondersd commented 4 months ago

@northonheld I think you'll want to remove the filesystem path overrides added here:

storageConfig:
  tsdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache

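The data volume is mounted at /var/loki for the ingester and the index gateway, so /loki/index lies outside it, which matches the "mkdir /loki/index: read-only file system" error: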

https://github.com/grafana/helm-charts/blob/loki-distributed-0.79.0/charts/loki-distributed/templates/ingester/statefulset-ingester.yaml#L120-L121

https://github.com/grafana/helm-charts/blob/loki-distributed-0.79.0/charts/loki-distributed/templates/index-gateway/statefulset-index-gateway.yaml#L111-L112

And the chart defaults this configuration to the correct values:

https://github.com/grafana/helm-charts/blob/loki-distributed-0.79.0/charts/loki-distributed/values.yaml#L244-L245
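
If you would rather keep explicit tsdb_shipper paths than rely on the chart defaults, a minimal sketch of the values could look like the following. It assumes the settings sit under the chart's top-level loki: key (the original snippet does not show the parent key), and the directory names tsdb-index and tsdb-cache are hypothetical. The only point is that both paths live under /var/loki, where the data volume is mounted:

```yaml
loki:
  storageConfig:
    tsdb_shipper:
      # Assumption: any directory under /var/loki should be writable,
      # because that is where the chart mounts the data volume.
      # The names tsdb-index and tsdb-cache are illustrative only.
      active_index_directory: /var/loki/tsdb-index
      cache_location: /var/loki/tsdb-cache
      cache_ttl: 24h
    aws:
      # Copied unchanged from the original report.
      s3: s3://us-east-1
      bucketnames: xxxx-infra-loki
```

The same paths would apply to the index gateway, since its data volume is mounted at /var/loki as well.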

kites-k8s commented 1 week ago

Thanks @wondersd, it worked. It seems this configuration is NOT valid for distributed mode.

  tsdb_shipper:
    active_index_directory: /data/tsdb-index
    cache_location: /data/tsdb-cache

https://grafana.com/docs/loki/latest/operations/storage/tsdb/#example-configuration

At least I am no longer hitting the CrashLoopBackOff caused by the read-only file system error.