grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

grafana/loki helm chart installation pending on pvc storage even though S3 is configured? #9131

Closed: Robsta86 closed this issue 1 year ago

Robsta86 commented 1 year ago

Hi,

I am trying to deploy the grafana/loki Helm chart (version 5.0.0) on our EKS cluster, and I noticed that the read and write pods are stuck in the "Pending" state.

After doing some research I figured out that the pods are stuck in Pending because they rely on persistent storage. Since we have no persistent storage configured in our EKS cluster (and we are not intending to configure any), the PVCs for these pods are stuck in Pending as well:

NAME                STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
data-loki-read-0    Pending                                      gp2            10s
data-loki-read-1    Pending                                      gp2            10s
data-loki-read-2    Pending                                      gp2            10s
data-loki-write-0   Pending                                      gp2            10s
data-loki-write-1   Pending                                      gp2            10s
data-loki-write-2   Pending                                      gp2            10s

This is the values.yaml I used to deploy the chart:

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: **MASKED**

loki:
  storage:
    bucketNames:
      chunks: **MASKED**/chunks
      ruler: **MASKED**/ruler
      admin: **MASKED**/admin
    type: s3
    s3:
      endpoint: https://s3.eu-central-1.amazonaws.com/
      region: eu-central-1

  schema_config:
    configs:
      - from: 2023-04-13
        store: boltdb-shipper
        object_store: aws
        schema: v12
        index:
          prefix: index_
          period: 24h

I was under the impression that when S3 is configured as the storage backend there is no need for persistent storage within the cluster. Did I misconfigure something? Is there a bug in the Helm chart? Or... is this expected behavior?

If the latter is the case, it is actually a bit confusing, since the loki-distributed Helm chart (which is no longer recommended, according to a Grafana Loki Configuration webinar I just watched) has no need for persistent storage within Kubernetes, and the same goes for the loki-stack Helm chart.
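
For readers skimming the thread: the kind of override that later comments converge on looks roughly like the sketch below (single-binary mode with the chart's persistence disabled). The key names are taken from the values posted further down in this thread, so treat this as a starting point rather than a verified answer.

loki:
  commonConfig:
    # no PersistentVolume available, so point Loki's working directory at ephemeral storage
    path_prefix: /tmp/loki
singleBinary:
  # monolithic mode, with the chart's PVCs turned off
  replicas: 1
  persistence:
    enabled: false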

jkern888 commented 1 year ago

Weird timing, because I just ran into this same issue today and would very much like to hear an answer. Is the only solution for EKS to get the EBS CSI driver set up?
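
For context (not specific to this chart): a PVC sits in Pending when nothing can provision a volume for its StorageClass. On EKS the default gp2 class looks roughly like the sketch below, and on clusters running Kubernetes 1.23+ the in-tree kubernetes.io/aws-ebs provisioner only works if the EBS CSI driver add-on is installed (CSI migration), which is one reason claims can stay Pending even though the class exists. The manifest below is an illustration, not the exact object from any particular cluster.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs   # on 1.23+ this needs the EBS CSI driver installed (or use ebs.csi.aws.com directly)
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer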

rubenvw-ngdata commented 1 year ago

Not sure if this is going to resolve the complete problem, but in the schema config the object store should be s3 instead of aws I think.
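
Concretely, applying that suggestion to the values from the issue description would look like this (same snippet, with only object_store changed):

loki:
  schema_config:
    configs:
      - from: 2023-04-13
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: index_
          period: 24h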

Robsta86 commented 1 year ago

> Not sure if this is going to resolve the complete problem, but in the schema config the object store should be s3 instead of aws I think.

Unfortunately this won't resolve the problem; everything is still pending, waiting for a persistent volume that we cannot provision :)

slim-bean commented 1 year ago

Loki's primary storage is object storage; however, it does use a disk for a few things.

For the highest guarantees around not losing any data, you would need a persistent volume for these at a minimum on the ingester (or write) components.

For other components, like the compactor and index-gateway, ephemeral disk can be enough.

If you use ephemeral disk for the components on the write path, your only durability is provided by the replication factor. So with a replication factor of 3, losing more than one disk would result in some loss of recent data.
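
As a concrete illustration of that trade-off, the relevant knob in the chart's values is the replication factor (the key name appears in the values posted later in this thread). This is a sketch, not a recommendation:

loki:
  commonConfig:
    # with ephemeral disks on the write path, recently ingested (not yet flushed) data
    # exists only on the write pods, replicated across them;
    # a factor of 3 tolerates losing one write pod without losing recent data
    replication_factor: 3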

kundan89 commented 1 year ago

Events:

  Type     Reason            Age                    From               Message
  Warning  FailedScheduling  4m46s (x105 over 17h)  default-scheduler  running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition

Any solution for this?

imanAzi commented 1 year ago

We are in the same situation as described above. Is persistent storage needed for running Loki, or can you have a fully working setup of Loki in simple scalable mode using only S3?

phyzical commented 1 year ago

You can; it took me way too long to get it going.

Granted, I have not fully stress-tested it, but I confirmed that after a day and a reboot it can still see the previous day's logs, and the S3 bucket is populated, so it's good enough for my POC for now.

Keep in mind that the tmp dir is used, so you lose ~2 hours of data on reboot. This is adjustable, via the compactor I think.
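
For what it's worth, the size of that window is governed by how long chunks sit in the ingester before being flushed to object storage rather than by the compactor. The keys below are standard Loki ingester settings, but where they nest in this chart's values may differ by chart version, so treat this as a sketch:

ingester:
  # chunks are flushed when idle or when they reach max age; on an ephemeral-disk
  # setup this bounds how much recent data a restart can lose
  chunk_idle_period: 30m
  max_chunk_age: 2h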

Hopefully it helps. I think the important chunks are:

loki:
  ## handles the s3 magic
  storage:
    type: s3
    bucketNames:
      chunks: ${bucket}
      ruler: ${bucket}
    s3:
      region: ${region}
      bucketnames: ${bucket}
  commonConfig:
    # gotta set to tmp dir as no PV
    path_prefix: /tmp/loki
    replication_factor: 1
  schemaConfig:
    configs:
      - from: 2021-05-12
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: ${region}
      bucketnames: ${bucket}
singleBinary:
  replicas: 1
  persistence:
    enabled: false

Below are my full Helm values:

serviceAccount:
  create: false
  name: ${service_account_name}
memberlist:
  service:
    publishNotReadyAddresses: true
loki:
  storage:
    type: s3
    bucketNames:
      chunks: ${bucket}
      ruler: ${bucket}
    s3:
      region: ${region}
      bucketnames: ${bucket}
  commonConfig:
    # gotta set to tmp dir
    path_prefix: /tmp/loki
    replication_factor: 1
  auth_enabled: false
  limits_config:
    ingestion_rate_mb: 20
    ingestion_burst_size_mb: 30
  compactor:
    apply_retention_interval: 1h
    compaction_interval: 5m
    retention_delete_worker_count: 500
    retention_enabled: true
    shared_store: s3
  schemaConfig:
    configs:
      - from: 2021-05-12
        store: boltdb-shipper
        object_store: s3
        schema: v12
        index:
          prefix: loki_index_
          period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: ${region}
      bucketnames: ${bucket}
  server:
    http_listen_port: 3100
    grpc_server_max_recv_msg_size: 104857600 # 100 Mb
    grpc_server_max_send_msg_size: 104857600 # 100 Mb
    http_server_write_timeout: 310s
    http_server_read_timeout: 310s
  ingester_client:
    grpc_client_config:
      max_recv_msg_size: 104857600 # 100 Mb
      max_send_msg_size: 104857600 # 100 Mb
  service:
    port: 80
    targetPort: 3100
  url: http://${domain}:{{ .Values.loki.service.port }}
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 45
## makes it monolith matching the old stack way
singleBinary:
  replicas: 1
  persistence:
    enabled: false
monitoring:
  selfMonitoring:
    enabled: true
  lokiCanary:
    enabled: false
gateway:
  enabled: false
test:
  enabled: false
ingress:
  enabled: true
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internal
    # Use this annotation (which must match a service name) to route traffic to HTTP2 backends.
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80}]'
    external-dns.alpha.kubernetes.io/hostname: ${domain}
    alb.ingress.kubernetes.io/shield-advanced-protection: "true"
    external-dns.alpha.kubernetes.io/ingress-hostname-source: annotation-only
    kubernetes.io/ingress.class: alb
  hosts:
    - ${domain}
BJWRD commented 1 year ago

@phyzical Hey! By the looks of things I'm experiencing issues very similar to yours. Here's the error I receive:

WebIdentityErr: failed to retrieve credentials caused by: SerializationError

I've got the IAM Role/Policy/Trust Relationship correctly configured including the EKS OIDC parameters.

Here are my Helm values; can you see anything that could do with tweaking, please?

serviceAccount:
  create: true
  name: loki-sa
  annotations:
    eks.amazonaws.com/role-arn: "arn:aws:iam::**********:role/loki-bucket-role"
loki:
  auth_enabled: false
  storage:
    type: s3
    s3:
      endpoint: https://eu-west-2.s3.amazonaws.com
      region: eu-west-2a
      s3ForcePathStyle: false
      insecure: false
    bucketNames:
      chunks: loki-logs
      ruler: loki-logs
      admin: loki-logs
ingress:
  enabled: true
  annotations:
    cert-manager.io/cluster-issuer: clusterissuer
  hosts:
    - loki.test.com
schema_config:
  configs:
    - from: 2023-04-13
      store: boltdb-shipper
      object_store: s3
      schema: v12
      index:
        prefix: index_
        period: 24h
  storage_config:
    boltdb_shipper:
      shared_store: s3
      cache_ttl: 168h
    aws:
      region: eu-west-2
      bucketnames: loki-logs
      insecure: false
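
Two things in that snippet stand out and may or may not explain the SerializationError: the endpoint host is usually written as s3.<region>.amazonaws.com (region after the s3 prefix), and region expects a region name rather than an availability zone (eu-west-2, not eu-west-2a). A hedged correction of just those fields would look like:

loki:
  storage:
    type: s3
    s3:
      # standard regional endpoint form: s3.<region>.amazonaws.com
      endpoint: https://s3.eu-west-2.amazonaws.com
      # a region, not an availability zone
      region: eu-west-2
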
aravindhkudiyarasan commented 8 months ago

Do we have an update on this issue?

We are still getting an error in Loki when we explicitly specify the endpoint. We need it to be set to the S3 FIPS endpoint if we want to use it with GovCloud, so this is a blocker for us.

Loki config:

    bucketnames: loki-cluster-backend-1,loki-cluster-backend-2
    insecure: false
    region: us-west-2
    s3forcepathstyle: false
    endpoint: s3.us-west-2.amazonaws.com

{"caller":"table_manager.go:143","err":"WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000 3c 3f 78 6d 6c 20 76 65 72 73 69 6f 6e 3d 22 31 |<?xml version=\"1|\n00000010 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 55 54 |.0\" encoding=\"UT|\n00000020 46 2d 38 22 3f 3e 0a 3c 45 72 72 6f 72 3e 3c 43 |F-8\"?>.<C|\n00000030 6f 64 65 3e 4d 65 74 68 6f 64 4e 6f 74 41 6c 6c |ode>MethodNotAll|\n00000040 6f 77 65 64 3c 2f 43 6f 64 65 3e 3c 4d 65 73 73 |owed<Mess|\n00000050 61 67 65 3e 54 68 65 20 73 70 65 63 69 66 69 65 |age>The specifie|\n00000060 64 20 6d 65 74 68 6f 64 20 69 73 20 6e 6f 74 20 |d method is not |\n00000070 61 6c 6c 6f 77 65 64 20 61 67 61 69 6e 73 74 20 |allowed against |\n00000080 74 68 69 73 20 72 65 73 6f 75 72 63 65 2e 3c 2f |this resource.</|\n00000090 4d 65 73 73 61 67 65 3e 3c 4d 65 74 68 6f 64 3e |Message>|\n000000a0 50 4f 53 54 3c 2f 4d 65 74 68 6f 64 3e 3c 52 65 |POST<Re|\n000000b0 73 6f 75 72 63 65 54 79 70 65 3e 53 45 52 56 49 |sourceType>SERVI|\n000000c0 43 45 3c 2f 52 65 73 6f 75 72 63 65 54 79 70 65 |CE</ResourceType|\n000000d0 3e 3c 52 65 71 75 65 73 74 49 64 3e 44 51 48 51 |>DQHQ|\n000000e0 52 31 46 48 47 53 48 59 36 45 43 5a 3c 2f 52 65 |R1FHGSHY6ECZ</Re|\n000000f0 71 75 65 73 74 49 64 3e 3c 48 6f 73 74 49 64 3e |questId>|\n00000100 62 53 79 72 63 55 72 44 53 31 48 71 48 76 47 4d |bSyrcUrDS1HqHvGM|\n00000110 39 32 69 62 48 73 58 76 39 4a 4b 57 48 41 66 4d |92ibHsXv9JKWHAfM|\n00000120 41 31 33 67 7a 79 69 4e 76 69 61 38 6e 69 70 76 |A13gzyiNvia8nipv|\n00000130 50 4d 59 6c 74 5a 35 71 6c 69 2b 48 45 43 72 6b |PMYltZ5qli+HECrk|\n00000140 41 78 42 64 4e 69 2f 45 48 57 77 3d 3c 2f 48 6f |AxBdNi/EHWw=</Ho|\n00000150 73 74 49 64 3e 3c 2f 45 72 72 6f 72 3e |stId>|\n\ncaused by: unknown error response tag, {{ Error} []}","index-store":"boltdb-shipper-2023-11-01","level":"error","msg":"failed to upload table","table":"loki_index_backup_19782","ts":"2024-02-29T09:56:11.318328359Z"}