grafana / helm-charts

Apache License 2.0

[loki-simple-scalable] s3/IRSA authentication issues upgrading from 0.4.0 #1550

Closed cebidhem closed 2 years ago

cebidhem commented 2 years ago

Hey community,

I've tried to upgrade loki-simple-scalable from 0.4.0 to 1.4.3, and we are running into a lot of issues, some resolved and some not. I looked for similar issues on GitHub and found a few, but even after following some of the proposed resolutions, it's still not running properly. I also looked in the documentation (README and online docs), but we've known from the beginning that it's not up to date.

We are using a fairly basic setup: S3 as the data backend and IRSA for authentication. I tried a few configurations. Initially the bucket name was not set properly. Then, without setting accessKeyId and secretAccessKey (as in 0.4.0), it fails because of the chart's default values, and setting them to null gives me an unmarshal error.

0.4.0 values:

    fullnameOverride: loki
    read:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
    write:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
    loki:
      commonConfig:
        storage:
          filesystem: null
          s3:
            s3: s3://eu-west-1
            bucketnames: my-company-loki-objstore
      schemaConfig:
        configs:
          - from: "2020-10-24"
            store: boltdb-shipper
            object_store: s3
            schema: v11
            index:
              prefix: loki_index_
              period: 24h
      storageConfig:
        boltdb_shipper:
          shared_store: s3
    gateway:
      ingress:
        enabled: true
        ingressClassName: ingress-nginx
        annotations:
          kubernetes.io/tls-acme: "true"
          cert-manager.io/cluster-issuer: letsencrypt
        hosts:
          - host: loki.mydomain.com
            paths:
              - path: /
                pathType: Prefix
        tls:
          - secretName: loki-gateway-tls
            hosts:
              - loki.mydomain.com
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted_aws_account_id:role/loki-irsa-role
    serviceMonitor:
      enabled: true

This works perfectly fine.

1.4.3 values:

    fullnameOverride: loki
    gateway:
      ingress:
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt
          kubernetes.io/tls-acme: "true"
        enabled: true
        hosts:
        - host: loki.mydomain.com
          paths:
          - path: /
            pathType: Prefix
        ingressClassName: ingress-nginx-global
        tls:
        - hosts:
          - loki.mydomain.com
          secretName: loki-gateway-tls
    loki:
      commonConfig:
        storage:
          filesystem: null
          s3:
            bucketnames: my-company-loki-objstore
            s3: s3://eu-west-1
      schemaConfig:
        configs:
        - from: "2020-10-24"
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v11
          store: boltdb-shipper
        - from: "2022-07-01"
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v12
          store: boltdb-shipper
      storage:
        bucketNames:
          chunks: my-company-loki-objstore
        s3:
          endpoint: s3.eu-west-1.amazonaws.com
          region: eu-west-1
          s3: s3://eu-west-1
          secretAccessKey: null
          accessKeyId: null
        type: s3
      storageConfig:
        boltdb_shipper:
          shared_store: s3
    nameOverride: loki
    read:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
      replicas: 1
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted_aws_account_id:role/loki-irsa-role
    serviceMonitor:
      enabled: true
    write:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
      replicas: 1

These values give me the following error:

level=error ts=2022-06-30T08:36:17.5642286Z caller=flush.go:222 org_id=fake msg="failed to flush user" err="WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version=\"1|\n00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0\" encoding=\"UT|\n00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8\"?>.<Error><C|\n00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|\n00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|\n00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|\n00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |\n00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |\n00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|\n00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|\n000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|\n000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|\n000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|\n000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 4a 31 44 46  |><RequestId>J1DF|\n000000e0  35 33 47 48 54 56 43 44  30 33 59 41 3c 2f 52 65  |53GHTVCD03YA</Re|\n000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|\n00000100  30 6c 66 79 75 37 47 6b  76 58 65 2b 30 38 57 62  |0lfyu7GkvXe+08Wb|\n00000110  4d 4e 32 2f 64 74 79 30  7a 74 54 70 6d 48 4c 41  |MN2/dty0ztTpmHLA|\n00000120  38 4f 6e 41 4d 48 43 6a  6b 63 71 59 74 55 35 55  |8OnAMHCjkcqYtU5U|\n00000130  36 47 78 63 74 42 6e 6d  31 70 50 55 2b 64 46 34  |6GxctBnm1pPU+dF4|\n00000140  6a 43 44 6c 57 2f 35 39  46 62 51 3d 3c 2f 48 6f  
|jCDlW/59FbQ=</Ho|\n00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|\n\ncaused by: unknown error response tag, {{ Error} []}"

And if I do not specify null for secretAccessKey and accessKeyId, the chart's default values are used, which gives me an obvious 403.

Has anyone managed to deploy 1.4.3 using IRSA and S3 buckets? If so, could you please help us by pasting your configuration?

I'm keen to submit a PR to improve the documentation on this, either in this repo or in the Loki documentation, once I have it working.

Thanks to those reading this.

cebidhem commented 2 years ago

I've also tried to use loki.structuredConfig to provide a plain Loki config.yaml, but the mergeOverwrite function in the ConfigMap gives me the same end result. This makes me think the function shouldn't even exist. Either we manage to build the config from the loki.config values, or the user should be able to supply the full config they want. Does that make sense?

Still trying to find a way, but it's kind of frustrating since there is not even a Loki upgrade here, it's really about the Helm chart itself.

trevorwhitney commented 2 years ago

It sounds like the solution here is to set the defaults for secretAccessKey and accessKeyId to null.

trevorwhitney commented 2 years ago

Would you be able to try out this branch: https://github.com/trevorwhitney/helm-charts/tree/null-s3-defaults

cebidhem commented 2 years ago

Hi @trevorwhitney thanks for replying!

Sorry it took me some time, but we deploy our Helm charts through Flux using Helm repositories exclusively, so I had to package it on my side. Anyway, with your branch and my values (pasted at the bottom), Loki is throwing these authentication errors.

loki-write-0:

level=error ts=2022-07-06T09:20:44.026140228Z caller=flush.go:222 org_id=fake msg="failed to flush user" err="WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version=\"1|\n00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0\" encoding=\"UT|\n00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8\"?>.<Error><C|\n00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|\n00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|\n00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|\n00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |\n00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |\n00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|\n00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|\n000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|\n000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|\n000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|\n000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 48 59 43 43  |><RequestId>HYCC|\n000000e0  30 54 46 59 4b 4a 59 53  39 33 53 50 3c 2f 52 65  |0TFYKJYS93SP</Re|\n000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|\n00000100  57 63 59 46 4e 78 77 4c  66 64 46 48 38 6c 51 62  |WcYFNxwLfdFH8lQb|\n00000110  42 53 64 7a 73 32 44 76  61 64 66 74 50 52 6b 71  |BSdzs2DvadftPRkq|\n00000120  36 71 51 61 53 68 4f 4f  54 62 52 56 36 78 4f 62  |6qQaShOOTbRV6xOb|\n00000130  66 74 47 54 74 38 4a 39  64 47 59 64 30 43 4e 4b  |ftGTt8J9dGYd0CNK|\n00000140  44 6c 42 6f 36 38 58 56  41 4f 6b 3d 3c 2f 48 6f  
|DlBo68XVAOk=</Ho|\n00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|\n\ncaused by: unknown error response tag, {{ Error} []}"

loki-read-0:

level=error ts=2022-07-06T09:29:49.960078199Z caller=ruler.go:493 msg="unable to list rules" err="WebIdentityErr: failed to retrieve credentials\ncaused by: SerializationError: failed to unmarshal error message\n\tstatus code: 405, request id: \ncaused by: UnmarshalError: failed to unmarshal error message\n\t00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 22 31  |<?xml version=\"1|\n00000010  2e 30 22 20 65 6e 63 6f  64 69 6e 67 3d 22 55 54  |.0\" encoding=\"UT|\n00000020  46 2d 38 22 3f 3e 0a 3c  45 72 72 6f 72 3e 3c 43  |F-8\"?>.<Error><C|\n00000030  6f 64 65 3e 4d 65 74 68  6f 64 4e 6f 74 41 6c 6c  |ode>MethodNotAll|\n00000040  6f 77 65 64 3c 2f 43 6f  64 65 3e 3c 4d 65 73 73  |owed</Code><Mess|\n00000050  61 67 65 3e 54 68 65 20  73 70 65 63 69 66 69 65  |age>The specifie|\n00000060  64 20 6d 65 74 68 6f 64  20 69 73 20 6e 6f 74 20  |d method is not |\n00000070  61 6c 6c 6f 77 65 64 20  61 67 61 69 6e 73 74 20  |allowed against |\n00000080  74 68 69 73 20 72 65 73  6f 75 72 63 65 2e 3c 2f  |this resource.</|\n00000090  4d 65 73 73 61 67 65 3e  3c 4d 65 74 68 6f 64 3e  |Message><Method>|\n000000a0  50 4f 53 54 3c 2f 4d 65  74 68 6f 64 3e 3c 52 65  |POST</Method><Re|\n000000b0  73 6f 75 72 63 65 54 79  70 65 3e 53 45 52 56 49  |sourceType>SERVI|\n000000c0  43 45 3c 2f 52 65 73 6f  75 72 63 65 54 79 70 65  |CE</ResourceType|\n000000d0  3e 3c 52 65 71 75 65 73  74 49 64 3e 32 47 39 37  |><RequestId>2G97|\n000000e0  30 36 41 38 54 53 48 52  35 54 38 52 3c 2f 52 65  |06A8TSHR5T8R</Re|\n000000f0  71 75 65 73 74 49 64 3e  3c 48 6f 73 74 49 64 3e  |questId><HostId>|\n00000100  36 56 6a 4f 36 62 32 47  78 79 47 65 4a 6b 46 71  |6VjO6b2GxyGeJkFq|\n00000110  6d 35 63 44 51 45 4f 4b  48 73 72 34 58 54 35 68  |m5cDQEOKHsr4XT5h|\n00000120  58 72 39 51 46 69 71 47  6a 62 66 77 42 6e 33 2b  |Xr9QFiqGjbfwBn3+|\n00000130  71 74 78 4d 50 43 6b 71  32 2f 62 54 4f 72 63 56  |qtxMPCkq2/bTOrcV|\n00000140  33 45 69 30 53 74 76 75  37 30 63 3d 3c 2f 48 6f  
|3Ei0Stvu70c=</Ho|\n00000150  73 74 49 64 3e 3c 2f 45  72 72 6f 72 3e           |stId></Error>|\n\ncaused by: unknown error response tag, {{ Error} []}"

values.yaml:

    fullnameOverride: loki
    gateway:
      ingress:
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt
          kubernetes.io/tls-acme: "true"
        enabled: true
        hosts:
        - host: loki.mydomain.com
          paths:
          - path: /
            pathType: Prefix
        ingressClassName: ingress-nginx-global
        tls:
        - hosts:
          - loki.mydomain.com
          secretName: loki-gateway-tls
    loki:
      auth_enabled: false
      schemaConfig:
        configs:
        - from: "2020-10-24"
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v11
          store: boltdb-shipper
        - from: "2022-07-10"
          index:
            period: 24h
            prefix: loki_index_
          object_store: s3
          schema: v12
          store: boltdb-shipper
      storage:
        bucketNames:
          chunks: my-company-loki-objstore
          ruler: my-company-loki-objstore
        s3:
          endpoint: https://s3.eu-west-1.amazonaws.com
          insecure: false
          region: eu-west-1
          s3: s3://eu-west-1
          s3ForcePathStyle: false
      storageConfig:
        boltdb_shipper:
          shared_store: s3
    monitoring:
      selfMonitoring:
        enabled: false
        grafanaAgent:
          installOperator: false
      serviceMonitor:
        enabled: true
    read:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
      replicas: 2
    serviceAccount:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::redacted_aws_account_id:role/loki-irsa-role
    write:
      persistence:
        size: 5Gi
        storageClass: ebs-sc
      replicas: 2

I'm also adding the ConfigMap created by the chart:

kind: ConfigMap
apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    common:
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          access_key_id: null
          bucketnames: my-company-loki-objstore
          endpoint: https://s3.eu-west-1.amazonaws.com
          insecure: false
          region: eu-west-1
          s3: s3://eu-west-1
          s3forcepathstyle: false
          secret_access_key: null
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    ruler:
      storage:
        s3:
          bucketnames: my-company-loki-objstore
    schema_config:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
      - from: "2022-07-10"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100

cebidhem commented 2 years ago

@trevorwhitney Actually, it just popped into my mind that since the Loki version hasn't changed, I should diff the 1.4.3+ and my production (0.4.0) ConfigMaps. Doing this, I've been able to narrow down my issue.

I did some trial and error, and your MR definitely helps. Defaulting secretAccessKey and accessKeyId to null makes things better. However, I noticed that any time I set an endpoint in the config (regional or not, with or without the https://), I get the same errors as posted in my previous comment. As soon as I remove the endpoint property, everything works as expected. My last test was to set endpoint: null, and this also works as expected.
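
In chart-values terms, the combination that worked in these tests can be sketched roughly as follows. This is a minimal sketch, not the chart's documented configuration: the key names are the ones used in the values earlier in this thread, the bucket name and region are the placeholders from above, and whether the explicit nulls are still needed depends on the defaults of the chart version in use.

```yaml
loki:
  storage:
    type: s3
    bucketNames:
      chunks: my-company-loki-objstore
      ruler: my-company-loki-objstore
    s3:
      region: eu-west-1
      s3: s3://eu-west-1
      # Leave endpoint unset (or null); in the tests above, any explicit
      # endpoint brought back the WebIdentityErr / 405 errors.
      endpoint: null
      # Null out the static credentials so that IRSA is used instead of
      # the chart's default key values.
      accessKeyId: null
      secretAccessKey: null
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::redacted_aws_account_id:role/loki-irsa-role
```

A plausible explanation, not confirmed in this thread, is that the configured endpoint is also used for the web-identity credentials request, so that request lands on S3 and is rejected. That would match the error body in the logs above, which decodes to MethodNotAllowed for a POST against a SERVICE resource.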

Working ConfigMap:

kind: ConfigMap
apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    common:
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          bucketnames: my-company-loki-objstore
          s3: s3://eu-west-1
          region: eu-west-1
          access_key_id: null
          secret_access_key: null
          insecure: false
          endpoint: null
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    ruler:
      storage:
        s3:
          bucketnames: my-company-loki-objstore
    schema_config:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
      - from: "2022-07-10"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100

I would propose defaulting endpoint to null, wdyt? I can also open a PR for this if you'd prefer.

trevorwhitney commented 2 years ago

@cebidhem thanks for testing this out. That sounds reasonable to me. I've updated my PR to default endpoint to null, and also to only include non-null s3 properties in the final config. Mind trying out that branch again with the latest changes?

cebidhem commented 2 years ago

Hi @trevorwhitney it works perfectly for me!

Here's the ConfigMap generated:

kind: ConfigMap
apiVersion: v1
data:
  config.yaml: |
    auth_enabled: false
    common:
      path_prefix: /var/loki
      replication_factor: 3
      storage:
        s3:
          bucketnames: my-company-loki-objstore
          insecure: false
          region: eu-west-1
          s3: s3://eu-west-1
          s3forcepathstyle: false
    limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      split_queries_by_interval: 15m
    memberlist:
      join_members:
      - loki-memberlist
    ruler:
      storage:
        s3:
          bucketnames: my-company-loki-objstore
    schema_config:
      configs:
      - from: "2020-10-24"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v11
        store: boltdb-shipper
      - from: "2022-07-10"
        index:
          period: 24h
          prefix: loki_index_
        object_store: s3
        schema: v12
        store: boltdb-shipper
    server:
      grpc_listen_port: 9095
      http_listen_port: 3100

Thanks a lot!

trevorwhitney commented 2 years ago

Glad to hear it!

cebidhem commented 2 years ago

@trevorwhitney Do you think we could have your changes published in an upcoming 1.7.1 fix version?

Vad1mo commented 2 years ago

I created our setup based on @cebidhem's recommendation, also for AWS S3/IRSA. However, I see another error.

level=error ts=2022-07-11T11:25:21.299030878Z caller=ruler.go:493 msg="unable to list rules" err="InvalidParameter: 1 validation error(s) found.\n- minimum field size of 1, ListObjectsV2Input.Bucket.\n"
level=error ts=2022-07-11T11:26:21.599707269Z caller=flush.go:222 org_id=fake msg="failed to flush user" err="InvalidParameter: 1 validation error(s) found.\n- minimum field size of 1, PutObjectInput.Bucket.\n"

We are using the latest Helm chart 2.13.1 and Loki v2.6.0.

Our S3 config looks like this. I think that should be enough, because the AWS CLI, which is installed alongside, is not complaining about it.

            common:
              storage:
                s3:
                  s3: s3://{s3.bucket_name}

cebidhem commented 2 years ago

Hi @Vad1mo ,

Are those values or the rendered ConfigMap ?

Vad1mo commented 2 years ago

Yes, we are using the loki-stack chart, and those values are in the secret and mapped into the container as loki.yaml.

That's the whole file:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
common:
  storage:
    s3:
      s3: s3://eks-cluster-core-services-stack-lokilogs1d26bb6a-6cfhcnqlbutz
      s3forcepathstyle: false
compactor:
  shared_store: s3
  working_directory: /data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  lifecycler:
    ring:
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  max_entries_limit_per_query: 5000
  reject_old_samples: true
  reject_old_samples_max_age: 168h
memberlist:
  join_members:
  - loki-memberlist
ruler:
  storage:
    s3:
      s3: s3://eks-cluster-core-services-stack-lokilogs1d26bb6a-6cfhcnqlbutz
      s3forcepathstyle: false
schema_config:
  configs:
  - from: "2022-06-06"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v12
    store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: s3
  filesystem:
    directory: /data/loki/chunks
table_manager:
  retention_deletes_enabled: true
  retention_period: 90d

Vad1mo commented 2 years ago

Works like a charm now.

Describing the bucket with a region did the trick. aws-cli worked because the env var AWS_DEFAULT_REGION was set.

    s3: s3://region/bucket_name
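
For reference, here is a sketch of the two forms that appear in this thread for pointing Loki at a bucket without static credentials. The region and bucket name are placeholders. The reading that a URL with a single component is treated as a region with an empty bucket is an assumption: it is consistent with the "minimum field size of 1, ...Bucket" errors above, but it was not verified against Loki's source here.

```yaml
common:
  storage:
    s3:
      # Form 1: region as the URL host, bucket as the path.
      s3: s3://eu-west-1/my-bucket-name
      s3forcepathstyle: false

# Form 2 (as in the rendered ConfigMap earlier in the thread):
# region-only URL plus an explicit bucket name.
# common:
#   storage:
#     s3:
#       s3: s3://eu-west-1
#       region: eu-west-1
#       bucketnames: my-bucket-name
```

The same choice applies to the ruler.storage.s3 section, which in the file above repeats the bucket-only URL.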