thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Trying to add observee cluster, but observer thanos query cannot discover external thanos-discovery sidecar #6958

Open nessa829 opened 7 months ago

nessa829 commented 7 months ago

Thanos, Prometheus and Golang version used: docker.io/bitnami/thanos:0.31.0-scratch-r8

Object Storage Provider: Amazon S3

What happened: I am trying to add a Thanos sidecar from another EKS cluster (Cluster B) to the Thanos query store (Cluster A).

In Cluster A, I used the Helm chart (kube-prometheus-stack:47.3.0) and exposed the Thanos sidecar through an ALB ingress (AWS Load Balancer Controller).

  # Ingress exposes thanos sidecar outside the cluster
  thanosIngress:
    enabled: true
    ingressClassName: alb

    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      alb.ingress.kubernetes.io/load-balancer-name: alpha-prometheus-alb-ingress
      alb.ingress.kubernetes.io/backend-protocol: HTTP
      alb.ingress.kubernetes.io/backend-protocol-version: GRPC
      alb.ingress.kubernetes.io/group.name: prometheus-alpha
      alb.ingress.kubernetes.io/target-type: 'ip'
      alb.ingress.kubernetes.io/scheme: internet-facing
      alb.ingress.kubernetes.io/security-groups: sg-xxxxxxxxx
      alb.ingress.kubernetes.io/manage-backend-security-group-rules: "true"
      alb.ingress.kubernetes.io/subnets: subnet-xxxxxxxx, subnet-xxxxxxxx
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
      alb.ingress.kubernetes.io/healthcheck-path: /-/healthy
      alb.ingress.kubernetes.io/certificate-arn: <ACM ARN>
    labels: {}
    # servicePort: 10901

    ## Port to expose on each node
    ## Only used if service.type is 'NodePort'
    ##
    nodePort: 30901

    ## Hosts must be provided if Ingress is enabled.
    ##
    hosts:
      - thanos-sc-alpha.alpha.example.in
    # - thanos-gateway.domain.com

    ## Paths to use for ingress rules
    ##
    paths:
      - /*
    # - /

    ## For Kubernetes >= 1.18 you should specify the pathType (determines how Ingress paths should be matched)
    ## See https://kubernetes.io/blog/2020/04/02/improvements-to-the-ingress-api-in-kubernetes-1.18/#better-path-matching-with-path-types
    pathType: ImplementationSpecific

    ## TLS configuration for Thanos Ingress
    ## Secret must be manually created in the namespace
    ##
    tls:
      - secretName: thanos-gateway-tls
        hosts:
          - thanos-sc-alpha.alpha.example.in
    #

After the installation, I was able to reach the gRPC endpoint with grpcurl.

$ grpcurl thanos-sc-alpha.alpha.example.in:443 list
grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info

$ grpcurl thanos-sc-alpha.alpha.example.in:443 list grpc.health.v1.Health
grpc.health.v1.Health.Check
grpc.health.v1.Health.Watch

$ grpcurl thanos-sc-alpha.alpha.example.in:443 grpc.health.v1.Health.Check
{
  "status": "SERVING"
}

However, my thanos-query in Cluster B cannot discover the sidecar.

query:
  replicaCount: 1
  extraFlags: []
  stores:
    - prometheus-inhouse-kube-pr-thanos-discovery:10901  # local thanos sidecar (in Cluster A): works
    - dns+thanos-sc-alpha.alpha.example.in:443  # external thanos sidecar: not discovered

thanos-query logs:

level=info ts=2023-12-04T10:33:56.058563173Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2023-12-04T10:33:56.058957254Z caller=query.go:840 msg="starting query node"
level=info ts=2023-12-04T10:33:56.059286219Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2023-12-04T10:33:56.059301679Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:10902
level=info ts=2023-12-04T10:33:56.059411941Z caller=tls_config.go:232 service=http/server component=query msg="Listening on" address=[::]:10902
level=info ts=2023-12-04T10:33:56.059442163Z caller=tls_config.go:235 service=http/server component=query msg="TLS is disabled." http2=false address=[::]:10902
level=info ts=2023-12-04T10:33:56.059466435Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2023-12-04T10:33:56.059483879Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2023-12-04T10:34:06.064023703Z caller=endpointset.go:451 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from x.x.x.x:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=x.x.x.x:443
level=warn ts=2023-12-04T10:34:06.064142552Z caller=endpointset.go:451 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from x.x.x.x:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=3.36.218.184:443
......

FYI, the ALB security group is open to Thanos query as well as to my local laptop.

What you expected to happen: Thanos query should be able to discover the external sidecar, which is exposed over gRPC by an AWS ALB.

How to reproduce it (as minimally and precisely as possible):

as above.

Full logs to relevant components: as above.

Anything else we need to know:

nessa829 commented 7 months ago

For more information, thanos query's args:

      args:
        - query
        - '--log.level=info'
        - '--log.format=logfmt'
        - '--grpc-address=0.0.0.0:10901'
        - '--http-address=0.0.0.0:10902'
        - '--query.replica-label=replica'
        - >-
          --endpoint=dnssrv+_grpc._tcp.thanos-query-xxxx-storegateway.monitoring.svc.cluster.local
        - >-
          --endpoint=dnssrv+_grpc._tcp.thanos-query-xxxx-ruler.monitoring.svc.cluster.local
        - '--endpoint=prometheus-inhouse-kube-pr-thanos-discovery:10901'
        - '--endpoint=dns+thanos-sc-alpha.alpha.example.in:443'
        - '--grpc-client-server-name=thanos-sc-alpha.alpha.example.in'

I have tried setting grpc.server.tls.enable: true, grpc.client.tls.enable: true, and both together, but none of these worked.

I have also gone through similar issues and tried their suggestions (e.g. --grpc-client-tls-secure), but nothing worked either.
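
For reference, a sketch of how the query args above would look with client-side TLS switched on. The ALB listener in this setup is HTTPS on 443, so the query's gRPC client has to speak TLS to it. This assumes the ALB certificate chains to a CA that the query container trusts by default. As noted, this combination did not resolve the problem here, so treat it as a baseline rather than a fix.

```yaml
      args:
        - query
        - '--endpoint=dns+thanos-sc-alpha.alpha.example.in:443'
        # Use TLS for outgoing gRPC connections to endpoints.
        - '--grpc-client-tls-secure'
        # Server name to verify against the certificate presented by the ALB.
        - '--grpc-client-server-name=thanos-sc-alpha.alpha.example.in'
```

Note that these client flags apply to every --endpoint, so in-cluster endpoints served without TLS would be affected as well.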

KM3dd commented 3 months ago

Hello @nessa829, were you able to fix that?

nessa829 commented 3 months ago

@KM3dd Hi, I switched from the ALB to an NLB (service type: LoadBalancer), and it worked.

KM3dd commented 3 months ago

@nessa829 thank you for your response. That's what I am trying to do, but I am new to this and got stuck. Do you mean you kept using nginx with the service type set to LoadBalancer, or did you expose the service directly and use its external IP address? Thank you again.

nessa829 commented 3 months ago

@KM3dd I disabled thanosIngress and enabled thanosServiceExternal instead

                thanosServiceExternal:
                  annotations:
                    service.beta.kubernetes.io/aws-load-balancer-name: "thanos-sc-lb"
                    service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"
                    service.beta.kubernetes.io/aws-load-balancer-type: "external"
                    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: "ip"
                    service.beta.kubernetes.io/aws-load-balancer-subnets: {{prometheus.subnet}}
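
For reference, a hypothetical sketch of the matching query-side values, mirroring the `query.stores` list shown earlier in this thread. The NLB hostname is a placeholder, 10901 is assumed to be the gRPC port exposed by thanosServiceExternal, and the NLB is assumed to be a plain TCP pass-through, so no client TLS flags are set.

```yaml
query:
  stores:
    # Local sidecar, reached through the in-cluster discovery Service.
    - prometheus-inhouse-kube-pr-thanos-discovery:10901
    # Remote sidecar behind the NLB (placeholder hostname, assumed port).
    - dns+<nlb-dns-name>:10901
```

With the NLB scheme set to "internal" as above, the query cluster also needs private network connectivity to the NLB's subnets.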

KM3dd commented 3 months ago

@nessa829 thank you very much for your help

danielstankw commented 1 week ago

@nessa829 would you mind writing a brief description of how you solved the issue?

  1. I understand that you used LoadBalancer instead of Ingress?
  2. Also, when do you use thanosService vs thanosServiceExternal? If Cluster A is the "master" cluster, wouldn't it be appropriate to define thanosService there and set up thanosServiceExternal only on the observer (slave) clusters?
  3. I am especially curious about your thanos query args.