thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Querier: Unable to stack external queriers #5918

Open envyj020 opened 1 year ago

envyj020 commented 1 year ago

Versions used:

Prometheus: v2.38.0
Thanos: 0.28.0

Environment:

AWS EKS

Issue description:

I'm trying to stack external queriers behind a centralized querier that is meant to be our single entry point for observing other Kubernetes clusters. I have exposed the gRPC endpoint with a combination of external-dns and alb-ingress-controller; everything shows healthy from the outside, and I can even connect to the gRPC endpoint from outside the cluster:

$ grpcurl querier.dev.example.internal.com:443 list

grpc.health.v1.Health
grpc.reflection.v1alpha.ServerReflection
thanos.Exemplars
thanos.Metadata
thanos.Query
thanos.Rules
thanos.Store
thanos.Targets
thanos.info.Info
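
Listing services through reflection shows the TLS listener answers, but it does not exercise a unary RPC through the ALB; a minimal extra check, assuming grpcurl and the health/Info services shown in the listing above:

# Invoke the standard gRPC health check through the ALB
grpcurl querier.dev.example.internal.com:443 grpc.health.v1.Health/Check

# Fetch the querier's advertised labels and capabilities (assumes the Info method of thanos.info.Info)
grpcurl querier.dev.example.internal.com:443 thanos.info.Info/Info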

The following Ingress and service definition is used:

GRPC Ingress:

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/backend-protocol: HTTP
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/group.name: example-dev
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: subnet-xxx, subnet-yyy, subnet-zzz
    alb.ingress.kubernetes.io/target-type: instance
    external-dns.alpha.kubernetes.io/aws-zone-type-private: "true"
    external-dns.alpha.kubernetes.io/hostname: querier.dev.example.internal.com
    kubernetes.io/ingress.class: alb
  name: thanos-querier-grpc-ingress
spec:
  rules:
  - host: querier.dev.example.internal.com
    http:
      paths:
      - backend:
          service:
            name: thanos-query-svc
            port:
              number: 10901
        path: /
        pathType: Prefix

GRPC Service:

---
apiVersion: v1
kind: Service
metadata:
  labels:
    app.kubernetes.io/component: query
    app.kubernetes.io/instance: thanos
    app.kubernetes.io/name: thanos
  name: thanos-query-svc
spec:
  ports:
    - name: grpc
      port: 10901
      protocol: TCP
      targetPort: grpc
  selector:
    app.kubernetes.io/component: query
    app.kubernetes.io/instance: thanos
    app.kubernetes.io/name: thanos
  type: NodePort

Error logs:

$ thanos query --http-address "0.0.0.0:9090" --endpoint "dns+querier.dev.example.internal.com:443" --log.level=debug

level=info ts=2022-11-22T11:41:59.407630043Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-11-22T11:41:59.408583284Z caller=query.go:724 msg="starting query node"
level=info ts=2022-11-22T11:41:59.408932445Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-11-22T11:41:59.408995163Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2022-11-22T11:41:59.409619596Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-11-22T11:41:59.414617708Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-11-22T11:41:59.415370892Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2022-11-22T11:42:09.41321717Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.80.62:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.80.62:443
level=warn ts=2022-11-22T11:42:09.413217173Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.82.142:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.82.142:443
level=warn ts=2022-11-22T11:42:09.413258171Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.87.110:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.87.110:443
level=warn ts=2022-11-22T11:42:14.413571566Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.82.142:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.82.142:443
level=warn ts=2022-11-22T11:42:14.413843036Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.87.110:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.87.110:443
level=warn ts=2022-11-22T11:42:14.413971731Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from 10.31.80.62:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.31.80.62:443

But if I try to connect directly to the NodeIP:NodePort exposed, everything works:

$ thanos query --http-address "0.0.0.0:9090" --endpoint "10.31.82.37:32654" --log.level=debug

level=info ts=2022-11-22T12:24:52.920908263Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-11-22T12:24:52.924620011Z caller=query.go:724 msg="starting query node"
level=info ts=2022-11-22T12:24:52.926443089Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-11-22T12:24:52.928087156Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2022-11-22T12:24:52.930298428Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-11-22T12:24:52.929251491Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-11-22T12:24:52.930634844Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=info ts=2022-11-22T12:24:52.932686726Z caller=endpointset.go:381 component=endpointset msg="adding new query with [storeAPI rulesAPI exemplarsAPI targetsAPI MetricMetadataAPI QueryAPI]" address=10.31.82.37:32654 extLset="{cluster=\"example-dev\", purpose=\"kubernetes\", replica=\"prometheus-prometheus-kube-prometheus-0\"},{cluster=\"example-dev\", purpose=\"kubernetes\", replica=\"prometheus-prometheus-kube-prometheus-1\"}"
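
For comparison, it can help to see what the dns+ prefix actually resolves to in the failing case; a quick check from the host running the global querier (assuming dig and nc are available):

# For an internal ALB these resolve to the load balancer's private addresses,
# likely matching the 10.31.x.x IPs in the warnings above
dig +short querier.dev.example.internal.com

# Confirm raw TCP reachability of the gRPC port behind those addresses
nc -vz querier.dev.example.internal.com 443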

I'd really appreciate any ideas about what I'm missing here, thanks in advance!

Originally posted by @envyj020 in https://github.com/thanos-io/thanos/discussions/5916

fpetkovski commented 1 year ago

One guess is that since you're using the dns+ syntax, the load balancer is returning IPs from backend instances which are not exposed publicly. Looking at the IPs from your first snippet, they seem to be from a private network.

Maybe you can try to connect directly to the load balancer using --endpoint "querier.dev.example.internal.com:443". I am also not sure if Thanos supports TLS at the moment; this could be another issue with your setup.
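
For what it's worth, client-side TLS flags do appear later in this thread (--grpc-client-tls-secure, --grpc-client-tls-skip-verify); a sketch of dialing the TLS-terminating ALB with them, assuming its certificate is either trusted or verification is skipped:

# Sketch: global querier talking to a TLS-terminating ALB
thanos query \
  --http-address "0.0.0.0:9090" \
  --endpoint "querier.dev.example.internal.com:443" \
  --grpc-client-tls-secure \
  --grpc-client-tls-skip-verify \
  --log.level=debug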

envyj020 commented 1 year ago

Hi @fpetkovski, thanks for answering. Indeed, the IPs in the error logs belong to the underlying addresses behind an internal ALB, so I'm not reaching the querier directly but going through an ALB with SSL termination, which forwards the traffic to the corresponding data-plane node port.

I've tried your suggestion but without success:

$ thanos query --http-address "0.0.0.0:9090" --endpoint "querier.dev.example.internal.com:443" --log.level=debug
level=info ts=2022-11-24T15:22:32.867223926Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-11-24T15:22:32.868307099Z caller=query.go:724 msg="starting query node"
level=info ts=2022-11-24T15:22:32.868814593Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-11-24T15:22:32.86885173Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2022-11-24T15:22:32.869306299Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-11-24T15:22:32.869403135Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-11-24T15:22:32.869465856Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2022-11-24T15:22:42.873007272Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443
level=warn ts=2022-11-24T15:22:47.874007007Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443
level=warn ts=2022-11-24T15:22:52.875022961Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443

fpetkovski commented 1 year ago

Hm, so where does querier-gp-grpc.beta.internal.stuart come from? It seems to be different from querier.dev.example.internal.com:443, and it looks like the querier cannot connect to this domain.

envyj020 commented 1 year ago

That was my bad: I copy/pasted the logs without masking the real DNS consistently. I'll amend them to avoid any confusion 🙏

logamanig commented 1 year ago

This is what I did to resolve this issue.

In the observer cluster (thanos query values.yaml):

query:
  # Disabled all dnsDiscovery for the mgmt cluster, since only tls endpoints are used here:
  # this querier integrates with external queriers that support tls only, and thanos query
  # doesn't support a mixture of tls and non-tls endpoints
  dnsDiscovery:
    enabled: false
    sidecarsService: ""
    sidecarsNamespace: ""
  extraFlags:
    - --grpc-client-tls-secure
    - --grpc-client-tls-skip-verify
    # Add the MGMT cluster's Receive, Prometheus Thanos Sidecar and Store Gateway
    - --endpoint=grpc.store-gateway.thanos.observer.cluster.local:443
    - --endpoint=grpc.receive.thanos.observer.cluster.local:443
    - --endpoint=grpc.prom-sidecar.thanos.observer.cluster.local:443
    # Add each external cluster's query (ingress domain)
    - --endpoint=grpc.query.thanos.observee1.cluster.local:443
    - --endpoint=grpc.query.thanos.observee2.cluster.local:443
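
Before wiring these endpoints into the values file, each one can be smoke-tested individually; a minimal sketch, assuming grpcurl is available and the ingress certificates are not publicly trusted (hence -insecure):

# Verify every TLS gRPC endpoint answers health checks before adding it as an --endpoint
for ep in grpc.store-gateway.thanos.observer.cluster.local:443 \
          grpc.receive.thanos.observer.cluster.local:443 \
          grpc.query.thanos.observee1.cluster.local:443; do
  grpcurl -insecure "$ep" grpc.health.v1.Health/Check
done
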
logamanig commented 1 year ago

Please note that the bitnami chart has a gRPC ingress for all components except receive, so if you are using the receive component in the observer cluster, you need to create its ingress manually.

envyj020 commented 1 year ago

Hi @logamanig,

thanos query supports either all tls endpoints or all non-tls endpoints, not both

I didn't know that, TBH. I'll give it a try and get back here to share the outcome. Thanks!

logamanig commented 1 year ago

Hi @envyj020 , any luck?

envyj020 commented 1 year ago

Hi @logamanig, apologies for the delay in getting back here. Indeed, switching to TLS sorted out the not-so-verbose error posted above. In the end, to simplify the whole setup, I opted to use external-dns in the querier service definition to publish the cluster IP directly, so I can reach the queriers without going through an ALB, as long as your EKS cluster supports the VPC CNI plugin.

BUT I still don't understand why, in AWS terms, connecting through an ALB/NLB without SSL termination doesn't work either...
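
For reference, the approach described above boils down to letting external-dns publish the querier service address instead of an ALB; a rough sketch (the annotation target and hostname are assumptions, and external-dns must be configured to publish internal services, e.g. via --publish-internal-services):

# Publish the querier service directly in the private zone, no ALB in the path
kubectl annotate service thanos-query-svc \
  external-dns.alpha.kubernetes.io/hostname=querier.dev.example.internal.com

# The global querier can then dial plain gRPC on the native port again
thanos query --http-address "0.0.0.0:9090" \
  --endpoint "dns+querier.dev.example.internal.com:10901"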

s709t commented 1 year ago

Hey all, I was experiencing this same issue. Basically it's caused by the TLS configuration, as @logamanig said. You have two options here: you can manage the TLS connections between services using something like Consul, or you can set up, in the observer cluster, an ingress to the store gateway, ruler and main Prometheus Thanos sidecar, something like:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring
  name: observer-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    external-dns/zone: private
    external-dns.alpha.kubernetes.io/hostname: ruler-thanos.example.com, storegateway-thanos.example.com, thanos-sidecar-prometheus.example.com
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: (if required)
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/certificate-arn:
spec:
  ingressClassName: alb
  rules:

kartik-moolya commented 1 year ago

I was not able to follow the solution provided in this thread. I still have a similar problem: I'm deploying a querier to read the sidecar in a different cluster, which is exposed using an ingress (TLS). Below is the Helm values file for deploying the querier, but I end up getting the same error.

existingObjstoreSecret: thanos-object-store
query:
  enabled: true
  stores:
  - "thanos-sidecar.mypublicdomain.com:443"
  replicaCount: 2
queryFrontend:
  enabled: false

The certificate on thanos-sidecar.mypublicdomain.com:443 is valid but on the querier I still get the error

rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=thanos-sidecar.mypublicdomain.com:443 

As a fix I have also tried adding

  extraFlags:
  - --grpc-client-tls-secure
  - --grpc-client-tls-skip-verify

and the error changes to

rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: protocol error" address=thanos-...
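
A protocol error reset like this often means some hop in front of the sidecar is not forwarding gRPC (HTTP/2) end to end; a hedged way to narrow it down, assuming grpcurl and that the sidecar registers the standard health service shown earlier in this thread:

# Check the path through the ingress (TLS; certificate assumed valid, otherwise add -insecure)
grpcurl thanos-sidecar.mypublicdomain.com:443 grpc.health.v1.Health/Check

# Compare against the sidecar inside its own cluster to rule out the ingress hop
# (the in-cluster address is a placeholder)
grpcurl -plaintext <sidecar-service>:10901 grpc.health.v1.Health/Check
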
logamanig commented 1 year ago

I haven't tried stacking external stores, but stacking external queriers is still working fine for me.

Shaked commented 8 months ago

@kartik-moolya not sure if this is still relevant, but what I did was set up a remote querier with:

    query:
      extraFlags:
        - --grpc-client-tls-skip-verify
      grpc:
        client:
          tls:
            enabled: true
            existingSecret:
              name: thanos-client-secret
              keyMapping:
                ca-cert: ca.pem
                tls-cert: cert.pem
                tls-key: key.pem

I'm using an ExternalSecret to pull thanos-client-secret.

Then on the observee's side:

    query:
        stores:
          - dns+prom-prometheus-thanos:10901
        enabled: true
        ingress:
          enabled: true
          hostname: "thanos.myhostname.com"
          tls: true
          annotations:
            kubernetes.io/ingress.class: nginx
            nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
            ingress.kubernetes.io/ssl-redirect: "true"
            nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-ca-secret
            nginx.ingress.kubernetes.io/auth-tls-verify-client: "true"

Same as before, thanos-ca-secret is pulled by using an ExternalSecret object.

You can generate the secret with the following commands:

# generate client cert and key
openssl req -new -newkey rsa:4096 -nodes -keyout client.key -out client.csr
openssl x509 -req -sha256 -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 02 -out client.crt
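
The signing step above assumes an existing ca.crt/ca.key; a rough sketch of creating the CA first and packaging everything into the secret shapes referenced earlier (secret names, key names and the monitoring namespace are taken from the snippets above):

# Generate a self-signed CA used to sign the client certificate above
openssl req -x509 -new -nodes -newkey rsa:4096 -sha256 -days 365 \
  -keyout ca.key -out ca.crt -subj "/CN=thanos-ca"

# If not using ExternalSecrets: client material under the key names expected by keyMapping
kubectl -n monitoring create secret generic thanos-client-secret \
  --from-file=ca.pem=ca.crt \
  --from-file=cert.pem=client.crt \
  --from-file=key.pem=client.key

# CA-only secret consumed by the nginx auth-tls-secret annotation
kubectl -n monitoring create secret generic thanos-ca-secret \
  --from-file=ca.crt=ca.crt
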
dschaaff commented 2 days ago

I have the setup you describe working correctly: a global Thanos Query pointed at leaf queriers that are exposed through an ALB.

The main things to check on the ALB are the health check path and the accepted response codes.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/group.name: internal-shared
    alb.ingress.kubernetes.io/healthcheck-path: /grpc.health.v1.Health/Check
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 10901}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/ssl-redirect: "10901"
    alb.ingress.kubernetes.io/success-codes: "0"
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=10,load_balancing.algorithm.type=least_outstanding_requests
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.36.1
    kustomize.toolkit.fluxcd.io/name: thanos
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: thanos-query
  namespace: monitoring
spec:
  ingressClassName: aws
  rules:
  - host: thanos-query-internal.example.com
    http:
      paths:
      - backend:
          service:
            name: thanos-query
            port:
              number: 10901
        path: /
        pathType: Prefix
  tls:
  - hosts:
    - thanos-query-internal.example.com
    secretName: thanos-q-tls
status:
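
If targets behind such an ALB still flap, the health check settings above are the usual suspects; a hedged way to verify them from outside the cluster (the target group ARN is a placeholder):

# Confirm the ALB target group reports the querier targets as healthy
aws elbv2 describe-target-health --target-group-arn <target-group-arn>

# Exercise the same gRPC health RPC the ALB health check path points at
grpcurl thanos-query-internal.example.com:10901 grpc.health.v1.Health/Check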