envyj020 opened this issue 1 year ago
One guess is that since you're using the `dns+` syntax, the load balancer is returning IPs of backend instances which are not exposed publicly. The IPs in your first snippet look like they come from a private network.
Maybe you can try connecting directly to the load balancer using `--endpoint "querier.dev.example.internal.com:443"`. I am also not sure if Thanos supports TLS at the moment; this could be another issue with your setup.
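For a quick sanity check, you could also probe the gRPC listener directly with grpc-health-probe (assuming you have the tool installed; this is just a suggestion, not something required by Thanos):

```console
# Probe the standard gRPC health endpoint through the TLS-terminating ALB;
# -tls-no-verify skips certificate verification for a quick test.
grpc_health_probe -addr=querier.dev.example.internal.com:443 -tls -tls-no-verify

# Plaintext variant, for talking to the querier's gRPC port directly.
grpc_health_probe -addr=<node-ip>:10901
```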
Hi @fpetkovski, thanks for answering. Indeed, the IPs I see in the error logs belong to the instances behind an internal ALB, so I'm not reaching the querier directly but going through an ALB with SSL termination, which forwards the traffic to the corresponding data-plane NodePort.
I've tried your suggestion but without success:
```console
$ thanos query --http-address "0.0.0.0:9090" --endpoint "querier.dev.example.internal.com:443" --log.level=debug
level=info ts=2022-11-24T15:22:32.867223926Z caller=options.go:26 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
level=info ts=2022-11-24T15:22:32.868307099Z caller=query.go:724 msg="starting query node"
level=info ts=2022-11-24T15:22:32.868814593Z caller=intrumentation.go:75 msg="changing probe status" status=healthy
level=info ts=2022-11-24T15:22:32.86885173Z caller=http.go:73 service=http/server component=query msg="listening for requests and metrics" address=0.0.0.0:9090
level=info ts=2022-11-24T15:22:32.869306299Z caller=tls_config.go:195 service=http/server component=query msg="TLS is disabled." http2=false
level=info ts=2022-11-24T15:22:32.869403135Z caller=intrumentation.go:56 msg="changing probe status" status=ready
level=info ts=2022-11-24T15:22:32.869465856Z caller=grpc.go:131 service=gRPC/server component=query msg="listening for serving gRPC" address=0.0.0.0:10901
level=warn ts=2022-11-24T15:22:42.873007272Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443
level=warn ts=2022-11-24T15:22:47.874007007Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443
level=warn ts=2022-11-24T15:22:52.875022961Z caller=endpointset.go:416 component=endpointset msg="update of endpoint failed" err="getting metadata: fallback fetching info from querier.dev.example.internal.com:443: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=querier.dev.example.internal.com:443
```
Hm, so where does `querier-gp-grpc.beta.internal.stuart` come from? It seems to be different from `querier.dev.example.internal.com:443`, and it looks like the querier cannot connect to this domain.
That was my bad: I copy/pasted the logs without properly masking the real DNS names. I'll amend them to avoid any confusion 🙏
This is what I did to resolve this issue, in the observer cluster:
```yaml
query:
  # To integrate with an external query that supports TLS only;
  # Thanos Query doesn't support a mixture of TLS and non-TLS endpoints.
  dnsDiscovery:
    enabled: false
    sidecarsService: ""
    sidecarsNamespace: ""
  extraFlags:
    - --grpc-client-tls-secure
    - --grpc-client-tls-skip-verify
    # Add MGMT cluster's Receive, Prometheus Thanos Sidecar and Store Gateway
    - --endpoint=grpc.store-gateway.thanos.observer.cluster.local:443
    - --endpoint=grpc.receive.thanos.observer.cluster.local:443
    - --endpoint=grpc.prom-sidecar.thanos.observer.cluster.local:443
    # Add each external cluster's query (ingress domains)
    - --endpoint=grpc.query.thanos.observee1.cluster.local:443
    - --endpoint=grpc.query.thanos.observee2.cluster.local:443
```
Please note that the Bitnami chart has a gRPC ingress for all components except Receive, so if you are using the Receive component in the observer cluster, you need to create that ingress manually (see the sketch below), since it's not part of the Bitnami chart.
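For reference, a rough sketch of what that manually created gRPC ingress for Receive could look like, reusing the hostname from the values above (the namespace and Service name are placeholders, not Bitnami chart defaults):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: thanos-receive-grpc
  namespace: monitoring          # placeholder namespace
  annotations:
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    # alb.ingress.kubernetes.io/certificate-arn: <your ACM cert ARN, if not auto-discovered>
spec:
  ingressClassName: alb
  rules:
    - host: grpc.receive.thanos.observer.cluster.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-receive   # placeholder Service name
                port:
                  number: 10901        # Receive's gRPC port
```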
Hi @logamanig,
Thanos Query supports either all TLS endpoints or all non-TLS endpoints, not a mix of both.
I didn't know that, TBH. I'll give it a try and report back here. Thanks!
Hi @envyj020 , any luck?
Hi @logamanig, apologies for the delay in getting back here. Indeed, switching to TLS sorted out the not-so-verbose error I posted. In the end, to simplify the whole setup, I opted to use external-dns in the querier Service definition to publish the querier IPs directly (roughly as sketched below), so I can reach them without going through an ALB; this works as long as your EKS cluster supports the VPC CNI plugin.
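A sketch of one way to do this, assuming a headless Service so that external-dns publishes routable pod IPs (hostname masked like the rest of the thread, selector is a placeholder):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: thanos-query-grpc
  annotations:
    # external-dns creates a record that resolves straight to the pod IPs
    external-dns.alpha.kubernetes.io/hostname: querier.dev.example.internal.com
spec:
  clusterIP: None                # headless: DNS resolves to pod IPs
  selector:
    app.kubernetes.io/name: thanos-query   # placeholder selector
  ports:
    - name: grpc
      port: 10901
      targetPort: 10901
```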
BUT I still don't understand why, in AWS lingo, connecting through an ALB/NLB without SSL termination doesn't work either...
Hey all, I was experiencing this same issue. Basically it's caused by the TLS configuration, as @logamanig said. You have two options here: manage the TLS connections between services using something like Consul, or set up, in the observer cluster, an ingress to the Store Gateway, Ruler and the main Prometheus Thanos sidecar, something like:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  namespace: monitoring
  name: observer-ingress
  annotations:
    external-dns/zone: private
    external-dns.alpha.kubernetes.io/hostname: ruler-thanos.example.com, storegateway-thanos.example.com, thanos-sidecar-prometheus.example.com
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/subnets: # (if required)
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/certificate-arn: # <your ACM certificate ARN>
spec:
  ingressClassName: alb
  rules:
```
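The rules section above got cut off; assuming one rule per hostname pointing at the matching component's gRPC port (Service names here are placeholders), it would continue roughly like this:

```yaml
  rules:
    - host: storegateway-thanos.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: thanos-storegateway   # placeholder Service name
                port:
                  number: 10901             # gRPC port
    # ...and analogous rules for ruler-thanos.example.com and
    # thanos-sidecar-prometheus.example.com
```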
I was not able to follow the solution provided in this thread. I still have a similar problem: I'm deploying a querier to read from a sidecar in a different cluster, which is exposed using an ingress (TLS). Below is the Helm values file for deploying the querier, but I end up getting the same error.
```yaml
existingObjstoreSecret: thanos-object-store
query:
  enabled: true
  stores:
    - "thanos-sidecar.mypublicdomain.com:443"
  replicaCount: 2
queryFrontend:
  enabled: false
```
The certificate on `thanos-sidecar.mypublicdomain.com:443` is valid, but on the querier I still get the error:

```
rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=thanos-sidecar.mypublicdomain.com:443
```
As a fix I have also tried adding

```yaml
extraFlags:
  - --grpc-client-tls-secure
  - --grpc-client-tls-skip-verify
```

and the error changes to:

```
rpc error: code = Unavailable desc = upstream connect error or disconnect/reset before headers. reset reason: protocol error" address=thanos-...
```
I haven't tried stacking external stores, but stacking external queriers is still working fine for me.
@kartik-moolya, not sure if this is still relevant, but what I did was set up a remote querier with:
```yaml
query:
  extraFlags:
    - --grpc-client-tls-skip-verify
  grpc:
    client:
      tls:
        enabled: true
        existingSecret:
          name: thanos-client-secret
          keyMapping:
            ca-cert: ca.pem
            tls-cert: cert.pem
            tls-key: key.pem
```
I'm using an ExternalSecret to pull `thanos-client-secret`.
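In case it helps, a rough sketch of such an ExternalSecret (the secret store name, namespace and remote key are placeholders for whatever backs your setup):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: thanos-client-secret
  namespace: monitoring            # placeholder namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-secret-store          # placeholder store
    kind: ClusterSecretStore
  target:
    name: thanos-client-secret
  dataFrom:
    - extract:
        key: thanos/client-tls     # placeholder key holding ca.pem/cert.pem/key.pem
```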
Then, on the observee's side:
```yaml
query:
  enabled: true
  stores:
    - dns+prom-prometheus-thanos:10901
  ingress:
    enabled: true
    hostname: "thanos.myhostname.com"
    tls: true
    annotations:
      kubernetes.io/ingress.class: nginx
      nginx.ingress.kubernetes.io/backend-protocol: "GRPC"
      ingress.kubernetes.io/ssl-redirect: "true"
      nginx.ingress.kubernetes.io/auth-tls-secret: monitoring/thanos-ca-secret
      nginx.ingress.kubernetes.io/auth-tls-verify-client: "true"
```
Same as before, `thanos-ca-secret` is pulled in by using an ExternalSecret object.
You can generate the secret with the following commands:
```console
# generate client cert and key
openssl req -new -newkey rsa:4096 -nodes -keyout client.key -out client.csr
# sign the CSR with your existing CA (ca.crt / ca.key)
openssl x509 -req -sha256 -days 365 -in client.csr -CA ca.crt -CAkey ca.key -set_serial 02 -out client.crt
```
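If you're not using ExternalSecret, you could pack the generated files into the Secret directly; note that the key names must match the `keyMapping` from the values above (namespace is a placeholder):

```console
# secret keys ca.pem/cert.pem/key.pem mirror the keyMapping in the values file
kubectl -n monitoring create secret generic thanos-client-secret \
  --from-file=ca.pem=ca.crt \
  --from-file=cert.pem=client.crt \
  --from-file=key.pem=client.key
```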
I have the setup you describe working correctly: a global Thanos Query pointed at leaf queriers that are exposed through an ALB.
The main things to check on the ALB are the health check path and the success codes: with `backend-protocol-version: GRPC`, the ALB evaluates the gRPC status code of the health check, so the success code is "0" (gRPC OK) and the path is the standard `/grpc.health.v1.Health/Check` method.
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    alb.ingress.kubernetes.io/backend-protocol-version: GRPC
    alb.ingress.kubernetes.io/group.name: internal-shared
    alb.ingress.kubernetes.io/healthcheck-path: /grpc.health.v1.Health/Check
    alb.ingress.kubernetes.io/healthcheck-protocol: HTTP
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 10901}]'
    alb.ingress.kubernetes.io/load-balancer-attributes: idle_timeout.timeout_seconds=120
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/ssl-redirect: "10901"
    alb.ingress.kubernetes.io/success-codes: "0"
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=10,load_balancing.algorithm.type=least_outstanding_requests
    alb.ingress.kubernetes.io/target-type: ip
  labels:
    app.kubernetes.io/component: query-layer
    app.kubernetes.io/instance: thanos-query
    app.kubernetes.io/name: thanos-query
    app.kubernetes.io/version: v0.36.1
    kustomize.toolkit.fluxcd.io/name: thanos
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: thanos-query
  namespace: monitoring
spec:
  ingressClassName: aws
  rules:
    - host: thanos-query-internal.example.com
      http:
        paths:
          - backend:
              service:
                name: thanos-query
                port:
                  number: 10901
            path: /
            pathType: Prefix
  tls:
    - hosts:
        - thanos-query-internal.example.com
      secretName: thanos-q-tls
```
Versions used:

- Prometheus: v2.38.0
- Thanos: 0.28.0
Environment:
AWS EKS
Issue description:
Trying to stack external queriers behind a centralized querier, meant to be our single entry point for observing other Kubernetes clusters, I have exposed the gRPC endpoint with a combination of external-dns and the ALB ingress controller. Everything shows healthy from the outside, and I can even connect to the gRPC endpoint from outside the cluster:
The following Ingress and Service definitions are used:
GRPC Ingress:
GRPC Service:
Error logs:
But if I try to connect directly to the exposed NodeIP:NodePort, everything works:
I'd really appreciate any ideas about what I'm missing here. Thanks in advance!
Originally posted by @envyj020 in https://github.com/thanos-io/thanos/discussions/5916