Closed kbristow closed 1 year ago
Readiness/liveness probes don't verify the certificate. You should curl with the --insecure option and see what happens.
Normally it should look like this:
# port-forwarding...
> curl --insecure https://localhost:8080/healthz
{"status":"SERVING"}
If not, I suspect something else went wrong, e.g. configuration problem.
I have indeed used the insecure option for curl. It looks like the actual Go HTTP server is returning the body I showed because the loaded certificate does not have localhost as a DNS name. In the example above, are you using a cert generated using the instructions here, or are you also including localhost in the subject alternative names? As in my example, adding localhost to the DNS names makes everything work; it just doesn't seem to match up with the instructions.
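Concretely, the workaround amounts to adding localhost to the Certificate's dnsNames, roughly like this (a sketch of the cert-manager resource; per the instructions this is not the recommended config):

```yaml
spec:
  dnsNames:
    - tekton-results-api-service.tekton-pipelines.svc.cluster.local
    - localhost  # workaround only: makes the /healthz check pass
```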
In the example above, are you using a cert generated using the instructions here, or are you also including localhost in the subject alternative names?
I followed the instructions, which do not include localhost in the subject alternative names.
It looks like the actual go http server
Yes, I think so after carefully reading the error message.
No idea right now. Maybe you could turn the log level up to debug to see what is happening in the Results API Server.
I have followed the steps above but failed to reproduce the
{"code":14,"message":"connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tekton-results-api-service.tekton-pipelines.svc.cluster.local, not localhost\"","details":[]}
behavior.
> k -n tekton-pipelines get certificate tekton-results-tls-cert -o yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: "2023-07-12T12:28:47Z"
  generation: 1
  name: tekton-results-tls-cert
  namespace: tekton-pipelines
  resourceVersion: "2167048"
  uid: 02b9f5ef-802d-49a4-b91d-c5c8156b5db2
spec:
  dnsNames:
  - tekton-results-api-service.tekton-pipelines.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: tekton-results-selfsigned-issuer
  secretName: tekton-results-tls
status:
  conditions:
  - lastTransitionTime: "2023-07-12T12:28:47Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2023-10-10T12:28:47Z"
  notBefore: "2023-07-12T12:28:47Z"
  renewalTime: "2023-09-10T12:28:47Z"
  revision: 1
# after kubectl port-forward -n tekton-pipelines service/tekton-results-api-service 8080
> curl --insecure https://localhost:8080/healthz
{"status":"SERVING"}
> k -n tekton-pipelines get pod tekton-results-api-59c8c69744-b7xkl
NAME READY STATUS RESTARTS AGE
tekton-results-api-59c8c69744-b7xkl 1/1 Running 0 4m9s
Results version: v0.7.0
Could you please post the Results API Server logs?
1. kubectl -n tekton-pipelines edit cm tekton-results-config-logging and set it as below:
zap-logger-config: |
  {
    "level": "debug",
    "development": true,
    .....
  }
2. curl again
3. check the Results API Server logs
So interestingly, if I port-forward to the service as you have done, I do not get the error. If I port-forward to the pod, I do.
So here is an example of what I was doing, which returns a 503 code and the error response I showed earlier:
kubectl port-forward -n tekton-pipelines pods/tekton-results-api-57845986d9-8lrfw 8080:8080
Regardless, I bumped the logging level and tried again. The curls do not actually correlate to anything in the logs. There are a bunch of TLS errors which seem to match the cadence of the probes:
2023/07/12 13:10:45 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
{"level":"warn","ts":1689167445.1729739,"caller":"api/main.go:116","msg":"Kubernetes RBAC authorization check disabled - all requests will be allowed by the API server"}
{"level":"info","ts":1689167445.3304975,"caller":"api/main.go:195","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689167445.3309426,"caller":"api/main.go:237","msg":"gRPC and REST server listening on: 8080"}
2023/07/12 13:10:54 http: TLS handshake error from 127.0.0.1:34026: remote error: tls: bad certificate
2023/07/12 13:11:04 http: TLS handshake error from 127.0.0.1:33524: remote error: tls: bad certificate
2023/07/12 13:11:09 http: TLS handshake error from 127.0.0.1:33542: remote error: tls: bad certificate
2023/07/12 13:11:14 http: TLS handshake error from 127.0.0.1:44222: remote error: tls: bad certificate
2023/07/12 13:11:24 http: TLS handshake error from 127.0.0.1:33502: remote error: tls: bad certificate
Here are the events shown from a describe on the pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 6m20s default-scheduler Successfully assigned tekton-pipelines/tekton-results-api-57845986d9-8lrfw to ip-10-123-123-12.af-south-1.compute.internal
Normal Pulled 4m40s (x2 over 6m19s) kubelet Container image "gcr.io/tekton-releases/github.com/tektoncd/results/cmd/api:v0.7.0@sha256:08aea538dc699c27715523fc89115cc805a0b5838c924868907189fcc99099b9" already present on machine
Normal Created 4m40s (x2 over 6m19s) kubelet Created container api
Normal Killing 4m40s kubelet Container api failed startup probe, will be restarted
Normal Started 4m39s (x2 over 6m19s) kubelet Started container api
Warning Unhealthy 70s (x31 over 6m10s) kubelet Startup probe failed: HTTP probe failed with statuscode: 503
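For reference, the failing kubelet probe presumably looks roughly like this (a sketch; the exact fields in the Results manifests may differ). Note that the kubelet itself skips certificate verification for HTTPS probes, which is why the failure has to come from inside the server:

```yaml
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
    scheme: HTTPS  # kubelet does not verify the server certificate for probes
```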
Note the 503, which I assumed meant the result was the same as that from the curl I was doing.
I dug in a little.
The Results API Server uses a little trick to serve as a gRPC server and a RESTful server simultaneously: on port 8080 it registers both the gRPC service and the HTTP service, and routes each request based on the HTTP protocol version and Content-Type.
So one health check flows like this: (1) kubelet probe -> (2) Results Server RESTful service -> (3) Results Server gRPC service.
The tls: bad certificate problem occurs at the (2) -> (3) step. The Results Server needs to dial itself and forward the health check request in gRPC format. Because the original HTTP request has no authority header and the address the kubelet requests is simply http://localhost:8080, localhost does not match tekton-results-api-service.tekton-pipelines.svc.cluster.local in the TLS configuration, so the TLS handshake fails.
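To make the mismatch concrete, here is a minimal stdlib Go sketch (purely illustrative, not the Results code) that builds a certificate whose only DNS SAN is the service FQDN, like the one cert-manager issues, and shows hostname verification failing for localhost:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"fmt"
	"math/big"
	"time"
)

// newServiceCert builds a throwaway self-signed certificate whose only
// DNS SAN is the service FQDN, mirroring what the documented cert-manager
// setup issues.
func newServiceCert() *x509.Certificate {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		DNSNames:     []string{"tekton-results-api-service.tekton-pipelines.svc.cluster.local"},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	return cert
}

func main() {
	cert := newServiceCert()
	// Step (2)->(3) dials "localhost", which is not in the cert's SANs,
	// so hostname verification fails with an x509 mismatch error:
	fmt.Println(cert.VerifyHostname("localhost"))
	// Verifying against the FQDN the cert was issued for succeeds (prints <nil>):
	fmt.Println(cert.VerifyHostname("tekton-results-api-service.tekton-pipelines.svc.cluster.local"))
}
```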
Add an environment variable TLS_HOSTNAME_OVERRIDE with the value tekton-results-api-service.tekton-pipelines.svc.cluster.local to the Results API Server pod configuration. See this code
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: tekton-pipelines
spec:
  template:
    spec:
      serviceAccountName: api
      containers:
        - name: api
          image: ....
          env:
            - name: TLS_HOSTNAME_OVERRIDE
              value: tekton-results-api-service.tekton-pipelines.svc.cluster.local
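As a self-contained illustration of why the override helps, this stdlib Go sketch (all names assumed; not taken from the Results code) performs a real TLS handshake against a server presenting a cert issued only for the FQDN, first verifying as localhost and then with the expected server name overridden, which is effectively what TLS_HOSTNAME_OVERRIDE does:

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"math/big"
	"net"
	"time"
)

const serviceFQDN = "tekton-results-api-service.tekton-pipelines.svc.cluster.local"

// selfSignedFQDNCert builds a throwaway cert/pool pair whose only SAN is
// the service FQDN (standing in for the cert-manager-issued certificate).
func selfSignedFQDNCert() (tls.Certificate, *x509.CertPool) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		panic(err)
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		DNSNames:     []string{serviceFQDN},
		NotBefore:    time.Now().Add(-time.Minute),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		panic(err)
	}
	leaf, err := x509.ParseCertificate(der)
	if err != nil {
		panic(err)
	}
	pool := x509.NewCertPool()
	pool.AddCert(leaf)
	return tls.Certificate{Certificate: [][]byte{der}, PrivateKey: key}, pool
}

// startTLSServer listens on loopback with the given cert and answers handshakes.
func startTLSServer(cert tls.Certificate) net.Listener {
	ln, err := tls.Listen("tcp", "127.0.0.1:0", &tls.Config{Certificates: []tls.Certificate{cert}})
	if err != nil {
		panic(err)
	}
	go func() {
		for {
			c, err := ln.Accept()
			if err != nil {
				return
			}
			go func(c net.Conn) {
				c.(*tls.Conn).Handshake() // drive the handshake, then drop the conn
				c.Close()
			}(c)
		}
	}()
	return ln
}

// dialAs connects to addr, verifying the server cert against serverName.
func dialAs(addr, serverName string, pool *x509.CertPool) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{RootCAs: pool, ServerName: serverName})
	if err == nil {
		conn.Close()
	}
	return err
}

func main() {
	cert, pool := selfSignedFQDNCert()
	ln := startTLSServer(cert)
	defer ln.Close()

	// Without an override the client verifies against "localhost" and fails,
	// just like the internal health-check dial in this issue:
	fmt.Println("as localhost:", dialAs(ln.Addr().String(), "localhost", pool))
	// With the expected name overridden to the FQDN, the handshake succeeds:
	fmt.Println("as FQDN:", dialAs(ln.Addr().String(), serviceFQDN, pool))
}
```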
@sayan-biswas Is something wrong here? As the document says:
TLS_HOSTNAME_OVERRIDE: Override the hostname used to serve TLS. This should not be set (or set to the empty string) in production environments.
But it seems that even in a production environment, TLS_HOSTNAME_OVERRIDE needs to be set to make the liveness probe work.
authority header to liveness/readiness HTTP probe when TLS is enabled.
I always use the e2e test configuration, so I did not reproduce the error earlier. The e2e test configuration adds TLS_HOSTNAME_OVERRIDE by default.
@xinnjie This is not happening because of the multiplexing of the gRPC and REST traffic onto one port. It is happening because the REST service here is a proxy to the gRPC server, provided by the gRPC gateway package. So, in the code generated by the gRPC gateway, I passed an instance of the gRPC client conn. https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L213
And if I leave that field blank it will default to localhost, and the externally generated certificate only contains the FQDN of the service in its CNAME. That's why it was working when @kbristow had both CNAMEs. In a production environment this will be behind an ingress, which should re-encrypt the traffic with the generated certificate.
And I also think the document is a little unclear or outdated.
@kbristow The health probe logs are filtered out here: https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L146
Because they would spam the log. But maybe for debugging we could enable this.
@sayan-biswas How do you explain that this kind of log still exists?
2023/07/12 13:10:54 http: TLS handshake error from 127.0.0.1:34026: remote error: tls: bad certificate
It looks like somewhere the HTTP server does not use the zap logger.
@sayan-biswas Thanks for your explanations. I understand now that it works fine in a production environment. What I am afraid of is a user who wants to try out the Results demo but encounters this "pod is constantly unhealthy" behavior. Something needs to be done here.
Yes, this comes from the HTTP internal libraries, I guess.
In that case I guess we can default TLS_HOSTNAME_OVERRIDE to tekton-results-api-service.tekton-pipelines.svc.cluster.local.
Yes, agreed.
@sayan-biswas Sorry, I am still wondering after thinking twice. In a production environment without a service mesh, how could the request from the RESTful service to the gRPC service be re-encrypted, given that the address is "localhost" and there is no authority header?
Add environment variable TLS_HOSTNAME_OVERRIDE valued as tekton-results-api-service.tekton-pipelines.svc.cluster.local to Results API Server pod configuration
@xinnjie Thanks. I did see this environment variable, but as you say, it says it is not for use in production, and the only use I saw in the repo itself was in the e2e tests. The documentation is certainly not clear on how or when to use it. I will use it for my setup for now, though.
In production environment this will be behind an ingress which should re-encrypt the traffic with the generated certificate.
@sayan-biswas I'm not sure I understand how having the service behind an ingress will help with the k8s probes, which have no knowledge of that ingress.
@kbristow The health probe issue is not related to the ingress comment. What I meant is that normally the service will be used that way, so cert-manager can generate a certificate with multiple CNAMEs and the ingress can use that cert to re-encrypt the traffic. We have faced the same issue, not with health probes but with normal usage of the REST APIs. If we leave the TLS override blank, it defaults to 'localhost' in the internal libraries and we have no control over that.
Since the TLS is already validated at the HTTP listener, there's no need to validate it again at the gRPC server listener, so we can stop verifying the authority header there.
I'll try this out and open a PR if it works; otherwise I'll default the TLS_HOSTNAME_OVERRIDE config to the service FQDN and update the document.
Thanks. I understand.
@kbristow Can you checkout PR #536 and see if that solves your issue?
I will have a look when I can 👍
Tested on local machine. It fixed the problem as expected.
Before:
The Results API Server complains about TLS handshake error from 127.0.0.1:35826: remote error: tls: bad certificate
> kubectl kustomize config/overlays/logs-local-db | ko delete -f -
> kubectl kustomize config/overlays/logs-local-db | ko apply -f -
... applied successfully
> k -n tekton-pipelines get pods
NAME READY STATUS RESTARTS AGE
tekton-dashboard-69bfd6b5ff-b4pbc 1/1 Running 0 45h
tekton-events-controller-697685d976-6kmqw 1/1 Running 14 (99s ago) 21d
tekton-pipelines-controller-76d89b568d-n8fdl 1/1 Running 10 (99s ago) 15d
tekton-pipelines-webhook-7c86c9f9b4-m7r7r 1/1 Running 12 (100s ago) 21d
tekton-results-api-7964ff94f6-df6b6 0/1 Running 0 13s
tekton-results-postgres-0 1/1 Running 0 13s
tekton-results-watcher-6c87f7bfd4-g99dk 1/1 Running 0 13s
> k -n tekton-pipelines logs tekton-results-api-7964ff94f6-df6b6
2023/07/20 12:58:44 maxprocs: Leaving GOMAXPROCS=5: CPU quota undefined
{"level":"warn","ts":1689857925.068253,"caller":"api/main.go:103","msg":"Error connecting to database (retrying in 10s): failed to connect to `host=tekton-results-postgres-service.tekton-pipelines.svc.cluster.local user=postgres database=tekton-results`: dial error (dial tcp 10.106.165.23:5432: connect: connection refused)"}
{"level":"info","ts":1689857935.189105,"caller":"api/main.go:119","msg":"Kubernetes RBAC authorization check enabled"}
{"level":"info","ts":1689857935.2092667,"caller":"api/main.go:131","msg":"Kubernetes RBAC impersonation enabled"}
{"level":"info","ts":1689857936.332291,"caller":"api/main.go:198","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689857936.376498,"caller":"api/main.go:240","msg":"gRPC and REST server listening on: 8080"}
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35302: remote error: tls: bad certificate
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35300: remote error: tls: bad certificate
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35298: remote error: tls: bad certificate
2023/07/20 12:58:58 http: TLS handshake error from 127.0.0.1:35308: remote error: tls: bad certificate
2023/07/20 12:58:58 http: TLS handshake error from 127.0.0.1:35310: remote error: tls: bad certificate
2023/07/20 12:58:59 http: TLS handshake error from 127.0.0.1:35316: remote error: tls: bad certificate
2023/07/20 12:59:01 http: TLS handshake error from 127.0.0.1:35820: remote error: tls: bad certificate
2023/07/20 12:59:01 http: TLS handshake error from 127.0.0.1:35822: remote error: tls: bad certificate
2023/07/20 12:59:02 http: TLS handshake error from 127.0.0.1:35826: remote error: tls: bad certificate
After checking out issue-521:
> kubectl kustomize config/overlays/logs-local-db | ko apply -f -
> k -n tekton-pipelines get pods
NAME READY STATUS RESTARTS AGE
tekton-dashboard-69bfd6b5ff-b4pbc 1/1 Running 0 46h
tekton-events-controller-697685d976-6kmqw 1/1 Running 14 (7m10s ago) 21d
tekton-pipelines-controller-76d89b568d-n8fdl 1/1 Running 10 (7m10s ago) 15d
tekton-pipelines-webhook-7c86c9f9b4-m7r7r 1/1 Running 12 (7m11s ago) 21d
tekton-results-api-76b7559d7b-rkvrs 1/1 Running 0 23s
tekton-results-postgres-0 1/1 Running 0 5m44s
tekton-results-watcher-598d6b4c9-clbd4 1/1 Running 0 23s
> k -n tekton-pipelines logs tekton-results-api-76b7559d7b-rkvrs
2023/07/20 13:04:08 maxprocs: Leaving GOMAXPROCS=5: CPU quota undefined
{"level":"info","ts":1689858248.6841698,"caller":"api/main.go:120","msg":"Kubernetes RBAC authorization check enabled"}
{"level":"info","ts":1689858248.7065732,"caller":"api/main.go:132","msg":"Kubernetes RBAC impersonation enabled"}
{"level":"info","ts":1689858250.0390294,"caller":"api/main.go:199","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689858250.114967,"caller":"api/main.go:240","msg":"gRPC and REST server listening on: 8080"}
Thanks, that looks great 👍
Expected Behavior
Expected that the tekton-results-api pod would become healthy.
Actual Behavior
The tekton-results-api pod does not become healthy. Readiness and liveness probes fail with a 503 response. Doing a port-forward on the tekton-results-api's api container and curling the /healthz endpoint returns the following error:
I am using self-signed certificates managed by cert-manager. Adding localhost to the dnsNames list results in the liveness and readiness probes passing. As per the prerequisites, this is not the recommended config.
Steps to Reproduce the Problem
tekton-results-api pod
Additional Info