tektoncd / results

Long term storage of execution results.
Apache License 2.0
77 stars 74 forks source link

Liveness and Readiness probes fail with recommended certificate setup #521

Closed kbristow closed 1 year ago

kbristow commented 1 year ago

Expected Behavior

Expected that tekton-results-api pod would become healthy.

Actual Behavior

The tekton-results-api pod does not become healthy. Readiness and Liveness probes fail with 503 response. Doing a port-forward on the tekton-results-api's api container and doing a curl to the /healthz endpoint returns the following error:

{"code":14,"message":"connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tekton-results-api-service.tekton-pipelines.svc.cluster.local, not localhost\"","details":[]}

I am using self-signed certificates managed by certman. Adding localhost to the dnsNames list results in the Liveness and Readiness probes passing. As per prerequisites this is not the recommended config.

Steps to Reproduce the Problem

  1. Install Certman
  2. Install Tekton Pipelines (v0.38.4)
  3. Create Certificate and Certificate issuer as below:
    apiVersion: cert-manager.io/v1
    kind: Issuer
    metadata:
    name: tekton-results-selfsigned-issuer
    namespace: tekton-pipelines
    spec:
    selfSigned: {}
    ---
    apiVersion: cert-manager.io/v1
    kind: Certificate
    metadata:
    name: tekton-results-tls-cert
    namespace: tekton-pipelines
    spec:
    dnsNames:
    - tekton-results-api-service.tekton-pipelines.svc.cluster.local
    # - localhost # uncomment this to make probes pass
    issuerRef:
    kind: Issuer
    name: tekton-results-selfsigned-issuer
    secretName: tekton-results-tls
  4. Install Tekton Results following all install instructions besides certificate generation (v0.7.0)
  5. Observe unhealthy tekton-results-api pod

Additional Info

Server Version: v1.23.17-eks-c12679a
v0.38.4
xinnjie commented 1 year ago

Readiness/Liveness probe don't check certificate verification . You should curl with --insecure option, and see what happen. Normally it should be like this:

# port-forwarding...
> curl --insecure  https://localhost:8080/healthz
{"status":"SERVING"}

If not, I suspect something else went wrong, e.g. configuration problem.

kbristow commented 1 year ago

I have indeed used the insecure option for curl. It looks like the actual go http server is returning the body i showed because the certificate loaded does not have localhost as a dns name. In the example above are you using a cert generated using the instructions here or are you also including localhost in the subject alternate names? As in my example, adding localhost to the dns names makes everything work, it just doesnt seem to match up with the instructions.

xinnjie commented 1 year ago

In the example above are you using a cert generated using the instructions here or are you also including localhost in the subject alternate names?

I followed the instructions, which do not include localhost in the subject alternate names.

It looks like the actual go http server

Yes, I think so after carefully reading the error message.

Have no idea right now. Maybe you could turn log level to debug to see what is happening in Results API Server .

xinnjie commented 1 year ago

I have followed steps above but failed to reproduce {"code":14,"message":"connection error: desc = \"transport: authentication handshake failed: x509: certificate is valid for tekton-results-api-service.tekton-pipelines.svc.cluster.local, not localhost\"","details":[]} behavior.

> k -n tekton-pipelines get certificate tekton-results-tls-cert -o yaml           
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  creationTimestamp: "2023-07-12T12:28:47Z"
  generation: 1
  name: tekton-results-tls-cert
  namespace: tekton-pipelines
  resourceVersion: "2167048"
  uid: 02b9f5ef-802d-49a4-b91d-c5c8156b5db2
spec:
  dnsNames:
  - tekton-results-api-service.tekton-pipelines.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: tekton-results-selfsigned-issuer
  secretName: tekton-results-tls
status:
  conditions:
  - lastTransitionTime: "2023-07-12T12:28:47Z"
    message: Certificate is up to date and has not expired
    observedGeneration: 1
    reason: Ready
    status: "True"
    type: Ready
  notAfter: "2023-10-10T12:28:47Z"
  notBefore: "2023-07-12T12:28:47Z"
  renewalTime: "2023-09-10T12:28:47Z"
  revision: 1
# after kubectl port-forward -n tekton-pipelines service/tekton-results-api-service 8080
> curl --insecure  https://localhost:8080/healthz                      
{"status":"SERVING"}
> k -n tekton-pipelines get pod tekton-results-api-59c8c69744-b7xkl                                            
NAME                                  READY   STATUS    RESTARTS   AGE
tekton-results-api-59c8c69744-b7xkl   1/1     Running   0          4m9s

Results version: v0.7.0

Could you please post Results API Server logs?

xinnjie commented 1 year ago
  1. Change logging level to debug. kubectl -n tekton-pipelines edit cm tekton-results-config-logging set as below
    
    zap-logger-config: |
    {
      "level": "debug",
      "development": true,
     .....
    }

2.  curl again
3.  check Results API Server logs 
kbristow commented 1 year ago

So interestingly, if i port forward to the service as you have done, I do not get the error. If I port forward to the pod, i do.

So example of what I was doing which returns a 503 code and the error response I showed earlier:

kubectl port-forward -n tekton-pipelines pods/tekton-results-api-57845986d9-8lrfw 8080:8080

Regardless, I bumped the logging level, and I tried again. The curls do not actually correlate to anything in the logs. There are a bunch of tls errors which seem to match the cadence of the probes:

2023/07/12 13:10:45 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
{"level":"warn","ts":1689167445.1729739,"caller":"api/main.go:116","msg":"Kubernetes RBAC authorization check disabled - all requests will be allowed by the API server"}
{"level":"info","ts":1689167445.3304975,"caller":"api/main.go:195","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689167445.3309426,"caller":"api/main.go:237","msg":"gRPC and REST server listening on: 8080"}
2023/07/12 13:10:54 http: TLS handshake error from 127.0.0.1:34026: remote error: tls: bad certificate
2023/07/12 13:11:04 http: TLS handshake error from 127.0.0.1:33524: remote error: tls: bad certificate
2023/07/12 13:11:09 http: TLS handshake error from 127.0.0.1:33542: remote error: tls: bad certificate
2023/07/12 13:11:14 http: TLS handshake error from 127.0.0.1:44222: remote error: tls: bad certificate
2023/07/12 13:11:24 http: TLS handshake error from 127.0.0.1:33502: remote error: tls: bad certificate

Here the events shown from a describe on the pod:

Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m20s                  default-scheduler  Successfully assigned tekton-pipelines/tekton-results-api-57845986d9-8lrfw to ip-10-123-123-12.af-south-1.compute.internal
  Normal   Pulled     4m40s (x2 over 6m19s)  kubelet            Container image "gcr.io/tekton-releases/github.com/tektoncd/results/cmd/api:v0.7.0@sha256:08aea538dc699c27715523fc89115cc805a0b5838c924868907189fcc99099b9" already present on machine
  Normal   Created    4m40s (x2 over 6m19s)  kubelet            Created container api
  Normal   Killing    4m40s                  kubelet            Container api failed startup probe, will be restarted
  Normal   Started    4m39s (x2 over 6m19s)  kubelet            Started container api
  Warning  Unhealthy  70s (x31 over 6m10s)   kubelet            Startup probe failed: HTTP probe failed with statuscode: 503

Note the 503 - which assumed meant the result was the same as that from the curl I was doing.

xinnjie commented 1 year ago

I dig a little.

Background:

Results API Server took a little trick to server as GRPC server and RESTful server simultaneously: on port 8080, it register both grpc service and http service and process request base on http protocol version and Content-Type. So one health check flow like this: (1)Kubelet probe -> (2)Results Server RESTful service -> (3)Results Server GRPC service.

Why Results API Server complained tls: bad certificate?

Problem occur when (2)->(3) happened. Results Server need to dial itself and forward this health check request with in GRPC format. Because original http request do not have authority header and the path Kubelet request is simply http://localhost:8080, localhost does not match to tekton-results-api-service.tekton-pipelines.svc.cluster.local in TLS configuration, so TLS handshake failed.

Workaround

Add environment variable TLS_HOSTNAME_OVERRIDE valued as tekton-results-api-service.tekton-pipelines.svc.cluster.local to Results API Server pod configuration. See this code

apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
  namespace: tekton-pipelines
    spec:
      serviceAccountName: api
      containers:
        - name: api
          image: ....
          env:
            - name: TLS_HOSTNAME_OVERRIDE
              value: tekton-results-api-service.tekton-pipelines.svc.cluster.local

@sayan-biswas Is here something wrong? As document says:

TLS_HOSTNAME_OVERRIDE Override the hostname used to serve TLS. This should not be set (or set to the empty string) in production environments.

But it seems even in production environment, TLS_HOSTNAME_OVERRIDE need to be set to make liveness probe working.

Solution

  1. Add authority header to liveness/readiness HTTP probe when TLS is enabled.
  2. health check dial GRPC with insecure credentials.
xinnjie commented 1 year ago

I always use e2e test configuration so I did not reproduce the error earlier. E2e test configuration add TLS_HOSTNAME_OVERRIDE by default

sayan-biswas commented 1 year ago

@xinnjie This is not happening because of the multiplexing of the gRPC and REST traffic to one port. But this is happening because the REST service in this is a proxy to the gRPC server, which is provide by the package gRPC gateway. So, in code generated by the gRPC gateway I passed an instance of the gRPC client conn.

https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L213

Ad if I leave that field blank it will default to localhost and the certificate generated externally only contains the FQDN of the service in CNAME. That why when @kbristow as both the CNAMEs it was working. In production environment this will be behind an ingress which should re-encrypt the traffic with the generated certificate.

And I also think the document is a little unclear or outdated.

sayan-biswas commented 1 year ago

@kbristow The health probe logs are filtered out here. https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L146

Because it will spam the log. But maybe for debug we can enable this.

xinnjie commented 1 year ago

@kbristow The health probe logs are filtered out here.

https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L146

Because it will spam the log. But maybe for debug we can enable this.

@sayan-biswas How to explain 2023/07/12 13:10:54 http: TLS handshake error from 127.0.0.1:34026: remote error: tls: bad certificate this kind of log still exist? Looks like somewhere of http server do not use zap logger.

xinnjie commented 1 year ago

@xinnjie This is not happening because of the multiplexing of the gRPC and REST traffic to one port. But this is happening because the REST service in this is a proxy to the gRPC server, which is provide by the package gRPC gateway. So, in code generated by the gRPC gateway I passed an instance of the gRPC client conn.

https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L213

Ad if I leave that field blank it will default to localhost and the certificate generated externally only contains the FQDN of the service in CNAME. That why when @kbristow as both the CNAMEs it was working. In production environment this will be behind an ingress which should re-encrypt the traffic with the generated certificate.

And I also think the document is a little unclear or outdated.

@sayan-biswas Thanks for your explanations. I understood in production env it works fine now. What I am afraid is that a user wants to try out Results demo but encounter this"pod is being unhealthy constantly" behavior. Something needs to be done here.

sayan-biswas commented 1 year ago

@kbristow The health probe logs are filtered out here. https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L146

Because it will spam the log. But maybe for debug we can enable this.

@sayan-biswas How to explain 2023/07/12 13:10:54 http: TLS handshake error from 127.0.0.1:34026: remote error: tls: bad certificate this kind of log still exist? Looks like somewhere of http server do not use zap logger.

Yes, this comes from the http internal libraries IG.

sayan-biswas commented 1 year ago

@xinnjie This is not happening because of the multiplexing of the gRPC and REST traffic to one port. But this is happening because the REST service in this is a proxy to the gRPC server, which is provide by the package gRPC gateway. So, in code generated by the gRPC gateway I passed an instance of the gRPC client conn. https://github.com/tektoncd/results/blob/e85794158c43defca77deb2b63201497dca9f45d/cmd/api/main.go#L213

Ad if I leave that field blank it will default to localhost and the certificate generated externally only contains the FQDN of the service in CNAME. That why when @kbristow as both the CNAMEs it was working. In production environment this will be behind an ingress which should re-encrypt the traffic with the generated certificate. And I also think the document is a little unclear or outdated.

@sayan-biswas Thanks for your explanations. I understood in production env it works fine now. What I am afraid is that a user wants to try out Results demo but encounter this"pod is being unhealthy constantly" behavior. Something needs to be done here.

In that case I guess we can default the TLS_HOSTNAME_OVERRIDE to tekton-results-api-service.tekton-pipelines.svc.cluster.local

xinnjie commented 1 year ago

Yes, agreed.

xinnjie commented 1 year ago

@sayan-biswas Sorry~ I am still wondering after thinking twice. In production env without meshing service, how could the request from RESTful service to GRPC service to be re-encrypted with address being "localhost" and without authority header

kbristow commented 1 year ago

Add environment variable TLS_HOSTNAME_OVERRIDE valued as tekton-results-api-service.tekton-pipelines.svc.cluster.local to Results API Server pod configuration

@xinnjie Thanks. I did see this environment variable but like you say, it says not for use in production and the only use I saw in the repo itself was for the e2e tests. The documentation is certainly not clear on how to use it/when to use it. I will use for my setup for now though.

In production environment this will be behind an ingress which should re-encrypt the traffic with the generated certificate.

@sayan-biswas Im not sure I understand how having the service behind an ingress will help with the k8s probes, which have no knowledge of that ingress.

sayan-biswas commented 1 year ago

@kbristow Health probe is not related to ingress comment. What I meant is normally the service will be used in that way. So the cert manager can generate certificate with have multiple cnames and the ingress can use that cert to re-encrypt the traffic. We have faced the same issue, but not with health probes, but for normally usage of the REST APIs. If we leave the tls override blank, it will default to 'localhost' in the internal libraries and we have no control over that.

Since the TLS is already validated in the HTTP listener, there's no need to validate it again at the gRPC server listener. So we can stop verifying the authority hearder there.

I'll try this out and open a PR if it works, else I'll default the TLS_HOSTNAME_OVERRIDE config to the service FQDN and update the document.

kbristow commented 1 year ago

Thanks. I understand.

sayan-biswas commented 1 year ago

@kbristow Can you checkout PR #536 and see if that solves your issue?

kbristow commented 1 year ago

I will have a look when I can 👍

xinnjie commented 1 year ago

Tested on local machine. It fixed the problem as expected.

before: Results API Server complains about TLS handshake error from 127.0.0.1:35826: remote error: tls: bad certificate

> kubectl kustomize config/overlays/logs-local-db" | ko delete  -f -
> kubectl kustomize config/overlays/logs-local-db" | ko apply  -f -
... apply successfully
> k -n tekton-pipelines get pods                                            
NAME                                           READY   STATUS    RESTARTS        AGE
tekton-dashboard-69bfd6b5ff-b4pbc              1/1     Running   0               45h
tekton-events-controller-697685d976-6kmqw      1/1     Running   14 (99s ago)    21d
tekton-pipelines-controller-76d89b568d-n8fdl   1/1     Running   10 (99s ago)    15d
tekton-pipelines-webhook-7c86c9f9b4-m7r7r      1/1     Running   12 (100s ago)   21d
tekton-results-api-7964ff94f6-df6b6            0/1     Running   0               13s
tekton-results-postgres-0                      1/1     Running   0               13s
tekton-results-watcher-6c87f7bfd4-g99dk        1/1     Running   0               13s

> k -n tekton-pipelines logs tekton-results-api-7964ff94f6-df6b6 
2023/07/20 12:58:44 maxprocs: Leaving GOMAXPROCS=5: CPU quota undefined
{"level":"warn","ts":1689857925.068253,"caller":"api/main.go:103","msg":"Error connecting to database (retrying in 10s): failed to connect to `host=tekton-results-postgres-service.tekton-pipelines.svc.cluster.local user=postgres database=tekton-results`: dial error (dial tcp 10.106.165.23:5432: connect: connection refused)"}
{"level":"info","ts":1689857935.189105,"caller":"api/main.go:119","msg":"Kubernetes RBAC authorization check enabled"}
{"level":"info","ts":1689857935.2092667,"caller":"api/main.go:131","msg":"Kubernetes RBAC impersonation enabled"}
{"level":"info","ts":1689857936.332291,"caller":"api/main.go:198","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689857936.376498,"caller":"api/main.go:240","msg":"gRPC and REST server listening on: 8080"}
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35302: remote error: tls: bad certificate
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35300: remote error: tls: bad certificate
2023/07/20 12:58:57 http: TLS handshake error from 127.0.0.1:35298: remote error: tls: bad certificate
2023/07/20 12:58:58 http: TLS handshake error from 127.0.0.1:35308: remote error: tls: bad certificate
2023/07/20 12:58:58 http: TLS handshake error from 127.0.0.1:35310: remote error: tls: bad certificate
2023/07/20 12:58:59 http: TLS handshake error from 127.0.0.1:35316: remote error: tls: bad certificate
2023/07/20 12:59:01 http: TLS handshake error from 127.0.0.1:35820: remote error: tls: bad certificate
2023/07/20 12:59:01 http: TLS handshake error from 127.0.0.1:35822: remote error: tls: bad certificate
2023/07/20 12:59:02 http: TLS handshake error from 127.0.0.1:35826: remote error: tls: bad certificate

After checkout to issue-521

> kubectl kustomize config/overlays/logs-local-db  | ko apply  -f -

> k -n tekton-pipelines get pods                                            
NAME                                           READY   STATUS    RESTARTS         AGE
tekton-dashboard-69bfd6b5ff-b4pbc              1/1     Running   0                46h
tekton-events-controller-697685d976-6kmqw      1/1     Running   14 (7m10s ago)   21d
tekton-pipelines-controller-76d89b568d-n8fdl   1/1     Running   10 (7m10s ago)   15d
tekton-pipelines-webhook-7c86c9f9b4-m7r7r      1/1     Running   12 (7m11s ago)   21d
tekton-results-api-76b7559d7b-rkvrs            1/1     Running   0                23s
tekton-results-postgres-0                      1/1     Running   0                5m44s
tekton-results-watcher-598d6b4c9-clbd4         1/1     Running   0                23s

> k -n tekton-pipelines logs tekton-results-api-76b7559d7b-rkvrs 
2023/07/20 13:04:08 maxprocs: Leaving GOMAXPROCS=5: CPU quota undefined
{"level":"info","ts":1689858248.6841698,"caller":"api/main.go:120","msg":"Kubernetes RBAC authorization check enabled"}
{"level":"info","ts":1689858248.7065732,"caller":"api/main.go:132","msg":"Kubernetes RBAC impersonation enabled"}
{"level":"info","ts":1689858250.0390294,"caller":"api/main.go:199","msg":"Prometheus server listening on: 9090"}
{"level":"info","ts":1689858250.114967,"caller":"api/main.go:240","msg":"gRPC and REST server listening on: 8080"}
kbristow commented 1 year ago

Thanks, that looks great 👍