knative-extensions / eventing-natss

NATS streaming integration with Knative Eventing.
Apache License 2.0

x509 certificate signed by unknown authority #383

Closed astelmashenko closed 1 year ago

astelmashenko commented 1 year ago

Describe the bug After deleting a broker, the jetstream controller cannot finalize the underlying channel because of a communication error with nats-webhook.

Expected behavior Broker/channel delete is working.

Knative release version 1.3.2

Additional context eventing-natss version is 1.3.5

I created a broker and then deleted it, then observed that the channel had not been deleted, and found the following error logs. jetstream-channel-controller:

{
    "level": "error",
    "ts": "2022-12-23T11:30:23.039Z",
    "logger": "jetstream-channel-controller",
    "caller": "controller/controller.go:559",
    "msg": "Reconcile error",
    "knative.dev/pod": "jetstream-ch-controller-57c65d84fb-5p5pm",
    "knative.dev/controller": "knative.dev.eventing-natss.pkg.channel.jetstream.controller.Reconciler",
    "knative.dev/kind": "messaging.knative.dev.NatsJetStreamChannel",
    "knative.dev/traceid": "b83875e4-319c-447c-ae15-0cff0e8ab9e3",
    "knative.dev/key": "viax/internal-kne-trigger",
    "duration": 0.049331877,
    "error": "failed to clear finalizers: Internal error occurred: failed calling webhook \"webhook.nats.messaging.knative.dev\": Post \"https://nats-webhook.knative-eventing.svc:443/defaulting?timeout=2s\": x509: certificate signed by unknown authority (possibly because of \"x509: ECDSA verification failure\" while trying to verify candidate authority certificate \"nats-webhook.knative-eventing.svc\")",
    "stacktrace": "knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:559\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:536\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:484"
}

and nats-webhook logs:

{"level":"info","ts":"2022-12-23T13:40:55.322Z","logger":"nats-webhook","caller":"webhook/admission.go:90","msg":"Webhook ServeHTTP request=&http.Request{Method:\"POST\", URL:(*url.URL)(0xc000a02cf0), Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Accept\":[]string{\"application/json, */*\"}, \"Accept-Encoding\":[]string{\"gzip\"}, \"Content-Length\":[]string{\"37445\"}, \"Content-Type\":[]string{\"application/json\"}, \"User-Agent\":[]string{\"kube-apiserver-admission\"}}, Body:(*http.body)(0xc0008b7700), GetBody:(func() (io.ReadCloser, error))(nil), ContentLength:37445, TransferEncoding:[]string(nil), Close:false, Host:\"nats-webhook.knative-eventing.svc:443\", Form:url.Values(nil), PostForm:url.Values(nil), MultipartForm:(*multipart.Form)(nil), Trailer:http.Header(nil), RemoteAddr:\"172.212.1.64:38352\", RequestURI:\"/defaulting?timeout=2s\", TLS:(*tls.ConnectionState)(0xc00019d6b0), Cancel:(<-chan struct {})(nil), Response:(*http.Response)(nil), ctx:(*context.cancelCtx)(0xc0008b7740)}"}
{"level":"info","ts":"2022-12-23T13:40:55.341Z","logger":"nats-webhook","caller":"defaulting/defaulting.go:158","msg":"Kind: \"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel\" PatchBytes: null","knative.dev/kind":"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel","knative.dev/namespace":"viax","knative.dev/name":"internal-kne-trigger","knative.dev/operation":"UPDATE","knative.dev/resource":"messaging.knative.dev/v1alpha1, Resource=natsjetstreamchannels","knative.dev/subresource":"","knative.dev/userinfo":"{system:serviceaccount:knative-eventing:jetstream-ch-controller 1d7b066d-347a-4466-9374-fcf4d8529081 [system:serviceaccounts system:serviceaccounts:knative-eventing system:authenticated] map[authentication.kubernetes.io/pod-name:[jetstream-ch-controller-57c65d84fb-fw8td] authentication.kubernetes.io/pod-uid:[a05e83d6-0041-4728-b9d6-aab1ead4b94c]]}"}
{"level":"info","ts":"2022-12-23T13:40:55.341Z","logger":"nats-webhook","caller":"webhook/admission.go:133","msg":"remote admission controller audit annotations=map[string]string(nil)","knative.dev/kind":"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel","knative.dev/namespace":"viax","knative.dev/name":"internal-kne-trigger","knative.dev/operation":"UPDATE","knative.dev/resource":"messaging.knative.dev/v1alpha1, Resource=natsjetstreamchannels","knative.dev/subresource":"","knative.dev/userinfo":"{system:serviceaccount:knative-eventing:jetstream-ch-controller 1d7b066d-347a-4466-9374-fcf4d8529081 [system:serviceaccounts system:serviceaccounts:knative-eventing system:authenticated] map[authentication.kubernetes.io/pod-name:[jetstream-ch-controller-57c65d84fb-fw8td] authentication.kubernetes.io/pod-uid:[a05e83d6-0041-4728-b9d6-aab1ead4b94c]]}","admissionreview/uid":"f32148bf-86b0-465a-b048-133d866910d3","admissionreview/allowed":true,"admissionreview/result":"nil"}
{"level":"debug","ts":"2022-12-23T13:40:55.341Z","logger":"nats-webhook","caller":"webhook/admission.go:134","msg":"AdmissionReview patch={ type: JSONPatch, body: null }","knative.dev/kind":"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel","knative.dev/namespace":"viax","knative.dev/name":"internal-kne-trigger","knative.dev/operation":"UPDATE","knative.dev/resource":"messaging.knative.dev/v1alpha1, Resource=natsjetstreamchannels","knative.dev/subresource":"","knative.dev/userinfo":"{system:serviceaccount:knative-eventing:jetstream-ch-controller 1d7b066d-347a-4466-9374-fcf4d8529081 [system:serviceaccounts system:serviceaccounts:knative-eventing system:authenticated] map[authentication.kubernetes.io/pod-name:[jetstream-ch-controller-57c65d84fb-fw8td] authentication.kubernetes.io/pod-uid:[a05e83d6-0041-4728-b9d6-aab1ead4b94c]]}","admissionreview/uid":"f32148bf-86b0-465a-b048-133d866910d3","admissionreview/allowed":true,"admissionreview/result":"nil"}
2022/12/23 13:40:55 http: TLS handshake error from 172.212.1.64:50934: remote error: tls: bad certificate

Is it really a certificate problem? One strange thing is this message: AdmissionReview patch={ type: JSONPatch, body: null } in the last debug log before the error log remote error: tls: bad certificate

Any thoughts?

cc @dan-j @lionelvillard @zhaojizhuang

astelmashenko commented 1 year ago

It looks similar to https://github.com/tektoncd/triggers/issues/875 A few things I'm trying to understand:

  1. Does the webhook really do anything useful? I'm looking at the stub code and cannot work out what it does: https://github.com/knative-sandbox/eventing-natss/blob/release-1.3/pkg/webhook/controller.go
  2. Could it have happened because we added cert manager?
  3. I've checked this https://github.com/tektoncd/triggers/issues/875#issuecomment-997084884 and I see that the ValidatingWebhookConfiguration caBundle is the same as ca-cert.pem in the nats-webhook-certs secret, so that is not the cause then?

Another thing is this comment from webhook.Options:

    // SecretName is the name of k8s secret that contains the webhook
    // server key/cert and corresponding CA cert that signed them. The
    // server key/cert are used to serve the webhook and the CA cert
    // is provided to k8s apiserver during admission controller
    // registration.
    // If no SecretName is provided, then the webhook serves without TLS.
    SecretName string

and the webhook has the secret name hardcoded:

const (
    // Component is the name of this component and is used in logging and leader-election
    Component = "nats-webhook"

    // SecretName must match the name of the Secret created in the configuration.
    SecretName = "nats-webhook-certs"
)

We have mTLS and use Istio, so we do not need webhook TLS at all, but we cannot change that. Do we need to make it optional? Like NameFromEnv(), maybe have SecretNameFromEnv()?

astelmashenko commented 1 year ago

OK, we found a way to reproduce it:

  1. provision eventing jsm as normal (from https://github.com/knative-sandbox/eventing-natss/releases/download/knative-v1.3.5/eventing-jsm.yaml)
  2. observe that everything is working, e.g. create/delete a broker
  3. delete the nats-webhook-certs secret or its data
  4. see nats-webhook reconcile it successfully
  5. create e.g. a broker; now nats-webhook writes an error:
    {"level":"error","ts":"2022-12-27T13:24:11.453Z","logger":"nats-webhook.DefaultingWebhook","caller":"controller/controller.go:559","msg":"Reconcile error","knative.dev/traceid":"d2b98046-a51c-4912-aa27-c3ff928d9501","knative.dev/key":"defaulting.webhook.nats.messaging.knative.dev","duration":0.000133991,"error":"error retrieving webhook: mutatingwebhookconfiguration.admissionregistration.k8s.io \"defaulting.webhook.nats.messaging.knative.dev\" not found","stacktrace":"knative.dev/pkg/controller.(*Impl).handleErr\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:559\nknative.dev/pkg/controller.(*Impl).processNextWorkItem\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:536\nknative.dev/pkg/controller.(*Impl).RunContext.func3\n\tknative.dev/pkg@v0.0.0-20220301181942-2fdd5f232e77/controller/controller.go:484"}
    2022/12/27 13:36:29 http: TLS handshake error from 127.0.0.6:49317: remote error: tls: bad certificate

astelmashenko commented 1 year ago

After the above steps we cannot fix the bad certificate problem. We tried deleting pods and deployments and restarting the webhook and controller; nothing helps.

astelmashenko commented 1 year ago

@zhaojizhuang, @lionelvillard, do you have any ideas how to fix that? Is there any caching of certificates somewhere?

chris93111 commented 1 year ago

Hi @astelmashenko, I encountered this error too. Deleting and recreating the validating and mutating webhooks resolves the problem.

This forces the webhook certificate to be recreated.

astelmashenko commented 1 year ago

New findings, according to the logs:

{"level":"info","ts":"2022-12-28T16:32:44.280Z","logger":"nats-webhook","caller":"webhook/admission.go:90","msg":"Webhook ServeHTTP request=&http.Request{Method:\"POST\", URL:(*url.URL)(0xc000949830), Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Accept\":[]string{\"application/json, */*\"}, \"Accept-Encoding\":[]string{\"gzip\"}, \"Content-Length\":[]string{\"37445\"}, \"Content-Type\":[]string{\"application/json\"}, \"User-Agent\":[]string{\"kube-apiserver-admission\"}}, Body:(*http.body)(0xc0009ba980), GetBody:(func() (io.ReadCloser, error))(nil), ContentLength:37445, TransferEncoding:[]string(nil), Close:false, Host:\"nats-webhook.knative-eventing.svc:443\", Form:url.Values(nil), PostForm:url.Values(nil), MultipartForm:(*multipart.Form)(nil), Trailer:http.Header(nil), RemoteAddr:\"127.0.0.6:52401\", RequestURI:\"/defaulting?timeout=2s\", TLS:(*tls.ConnectionState)(0xc00083fc30), Cancel:(<-chan struct {})(nil), Response:(*http.Response)(nil), ctx:(*context.cancelCtx)(0xc0009ba9c0)}"}
{"level":"info","ts":"2022-12-28T16:32:44.297Z","logger":"nats-webhook","caller":"defaulting/defaulting.go:158","msg":"Kind: \"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel\" PatchBytes: null","knative.dev/kind":"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel","knative.dev/namespace":"viax","knative.dev/name":"internal-kne-trigger","knative.dev/operation":"UPDATE","knative.dev/resource":"messaging.knative.dev/v1alpha1, Resource=natsjetstreamchannels","knative.dev/subresource":"","knative.dev/userinfo":"{system:serviceaccount:knative-eventing:jetstream-ch-controller eabc9d9c-bfc6-4410-932f-3e37b5aa6b15 [system:serviceaccounts system:serviceaccounts:knative-eventing system:authenticated] map[authentication.kubernetes.io/pod-name:[jetstream-ch-controller-57c65d84fb-9h2c7] authentication.kubernetes.io/pod-uid:[d0e2ebd1-aa22-4700-823f-cea550500b29]]}"}
{"level":"info","ts":"2022-12-28T16:32:44.297Z","logger":"nats-webhook","caller":"webhook/admission.go:133","msg":"remote admission controller audit annotations=map[string]string(nil)","knative.dev/kind":"messaging.knative.dev/v1alpha1, Kind=NatsJetStreamChannel","knative.dev/namespace":"viax","knative.dev/name":"internal-kne-trigger","knative.dev/operation":"UPDATE","knative.dev/resource":"messaging.knative.dev/v1alpha1, Resource=natsjetstreamchannels","knative.dev/subresource":"","knative.dev/userinfo":"{system:serviceaccount:knative-eventing:jetstream-ch-controller eabc9d9c-bfc6-4410-932f-3e37b5aa6b15 [system:serviceaccounts system:serviceaccounts:knative-eventing system:authenticated] map[authentication.kubernetes.io/pod-name:[jetstream-ch-controller-57c65d84fb-9h2c7] authentication.kubernetes.io/pod-uid:[d0e2ebd1-aa22-4700-823f-cea550500b29]]}","admissionreview/uid":"61b4f1df-0e47-484c-98ce-c064e6cd9e68","admissionreview/allowed":true,"admissionreview/result":"nil"}
2022/12/28 16:32:44 http: TLS handshake error from 127.0.0.6:45837: remote error: tls: bad certificate

The webhook receives the request and the admissionHandler in admission.go is called. Does that mean the error happens when it tries to write the response back?

astelmashenko commented 1 year ago

One more thing: I'm able to reproduce it only on our working cluster, which runs version 1.21. It does not reproduce on a local minikube setup running 1.23.

astelmashenko commented 1 year ago

Oh god, I found the issue. There was a MutatingWebhookConfiguration left over from a previous installation of eventing-natss.yaml. Its name was webhook.nats.messaging.knative.dev, and it was later renamed to defaulting.webhook.nats.messaging.knative.dev.