nmstate / kubernetes-nmstate

Declarative node network configuration driven through Kubernetes API.
GNU General Public License v2.0

webhook not ready #1255

Closed ianb-mp closed 2 weeks ago

ianb-mp commented 4 months ago

What happened:

When applying a NodeNetworkConfigurationPolicy immediately after installation, I sometimes see the error "certificate signed by unknown authority":

$ kubectl create -f nncp-br0_bne-lab-srv-6.yaml
Error from server (InternalError): error when creating "nncp-br0_bne-lab-srv-6.yaml": Internal error occurred: failed calling webhook "nodenetworkconfigurationpolicies-mutate.nmstate.io": failed to call webhook: Post "https://nmstate-webhook.nmstate.svc:443/nodenetworkconfigurationpolicies-mutate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "nmstate")

If I wait a few minutes and try again, it works without error. I see this in the nmstate-webhook pod log:

2024/07/18 05:44:51 http: TLS handshake error from 10.244.20.129:36716: remote error: tls: bad certificate
2024/07/18 05:44:56 http: TLS handshake error from 10.244.140.131:38434: EOF
2024/07/18 05:44:57 http: TLS handshake error from 10.244.20.129:36718: remote error: tls: bad certificate
2024/07/18 05:44:57 http: TLS handshake error from 10.244.140.131:38444: EOF
2024/07/18 05:44:57 http: TLS handshake error from 10.244.20.129:36722: EOF
{"level":"info","ts":"2024-07-18T05:45:15.120Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2024-07-18T05:45:15.120Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
2024/07/18 05:45:38 http: TLS handshake error from 10.244.20.129:38656: EOF
2024/07/18 05:45:38 http: TLS handshake error from 10.244.140.131:57550: EOF
2024/07/18 05:45:39 http: TLS handshake error from 10.244.140.131:57566: EOF
2024/07/18 05:45:39 http: TLS handshake error from 10.244.140.131:57578: remote error: tls: bad certificate
2024/07/18 05:46:11 http: TLS handshake error from 10.244.140.131:46770: EOF
{"level":"info","ts":"2024-07-18T05:46:16.236Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2024-07-18T05:46:16.236Z","logger":"controller-runtime.certwatcher","msg":"Updated current TLS certificate"}
{"level":"info","ts":"2024-07-18T05:47:01.137Z","logger":"webhook/nodenetworkconfigurationpolicy/mutator","msg":"webhook response: {Patches:[{Operation:add Path:/status/conditions Value:[map[lastHeartbeatTime:
{"level":"info","ts":"2024-07-18T05:47:01.186Z","logger":"webhook/nodenetworkconfigurationpolicy/mutator","msg":"webhook response: {Patches:[{Operation:replace Path:/metadata/annotations/nmstate.io~1webhook-mu
{"level":"info","ts":"2024-07-18T05:47:01.501Z","logger":"webhook/nodenetworkconfigurationpolicy/mutator","msg":"webhook response: {Patches:[{Operation:replace Path:/metadata/annotations/nmstate.io~1webhook-mu

So it looks like the webhook is still deploying. I tried adding a wait check before applying the policy, e.g.:

kubectl -n nmstate wait deploy nmstate-webhook --for condition=Available --timeout 300s

However this isn't reliable - the error still occurs sometimes. It would be good to have a way to test whether nmstate operator is fully ready before trying to apply policies.
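Until such a readiness check exists, the manual "wait and try again" step can be scripted as a stopgap. The following is a minimal sketch, assuming a POSIX shell; the `retry` helper and the attempt count and delay in the usage comment are hypothetical and not part of kubernetes-nmstate. It only re-runs the command until it succeeds, so it works around the webhook certificate race rather than detecting readiness.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or ATTEMPTS runs have failed.
# Hypothetical helper, not provided by kubernetes-nmstate.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example usage (values are illustrative, file name from the report above):
#   retry 30 10 kubectl create -f nncp-br0_bne-lab-srv-6.yaml
```

Because the API server rejects the request when the webhook call fails, a failed attempt should leave nothing behind, which is why re-running `kubectl create` is assumed to be safe here.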

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

qinqon commented 3 months ago

@ianb-mp you have to wait for all the pods under kubernetes-nmstate to be in the ready state before applying an NNCP.

ianb-mp commented 3 months ago

> @ianb-mp you have to wait for all the pods under kubernetes-nmstate to be in the ready state before applying an NNCP.

I have waited for all pods to be in ready state, and the error still occurs:

$ kubectl wait --for=condition=ready pod -n nmstate --all
pod/nmstate-cert-manager-5788576df8-rkknl condition met
pod/nmstate-handler-kwlwx condition met
pod/nmstate-handler-qfxg7 condition met
pod/nmstate-metrics-6889dd975d-58br9 condition met
pod/nmstate-operator-685cc75cd8-xwcc2 condition met
pod/nmstate-webhook-65447bb9f-5fkwz condition met
$ kubectl create -f nmstate.yaml 
Error from server (InternalError): error when creating "nmstate.yaml": Internal error occurred: failed calling webhook "nodenetworkconfigurationpolicies-mutate.nmstate.io": failed to call webhook: Post "https://nmstate-webhook.nmstate.svc:443/nodenetworkconfigurationpolicies-mutate?timeout=10s": tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of "crypto/rsa: verification error" while trying to verify candidate authority certificate "nmstate")

I wait ~60 seconds and try again, it works:

$ kubectl create -f nmstate.yaml 
nodenetworkconfigurationpolicy.nmstate.io/sriovpf-bne-lab-srv-6 created

P-n-I commented 3 months ago

I've also seen this issue as described above, and in a slightly different form where the handler pods cannot resolve the webhook service unless I give the NMState a bit of time to stand up. I haven't tested this yet, but if/when we use ArgoCD to deploy, I don't think we'll have a way to control how soon the NNCP resources are applied after the operator resources and the NMState are installed.

qinqon commented 2 months ago

The nmstate handler pods' readiness probe is certificate-aware, so we have to wait for them to be ready before applying an NNCP.

ianb-mp commented 2 months ago

> The nmstate handler pods' readiness probe is certificate-aware, so we have to wait for them to be ready before applying an NNCP.

The readiness probe should not report healthy until the certificate is ready. Then the user will know they can proceed. Am I misunderstanding?

qinqon commented 2 months ago

> > The nmstate handler pods' readiness probe is certificate-aware, so we have to wait for them to be ready before applying an NNCP.
>
> The readiness probe should not report healthy until the certificate is ready. Then the user will know they can proceed. Am I misunderstanding?

Yep, that's it. Ideally we would implement this check in the operator and expose some Status on the NMState CR, but we are not there yet.

ianb-mp commented 2 weeks ago

Is there a fix?