volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

failed calling webhook "validatejob.volcano.sh"--context deadline exceeded #3358

Open GhangZh opened 8 months ago

GhangZh commented 8 months ago

What happened: The Volcano webhook often reports the following error:

Internal error occurred: failed calling webhook "validatejob.volcano.sh": failed to call webhook: Post "https://volcano-admission-service.volcano-system.svc:443/jobs/validate?timeout=10s": context deadline exceeded

What you expected to happen: No error

Environment:

Monokaix commented 8 months ago

Actually, it's not a bug. You should check whether the webhook is running and check the connection between kube-apiserver and the webhook pod.
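As a rough way to act on this advice, the sketch below classifies a plain TCP connection attempt, so that a timeout (typical of dropped packets or broken Service routing) can be told apart from a refused connection (nothing listening on the port). It is only an illustration: `probe_tcp` is a hypothetical helper, not part of Volcano, and the result only says something about the apiserver's path if the script is run from a node where kube-apiserver runs. The default timeout mirrors the `timeout=10s` in the error above.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 10.0) -> str:
    """Classify a TCP connection attempt.

    Returns one of: 'ok', 'timeout', 'refused', 'dns', 'error'.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        # No answer within the deadline, e.g. packets silently dropped.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.gaierror:
        # The host name could not be resolved.
        return "dns"
    except OSError:
        # Any other socket-level failure (unreachable network, etc.).
        return "error"
```

For example, calling `probe_tcp("<cluster-ip>", 443)` with the ClusterIP of `volcano-admission-service` (a placeholder here) and seeing `"timeout"` would point at the network path rather than at the webhook process itself. A successful TCP connection does not prove that the TLS handshake or the webhook handler works.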

GhangZh commented 8 months ago

> Actually, it's not a bug. You should check whether the webhook is running and check the connection between kube-apiserver and the webhook pod.

I checked that volcano-admission is running, and only a few errors were reported. [image]

Monokaix commented 8 months ago

> > Actually, it's not a bug. You should check whether the webhook is running and check the connection between kube-apiserver and the webhook pod.
>
> I checked that volcano-admission is running, and only a few errors were reported. [image]

It seems that the server didn't start successfully. You should check whether the TLS certificate is signed correctly.

wangyysde commented 6 months ago

> It seems that the server didn't start successfully. You should check whether the TLS certificate is signed correctly.

I got the same errors in the past. No, the server was running successfully with a self-signed certificate. To fix this error, I think we should do the following:

  1. Add the CA generated by Volcano to the trusted CA files on the nodes where kube-apiserver is running.
  2. Restart all kube-apiserver pods in the cluster.

PS: Can Volcano automatically add its CA to the trusted CAs on the nodes when we deploy Volcano? If so, how?

Monokaix commented 6 months ago

kube-apiserver uses the mutatingwebhookconfiguration/validatingwebhookconfiguration to access the admission server, and that configuration already includes the CA bundle generated by Volcano :)
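To confirm this on a given cluster, one can inspect the `caBundle` field under `clientConfig` of each webhook entry. The sketch below assumes the configuration has been exported as JSON (for example with `kubectl get validatingwebhookconfiguration <name> -o json`, where `<name>` is whatever your installation uses) and loaded into a dict. `ca_bundle_report` is a hypothetical helper name, and the check is deliberately shallow: it only verifies that the field decodes to something that looks like a PEM certificate, not that the certificate matches the one the webhook serves.

```python
import base64


def ca_bundle_report(webhook_config: dict) -> dict:
    """Map each webhook name to True if its caBundle decodes to PEM certificate data."""
    report = {}
    for hook in webhook_config.get("webhooks", []):
        # caBundle is a base64-encoded PEM bundle in the JSON representation.
        bundle = hook.get("clientConfig", {}).get("caBundle", "")
        try:
            pem = base64.b64decode(bundle, validate=True)
        except (ValueError, TypeError):
            # Not valid base64: treat as a missing bundle.
            pem = b""
        name = hook.get("name", "<unnamed>")
        report[name] = b"-----BEGIN CERTIFICATE-----" in pem
    return report
```

If every webhook maps to `True`, adding the CA to the node trust store should not be needed, which matches the point above. A `False` entry would suggest that the init job did not patch the configuration.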

thincal commented 5 months ago

@GhangZh I have encountered the same issue. Have you solved it yet?

googs1025 commented 5 months ago

I suddenly thought of a possibility. Could it be a network problem between the volcano-admission pod and other volcano pods?

root@VM-16-7-ubuntu:~# kubectl get pods -nvolcano-system -owide
NAME                                   READY   STATUS      RESTARTS   AGE   IP         NODE               NOMINATED NODE   READINESS GATES
volcano-admission-7f4fcd89b4-758h5     1/1     Running     0          56s   10.6.1.3   cluster1-worker    <none>           <none>
volcano-admission-init-7tvzn           0/1     Completed   0          56s   10.6.2.3   cluster1-worker2   <none>           <none>
volcano-controllers-6fb4668949-jpk7j   1/1     Running     0          56s   10.6.2.2   cluster1-worker2   <none>           <none>
volcano-scheduler-7f6f746f98-2xvk8     1/1     Running     0          56s   10.6.1.2   cluster1-worker    <none>           <none>

thincal commented 5 months ago

I have also noticed a large number of errors in the kube-proxy running on the same node as the kube-apiserver:

I0618 03:39:48.050278       1 proxier.go:854] "Sync failed" retryingTime="30s"
E0618 03:40:18.202297       1 proxier.go:1546] "Failed to execute iptables-restore" err=<
    exit status 4: iptables-restore v1.8.7 (nf_tables): 
    line 2080: CHAIN_USER_DEL failed (Device or resource busy): chain KUBE-SEP-XXX

So is it possible that this is the root cause of this issue?
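One way to gauge how often kube-proxy fails to sync is to count these failures in a saved log and list the chains it could not delete. The sketch below is only an aid for that: it assumes the log text has already been saved (for example from `kubectl logs` on the kube-proxy pod), that the lines have the same shape as the excerpt above, and `summarize_kube_proxy_log` is a hypothetical helper name.

```python
import re

# Patterns taken from the log excerpt above; other kube-proxy versions may word these differently.
FAILED_RESTORE = re.compile(r'"Failed to execute iptables-restore"')
BUSY_CHAIN = re.compile(
    r"CHAIN_USER_DEL failed \(Device or resource busy\): chain (\S+)"
)


def summarize_kube_proxy_log(text: str) -> dict:
    """Count failed iptables-restore runs and collect the chains reported as busy."""
    lines = text.splitlines()
    failed = sum(1 for line in lines if FAILED_RESTORE.search(line))
    chains = sorted(
        {m.group(1) for line in lines for m in BUSY_CHAIN.finditer(line)}
    )
    return {"failed_restores": failed, "busy_chains": chains}
```

If the failures recur on every sync, the Service rules on that node may be stale, which could plausibly explain requests to the webhook Service timing out. That remains a hypothesis to verify, not a confirmed cause.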