volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

When the volcano-admission pod is not running, creating other pods can fail #3734

Open lengrongfu opened 1 month ago

lengrongfu commented 1 month ago

Description

When the volcano-admission pod crashes, I can no longer create other pods in the cluster.

Steps to reproduce the issue

  1. Install Volcano with helm install.
  2. Scale the volcano-admission deployment to 0 replicas to simulate a volcano-admission pod crash:
    $ kubectl -n volcano scale deployment volcano-admission --replicas 0
  3. Run a pod:
    $ kubectl run nginx --image=nginx

Describe the results you received and expected

Received results:

root@ubuntu:~# kubectl run nginx --image=nginx
Error from server (InternalError): Internal error occurred: failed calling webhook "mutatepod.volcano.sh": failed to call webhook: Post "https://volcano-admission-service.volcano.svc:443/pods/mutate?timeout=10s": no endpoints available for service "volcano-admission-service"

Expected results: the pod is created successfully.

What version of Volcano are you using?

1.9.0

Any other relevant information

No response

lengrongfu commented 1 month ago

/assign

lengrongfu commented 1 month ago

I have two ideas:

  1. Modify the failurePolicy field of the webhook.
  2. Add a unique label to the pods Volcano should handle, and have the webhook select pods by that label.
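
A minimal sketch of the second idea, assuming the standard objectSelector field of a webhook entry is used. The label key and value are hypothetical and are not defined by Volcano. The webhook name is taken from the error message above.

```yaml
# Fragment of one entry under `webhooks:` in the MutatingWebhookConfiguration;
# other required fields are omitted.
- name: mutatepod.volcano.sh
  objectSelector:
    matchLabels:
      # Hypothetical label: only pods that carry it are sent to the webhook.
      example.volcano.sh/admission: "enabled"
```

With a selector like this, pods without the label skip the webhook entirely, so they can still be created while volcano-admission is down.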

googs1025 commented 1 month ago

I think pod creation failing after a webhook crash is a common problem with admission webhooks. If you don't want other pods in the cluster to be affected, you can modify the failurePolicy field of the webhook. Refer to: https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/
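
As a rough illustration of that field, a webhook configuration with failurePolicy set to Ignore could look like the sketch below. The webhook name, service, namespace, path, port, and timeout come from the error message in the issue description. The metadata name, sideEffects value, and rules are assumptions and may differ from what the Volcano chart installs.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: volcano-admission-service-pods-mutate   # assumed name
webhooks:
  - name: mutatepod.volcano.sh
    # Ignore: if the webhook cannot be reached, the API server admits the
    # request without calling it. The default, Fail, rejects the request.
    failurePolicy: Ignore
    timeoutSeconds: 10
    admissionReviewVersions: ["v1"]
    sideEffects: None                            # assumed value
    clientConfig:
      service:
        name: volcano-admission-service
        namespace: volcano
        path: /pods/mutate
        port: 443
      # caBundle omitted in this sketch
    rules:                                       # assumed rule set
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
```

The trade-off is that while the webhook is unavailable, pods meant for Volcano are also admitted without being mutated.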

Monokaix commented 1 month ago

The first solution, changing failurePolicy to Ignore, is OK.

lowang-bh commented 1 month ago

You can also disable the webhooks you don't need. Just remove them from enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/pods/validate,/pods/mutate,/queues/mutate,/queues/validate"
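
For example, a Helm values override that drops the two pod webhooks might look like the sketch below. It assumes the chart exposes the setting as custom.enabled_admissions, so check the nesting against the values.yaml of the chart version in use.

```yaml
# Hypothetical values override; the `custom` nesting is an assumption.
custom:
  # /pods/mutate and /pods/validate removed from the list quoted above.
  enabled_admissions: "/jobs/mutate,/jobs/validate,/podgroups/mutate,/queues/mutate,/queues/validate"
```

This removes the pod webhooks completely, so it only suits clusters that do not rely on Volcano's pod mutation and validation.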

lengrongfu commented 1 month ago

I don't think the solution of configuring failurePolicy=Ignore is very good. I suggest that we can configure matchConditions. I have verified that it works well.

  matchConditions:
  - expression: object.spec.schedulerName == 'volcano'
    name: scheduler  

googs1025 commented 1 month ago

I think the key question is whether we need to set this feature as the default configuration for helm installation. Is this what you mean?

lengrongfu commented 1 month ago

> I think the key question is whether we need to set this feature as the default configuration for helm installation. Is this what you mean?

Yes.

Monokaix commented 1 month ago

> I don't think the solution of configuring failurePolicy=Ignore is very good. I suggest that we can configure matchConditions. I have verified that it works well.
>
>   matchConditions:
>   - expression: object.spec.schedulerName == 'volcano'
>     name: scheduler

That's OK with me, but changing failurePolicy to Ignore is also needed :)
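
Put together, the two settings could sit on the same webhook entry, roughly as sketched below. The expression is the one proposed earlier in the thread. Whether matchConditions is usable depends on the cluster version: as far as I know the field is enabled by default from Kubernetes 1.28 and stable in 1.30.

```yaml
# Fragment of one entry under `webhooks:`; other required fields are omitted.
- name: mutatepod.volcano.sh
  # Volcano-scheduled pods are still admitted if the webhook is unreachable.
  failurePolicy: Ignore
  # Pods that use any other scheduler never reach the webhook.
  matchConditions:
    - name: scheduler
      expression: object.spec.schedulerName == 'volcano'
```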

Monokaix commented 1 month ago

/good-first-issue

volcano-sh-bot commented 1 month ago

@Monokaix: This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
