kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Creating a CRD with broken converter webhook prevents GC controller from initialization #101078

Open jprzychodzen opened 3 years ago

jprzychodzen commented 3 years ago

What happened:

Creating a CRD with a broken conversion webhook prevents the GC controller from initializing, because the controller blocks on informer sync. Additionally, the issue is not visible until the GC controller restarts - CRD resources added dynamically with a non-working conversion webhook do not break an already running GC controller.

What you expected to happen:

The GC controller should initialize with the informers that are available. A CRD with a broken conversion webhook should not prevent the GC controller from working on other resources.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Create a CRD (1_crd.yaml)
  3. Create a CR (2_crd.yaml)
  4. Change the CRD to add another version and a webhook converter (3_crd.yaml; see the sketch after the attachment below)
  5. Restart kube-controller-manager
  6. Add deployment
  7. Delete deployment
  8. Check that pod from deployment is not GC'ed

gc-bug.zip
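For illustration only (the actual manifests are in the attached gc-bug.zip, and the names here are made up): a minimal sketch of what the step 4 CRD looks like once it has a second version and a conversion webhook whose backend service is missing or down.

```yaml
# Hypothetical sketch, not the attached manifest: a two-version CRD whose
# conversion webhook points at a service that is unreachable, so conversions
# (and the GC informer for this resource) fail.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: foos
    singular: foo
    kind: Foo
  versions:
  - name: v1
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v2
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          # Assumed name/namespace; pointing at a service that is down or was
          # never created reproduces the "broken converter webhook" condition.
          name: crd-conversion-webhook
          namespace: default
          path: /convert
```

With the webhook backend unreachable, restarting kube-controller-manager in step 5 leaves the GC controller waiting for this resource's informer to sync, which is why the pod from steps 6-8 is never collected.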

neolit123 commented 3 years ago

/sig api-machinery

neolit123 commented 3 years ago

please follow the issue template correctly: https://github.com/kubernetes/kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug-report.md

including k8s version is important.

jprzychodzen commented 3 years ago

Sure, it happens on current K8s master branch - exact commit is b0abe89ae259d5e891887414cb0e5f81c969c697

fedebongio commented 3 years ago

/assign @yliaog
/cc @caesarxuchao @leilajal
/triage accepted

yliaog commented 3 years ago

i think this is the same issue as reported in https://github.com/kubernetes/kubernetes/issues/90597

jprzychodzen commented 3 years ago

It might share the same root cause - informer sync for GC should not block on CRDs (and possibly not on other resources either?).

I guess we would need some metrics about unsynced informers to handle this properly.

fejta-bot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

spencer-p commented 2 years ago

/remove-lifecycle stale

aojea commented 2 years ago

This seems to be this way by design, see this duplicate https://github.com/kubernetes/kubernetes/issues/96066#issuecomment-721836316

willdeuschle commented 2 years ago

This is problematic for our environments as well. Unstable user-defined conversion webhooks break GC for unrelated resources; those unrelated resources then pile up, eventually hit quota limits, and render the environment unusable.

Is there a recommended approach to this from the community? One naive solution that comes to mind is a config option for marking a CRD as non-blocking for GC. Then GC would only respect blockOwnerDeletion in a best effort fashion, for example. Admission webhooks could then block CRD creation that specified a conversion webhook without making the resource non-blocking.

Without this, it's hard to allow users to specify conversion webhooks because k8s then takes a dependency on those services (which in our case, already take a dependency on k8s).
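For readers unfamiliar with the field being discussed, blockOwnerDeletion lives inside a dependent object's metadata.ownerReferences; GC honors it during foreground deletion by keeping the owner around until blocking dependents are gone. A minimal, hypothetical example (names and UID are placeholders):

```yaml
# Hypothetical example of an ownerReference that sets blockOwnerDeletion.
# With foreground deletion of the ReplicaSet, GC must delete this Pod first.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-replicaset
    uid: 11111111-2222-3333-4444-555555555555   # placeholder UID
    controller: true
    blockOwnerDeletion: true
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```

Making that guarantee best effort (or dropping it, as discussed below) is what would allow GC to keep running even when some informers cannot sync.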

liggitt commented 2 years ago

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

deads2k commented 2 years ago

> I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

I'd like to stop honoring blockOwnerDeletion. :)

tkashem commented 2 years ago

cc @tkashem

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

haorenfsa commented 2 years ago

Is there any ongoing work for this?

liggitt commented 2 years ago

> Is there any ongoing work for this?

None that I know of. At first glance, removing the requirement that all informers be fully synced before GC starts/resumes seems reasonable to me, and would resolve this issue.

haorenfsa commented 2 years ago

OK, I'll try to work out a patch

tossmilestone commented 2 years ago

I am now working on this, will fix it soon.

rauferna commented 1 year ago

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

tossmilestone commented 1 year ago

> Hi @tossmilestone,
>
> What is the status of this? One year has passed. Is there any short term plan to fix this?
>
> Thanks!

Sorry, I don't have the time right now to continue fixing this issue. If you're willing, you can help continue this work. Thank you!

haorenfsa commented 1 year ago

@rauferna not likely in the short term. A quick workaround is to remove the conversion webhook from the CRD when you find the webhook backend is not working, and add it back once it recovers. Or run multiple replicas of the webhook to minimize its downtime as much as possible.
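As an illustration of that workaround (not from the original comment; the CRD name and patch file name are made up), the CRD's spec.conversion can be temporarily switched back to the default None strategy so the API server stops calling the broken webhook:

```yaml
# conversion-off-patch.yaml (hypothetical file name)
# JSON merge patch that disables the conversion webhook on the CRD.
# "webhook: null" removes the webhook client config, which may only be set
# while the strategy is Webhook.
# Apply with something like:
#   kubectl patch crd foos.example.com --type=merge --patch-file=conversion-off-patch.yaml
spec:
  conversion:
    strategy: None
    webhook: null
```

Note that with strategy: None the API server serves every version without converting the payload (only apiVersion changes), so this is only a stopgap until the webhook backend is healthy again and the patch can be reverted.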

k8s-triage-robot commented 2 weeks ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

seans3 commented 2 weeks ago

/triage accepted