kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Creating a CRD with broken converter webhook prevents GC controller from initialization #101078

Open jprzychodzen opened 3 years ago

jprzychodzen commented 3 years ago

What happened:

Creating a CRD with a broken conversion webhook prevents the GC controller from initializing, because the controller blocks on informer sync. Additionally, the issue is not visible until the GC controller restarts - CRD resources added dynamically with a non-working conversion webhook do not break an already running GC controller.

What you expected to happen:

The GC controller should initialize with the informers that are available. A CRD with a broken conversion webhook should not prevent the GC controller from working on other resources.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster
  2. Create a CRD (1_crd.yaml)
  3. Create a CR (2_crd.yaml)
  4. Change the CRD to add another version and a webhook converter (3_crd.yaml; see the sketch after the attachment below)
  5. Restart kube-controller-manager
  6. Add deployment
  7. Delete deployment
  8. Check that pod from deployment is not GC'ed

gc-bug.zip
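For illustration only (the actual manifests are in the attached gc-bug.zip, and the names here are made up): a minimal sketch of what the step 4 CRD looks like once it has a second version and a conversion webhook whose backend service is missing or down.

```yaml
# Hypothetical sketch, not the attached manifest: a two-version CRD whose
# conversion webhook points at a service that is unreachable, so conversions
# (and the GC informer for this resource) fail.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: foos.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    plural: foos
    singular: foo
    kind: Foo
  versions:
  - name: v1
    served: true
    storage: false
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  - name: v2
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        x-kubernetes-preserve-unknown-fields: true
  conversion:
    strategy: Webhook
    webhook:
      conversionReviewVersions: ["v1"]
      clientConfig:
        service:
          # Assumed name/namespace; pointing at a service that is down or was
          # never created reproduces the "broken converter webhook" condition.
          name: crd-conversion-webhook
          namespace: default
          path: /convert
```

With the webhook backend unreachable, restarting kube-controller-manager in step 5 leaves the GC controller waiting for this resource's informer to sync, which is why the pod from steps 6-8 is never collected.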

neolit123 commented 3 years ago

/sig api-machinery

neolit123 commented 3 years ago

please follow the issue template correctly: https://github.com/kubernetes/kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug-report.md

including k8s version is important.

jprzychodzen commented 3 years ago

Sure, it happens on current K8s master branch - exact commit is b0abe89ae259d5e891887414cb0e5f81c969c697

fedebongio commented 3 years ago

/assign @yliaog
/cc @caesarxuchao @leilajal
/triage accepted

yliaog commented 3 years ago

i think this is the same issue as reported in https://github.com/kubernetes/kubernetes/issues/90597

jprzychodzen commented 3 years ago

It might share the same root cause - informer sync for GC should not block on CRDs (and possibly not on other resources either?).

I guess we would need some metrics about unsynced informers to handle this properly.

fejta-bot commented 2 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale

spencer-p commented 2 years ago

/remove-lifecycle stale

aojea commented 2 years ago

This seems to be this way by design, see this duplicate https://github.com/kubernetes/kubernetes/issues/96066#issuecomment-721836316

willdeuschle commented 2 years ago

This is problematic for our environments as well. Unstable user-defined conversion webhooks break GC for unrelated resources; those unrelated resources then pile up, eventually hit quota limits, and render the environment unusable.

Is there a recommended approach to this from the community? One naive solution that comes to mind is a config option for marking a CRD as non-blocking for GC. Then GC would only respect blockOwnerDeletion in a best effort fashion, for example. Admission webhooks could then block CRD creation that specified a conversion webhook without making the resource non-blocking.

Without this, it's hard to allow users to specify conversion webhooks because k8s then takes a dependency on those services (which in our case, already take a dependency on k8s).
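For readers unfamiliar with the field being discussed, blockOwnerDeletion lives inside a dependent object's metadata.ownerReferences; GC honors it during foreground deletion by keeping the owner around until blocking dependents are gone. A minimal, hypothetical example (names and UID are placeholders):

```yaml
# Hypothetical example of an ownerReference that sets blockOwnerDeletion.
# With foreground deletion of the ReplicaSet, GC must delete this Pod first.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
  namespace: default
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet
    name: example-replicaset
    uid: 11111111-2222-3333-4444-555555555555   # placeholder UID
    controller: true
    blockOwnerDeletion: true
spec:
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9
```

Making that guarantee best effort (or dropping it, as discussed below) is what would allow GC to keep running even when some informers cannot sync.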

liggitt commented 2 years ago

I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

deads2k commented 2 years ago

> I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.

I'd like to stop honoring blockOwnerDeletion. :)

tkashem commented 2 years ago

cc @tkashem

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

haorenfsa commented 2 years ago

Is there any ongoing work for this?

liggitt commented 2 years ago

> Is there any ongoing work for this?

None that I know of. At first glance, removing the requirement that all informers be fully synced before GC starts/resumes seems reasonable to me, and would resolve this issue.

haorenfsa commented 2 years ago

OK, I'll try to work out a patch

tossmilestone commented 2 years ago

I am now working on this, will fix it soon.

rauferna commented 1 year ago

Hi @tossmilestone,

What is the status of this? One year has passed. Is there any short term plan to fix this?

Thanks!

tossmilestone commented 1 year ago

> Hi @tossmilestone,
>
> What is the status of this? One year has passed. Is there any short term plan to fix this?
>
> Thanks!

Sorry, I don't have the time right now to continue fixing this issue. If you're willing, you can help continue this work. Thank you!

haorenfsa commented 1 year ago

@rauferna not likely in the short term. A quick workaround is to remove the conversion webhook from the CRD when you find the webhook backend is not working, and add it back once it recovers. Or run multiple replicas of the webhook to minimize its downtime as much as possible.
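As an illustration of that workaround (not from the original comment; the CRD name and patch file name are made up), the CRD's spec.conversion can be temporarily switched back to the default None strategy so the API server stops calling the broken webhook:

```yaml
# conversion-off-patch.yaml (hypothetical file name)
# JSON merge patch that disables the conversion webhook on the CRD.
# "webhook: null" removes the webhook client config, which may only be set
# while the strategy is Webhook.
# Apply with something like:
#   kubectl patch crd foos.example.com --type=merge --patch-file=conversion-off-patch.yaml
spec:
  conversion:
    strategy: None
    webhook: null
```

Note that with strategy: None the API server serves every version without converting the payload (only apiVersion changes), so this is only a stopgap until the webhook backend is healthy again and the patch can be reverted.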

k8s-triage-robot commented 2 weeks ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

seans3 commented 2 weeks ago

/triage accepted