Closed: jprzychodzen closed this issue 1 month ago
/sig api-machinery
Please follow the issue template correctly: https://github.com/kubernetes/kubernetes/blob/master/.github/ISSUE_TEMPLATE/bug-report.md
Including the k8s version is important.
Sure, it happens on the current K8s master branch - the exact commit is b0abe89ae259d5e891887414cb0e5f81c969c697
kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.0", GitCommit:"cb303e613a121a29364f75cc67d3d580833a7479", GitTreeState:"clean", BuildDate:"2021-04-08T16:31:21Z",
GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22+", GitVersion:"v1.22.0-alpha.0.30+b0abe89ae259d5-dirty", GitCommit:"b0abe89ae259d5e891887414cb0e5f81c969c697", GitTreeState:"dirty",
BuildDate:"2021-04-13T16:11:56Z", GoVersion:"go1.16.1", Compiler:"gc", Platform:"linux/amd64"}
--gcp-node-size=n1-standard-1
and with preset-e2e-scalability-common env variables
cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
GOOGLE_CRASH_ID=Lakitu
KERNEL_COMMIT_ID=9ca830b4d7ae9ff76f64f4f9f78a0a0b88dfcda4
VERSION=85
VERSION_ID=85
BUILD_ID=13310.1041.9
Linux e2e-test-jprzychodzen-master 5.4.49+ #1 SMP Wed Sep 23 19:45:38 PDT 2020 x86_64 Intel(R) Xeon(R) CPU @ 2.00GHz GenuineIntel GNU/Linux
/assign @yliaog
/cc @caesarxuchao @leilajal
/triage accepted
I think this is the same issue as reported in https://github.com/kubernetes/kubernetes/issues/90597
It might share a root cause - informer sync for GC should be non-blocking on CRDs (and possibly on other resources?).
I guess we would need some metrics about unsynced informers to handle this properly.
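For illustration, here is a minimal sketch (not the actual garbage collector code) of why one unsynced informer is enough to stall everything: client-go's cache.WaitForCacheSync is all-or-nothing, so an informer that can never sync - for example one backed by a CRD whose conversion webhook is down - keeps the whole call from ever returning true. The informer names and timings below are made up.

```go
// A minimal sketch, not the real GC controller code: cache.WaitForCacheSync
// waits for *all* of the given informers, so a single informer that can never
// sync (e.g. a CRD whose conversion webhook is broken) blocks the whole wait.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

func main() {
	stop := make(chan struct{})

	// Simulated informers: the "healthy" ones report their caches as synced,
	// the "broken" one never does because its list/watch keeps failing.
	healthy := cache.InformerSynced(func() bool { return true })
	broken := cache.InformerSynced(func() bool { return false })

	// Cancel the wait after a few seconds so the example terminates.
	go func() {
		time.Sleep(3 * time.Second)
		close(stop)
	}()

	// All-or-nothing: one broken informer holds every other resource hostage,
	// which is why a single bad CRD can stall garbage collection entirely.
	ok := cache.WaitForCacheSync(stop, healthy, healthy, broken)
	fmt.Println("all caches synced:", ok) // prints "all caches synced: false"
}
```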
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
This seems to be by design; see this duplicate: https://github.com/kubernetes/kubernetes/issues/96066#issuecomment-721836316
This is problematic for our environments as well. Unstable user-defined conversion webhooks break GC for unrelated resources; those unrelated resources eventually hit quota limits and render the environment unusable.
Is there a recommended approach to this from the community? One naive solution that comes to mind is a config option for marking a CRD as non-blocking for GC. Then GC would only respect blockOwnerDeletion in a best-effort fashion, for example. Admission webhooks could then block creation of any CRD that specifies a conversion webhook without marking the resource non-blocking (a rough sketch of this idea follows below).
Without this, it's hard to allow users to specify conversion webhooks, because k8s then takes a dependency on those services (which, in our case, already take a dependency on k8s).
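As a rough sketch of that admission idea (hypothetical - this is not an existing Kubernetes feature, and the annotation name is invented for illustration): a validating webhook could reject any CRD that configures webhook conversion unless it is explicitly marked as non-blocking for GC.

```go
// Hypothetical admission check, not an existing Kubernetes feature: reject
// CRDs that use webhook conversion unless they opt in to a (made-up)
// "non-blocking for GC" annotation.
package main

import (
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
)

// nonBlockingGCAnnotation is an invented annotation name for illustration.
const nonBlockingGCAnnotation = "example.com/gc-non-blocking"

func validateCRD(crd *apiextensionsv1.CustomResourceDefinition) error {
	usesWebhookConversion := crd.Spec.Conversion != nil &&
		crd.Spec.Conversion.Strategy == apiextensionsv1.WebhookConverter
	if usesWebhookConversion && crd.Annotations[nonBlockingGCAnnotation] != "true" {
		return fmt.Errorf("CRD %s uses a conversion webhook but is not annotated %s=true",
			crd.Name, nonBlockingGCAnnotation)
	}
	return nil
}

func main() {
	crd := &apiextensionsv1.CustomResourceDefinition{}
	crd.Name = "widgets.example.com" // placeholder CRD
	crd.Spec.Conversion = &apiextensionsv1.CustomResourceConversion{
		Strategy: apiextensionsv1.WebhookConverter,
	}
	fmt.Println(validateCRD(crd)) // rejected: conversion webhook without the opt-in annotation
}
```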
I think I'd push to make gc stop blocking on discovery or informer sync at all, and make blockOwnerDeletion even more best effort.
I'd like to stop honoring blockOwnerDeletion. :)
cc @tkashem
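For context on the blockOwnerDeletion field being discussed, a minimal sketch with placeholder values: it is a per-ownerReference flag, and with foreground deletion the owner is not removed while dependents carrying it still exist, which is part of why the GC controller wants a fully synced dependency graph before acting.

```go
// Placeholder values throughout; this only shows where blockOwnerDeletion lives.
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func main() {
	block := true
	ref := metav1.OwnerReference{
		APIVersion: "apps/v1",
		Kind:       "Deployment",
		Name:       "example-owner",
		UID:        types.UID("00000000-0000-0000-0000-000000000000"),
		Controller: &block,
		// With foreground deletion, the owner is not removed while dependents
		// carrying this flag still exist; making it "best effort" would relax
		// that guarantee when the GC's view of the cluster is incomplete.
		BlockOwnerDeletion: &block,
	}
	fmt.Printf("%+v\n", ref)
}
```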
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
Is there any ongoing work for this?
None that I know of. At first glance, removing the requirement that all informers be fully synced before GC starts/resumes seems reasonable to me, and would resolve this issue.
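A minimal sketch of that direction (an illustration only, not the real controller code): wait for each informer independently up to a deadline and report the subset that synced, so a broken CRD informer delays GC startup by at most the timeout instead of blocking it indefinitely.

```go
// A sketch of "start GC with whatever synced" rather than the current
// all-or-nothing wait; not the real controller code.
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// syncedSubset polls each informer until the deadline and returns the names of
// those whose caches synced in time; a caller could start garbage collection
// with just these and keep retrying the rest in the background.
func syncedSubset(timeout time.Duration, informers map[string]cache.InformerSynced) []string {
	deadline := time.Now().Add(timeout)
	pending := make(map[string]cache.InformerSynced, len(informers))
	for name, hasSynced := range informers {
		pending[name] = hasSynced
	}

	var ready []string
	for len(pending) > 0 && time.Now().Before(deadline) {
		for name, hasSynced := range pending {
			if hasSynced() {
				ready = append(ready, name)
				delete(pending, name)
			}
		}
		time.Sleep(100 * time.Millisecond)
	}
	return ready
}

func main() {
	// Fake informers standing in for real ones; the CRD-backed one never syncs.
	informers := map[string]cache.InformerSynced{
		"pods":                func() bool { return true },
		"deployments":         func() bool { return true },
		"widgets.example.com": func() bool { return false }, // broken conversion webhook
	}
	fmt.Println("ready after 2s:", syncedSubset(2*time.Second, informers))
}
```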
OK, I'll try to work out a patch
I am now working on this, will fix it soon.
Hi @tossmilestone,
What is the status of this? One year has passed. Is there any short term plan to fix this?
Thanks!
Sorry, I don't have the time right now to continue fixing this issue. If you're willing, you can help continue this work. Thank you!
@rauferna not likely in the short term. A quick workaround is to delete the conversion webhook when you find your CRD controller is not working, and add it back when it recovers. Or deploy multiple webhook replicas to avoid downtime as much as possible.
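A rough sketch of that workaround, assuming a kubeconfig at the default location and using "widgets.example.com" as a placeholder CRD name: while the webhook backend is down, patch the CRD's conversion strategy to None so its informer can sync again, then re-apply the original conversion settings once the backend recovers.

```go
// Sketch of the "temporarily disable the conversion webhook" workaround.
// Assumes a kubeconfig at the default path; the CRD name is a placeholder.
package main

import (
	"context"
	"fmt"

	clientset "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := clientset.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Drop the webhook converter so the informer for this CRD can sync again;
	// re-apply the original CRD manifest once the webhook backend is healthy.
	patch := []byte(`{"spec":{"conversion":{"strategy":"None","webhook":null}}}`)
	_, err = client.ApiextensionsV1().CustomResourceDefinitions().Patch(
		context.TODO(), "widgets.example.com", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("conversion webhook disabled for widgets.example.com")
}
```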
This issue has not been updated in over 1 year, and should be re-triaged.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
What happened:
Creating a CRD with a broken conversion webhook prevents the GC controller from initializing: it blocks on informer sync. Additionally, this issue is not visible until the GC controller restarts - dynamically added CRD resources with a non-working conversion webhook do not break an already-running GC.
What you expected to happen:
The GC controller should initialize with whatever informers are available. CRDs with a broken conversion webhook should not prevent the GC controller from working on other resources.
How to reproduce it (as minimally and precisely as possible):
gc-bug.zip