thesuperzapper opened 5 months ago
@realshuting @aslafy-z @eddycharly @chipzoller @JimBugwadia sorry to tag you all again, but I think this is a serious issue for Kyverno that needs your input.
I think it will need a short-term fix for 1.12 (to ensure that the generate rules remain usable), and a longer-term fix (possibly removing UpdateRequests) to make Kyverno more scalable and establish it as a "generic controller".
After looking into this, the issue seems to be UpdateRequest magnification. Why it's happening isn't clear. If you isolate the policy to include just a single rule, only a single UR is created per destination Namespace (so two total: one for target-namespace-1, one for target-namespace-2). When you add the second rule, this multiplies to 4 per ConfigMap per Namespace, and so on. It clearly appears to be a bug. Creating multiple generate rules in the same policy with effectively the exact same scope is fairly uncommon. I was able to confirm that isolating one rule per policy solves the issue, and only one UR per ConfigMap per Namespace is created as expected.
This is clearly something that needs to be investigated; it just may have to be a patch release.
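To illustrate the per-rule split described above, here is a minimal sketch (hypothetical policy and resource names, not the actual policy from this issue); one such single-rule ClusterPolicy would exist per generated ConfigMap:

# Hypothetical example only: one single-rule ClusterPolicy per generated ConfigMap.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-configmap-1
spec:
  generateExistingOnPolicyUpdate: true
  rules:
    - name: generate-configmap-1
      match:
        any:
          - resources:
              kinds:
                - Namespace
              names:
                - target-namespace-1
                - target-namespace-2
      generate:
        apiVersion: v1
        kind: ConfigMap
        name: generated-configmap-1
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        data:
          data:
            example-key: example-value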
We ran into this issue when using a policy to sync the image pull secret to all new namespaces. In one of our clusters it created thousands of UpdateRequests until it started crashing the API server. I had to stop all Kyverno pods, and then it took me several hours running:
kg updaterequests -A | awk '{print $2 " -n " $1}' | xargs -L 1 -r -P25 kubectl delete updaterequests
to recover the cluster.
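For anyone else hitting this, a simpler bulk cleanup should also work on recent kubectl versions (assuming it is acceptable to drop every pending UpdateRequest, and that the Kyverno controllers are scaled down first so new URs are not created while it runs):

# Delete every UpdateRequest in all namespaces (scale Kyverno down first).
kubectl delete updaterequests --all --all-namespaces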
The policy looks something like this:
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: sync-pull-secret
  annotations:
    policies.kyverno.io/title: Sync Pull Secret
    policies.kyverno.io/category: Generate
    policies.kyverno.io/subject: Secret
    policies.kyverno.io/minversion: 1.6.0
    policies.kyverno.io/description: >-
      Syncs pull-secret from platform namespace to all labeled namespaces and istio namespaces
spec:
  background: true
  generateExistingOnPolicyUpdate: true
  rules:
    - name: sync-pull-secret
      match:
        any:
          - resources:
              kinds:
                - Namespace
              selector:
                matchLabels:
                  my-label: my-value
          - resources:
              kinds:
                - Namespace
              names:
                - istio-system
                - istio-ingress
      generate:
        apiVersion: v1
        kind: Secret
        name: pull-secret
        namespace: "{{request.object.metadata.name}}"
        synchronize: true
        clone:
          namespace: platform
          name: pull-secret
      exclude:
        resources:
          kinds:
            - Namespace
          names:
            - platform
@chipzoller @realshuting do we know if this will be fixed in Kyverno 1.12.0? And do we know the root cause? Because it's a pretty serious issue that will keep me on 1.10.0 (which still has a less intense version of the problem).
> If you isolate the policy to include just a single rule, only a single UR is created per destination Namespace (so two total: one for target-namespace-1, one for target-namespace-2). When you add the second rule, this multiplies to 4 per ConfigMap per Namespace, and so on.

With 1.11+, a UR is created per rule per matching resource. If your policy has a single rule and two matching resources (target-namespace-1 and target-namespace-2), 2 URs are created. When you add the second rule, another two URs are created, one for each namespace.

> I was able to confirm that isolating one rule per policy solves the issue and only one UR per ConfigMap per Namespace is created as expected.

The total number of URs is fixed and determined by the number of rules and matching resources (for example, 2 rules × 2 matching namespaces = 4 URs), no matter how you build the policy: one policy with two rules, or two policies each with a single rule, yields the same number of URs. So why does splitting solve the issue?
Hi @thesuperzapper - we have released v1.12.4-rc.2 to reduce the UR creations, specifically when generateExisting=true (this also applies to BACKGROUND_SCAN_INTERVAL).
Could you help us verify if this reduces your overall updaterequests count?
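For reference, one quick way to compare the overall count before and after upgrading (assuming kubectl access to the cluster) is:

# Count UpdateRequests across all namespaces.
kubectl get updaterequests --all-namespaces --no-headers | wc -l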
Kyverno Version
1.11.4
Description
Introduction
There has been a significant decrease in the performance of generate type policies since Kyverno 1.10.0. For example, a ClusterPolicy that generates 18 ConfigMaps (across 2 namespaces) used to take ~1min and generate 60 UpdateRequests in 1.10.0, but takes ~3min and generates over 150 UpdateRequests in 1.11.4.
But even aside from this regression, I think there is a larger issue with how Kyverno is implementing generate rules, which is making it not scale properly (see conclusion for more info).
Testing
In an attempt to understand when this issue happened, I have tested many versions, but have found that all perform significantly worse than 1.10.0.
Here is some information about the testing:
- Cluster image: rancher/k3s:v1.26.12-k3s1
Test Results:
- main @ dd46f9eaf0342a0b617f1bf2b328b21ff14e14e9:
- 1.11.4:
- 1.10.6:
- 1.10.1:
- 1.10.0:
Test Notes:
- Even in 1.10.0, something seems wrong. How is it possible that so many UpdateRequests are required to create 18 ConfigMaps?
- The UpdateRequests are created on every BACKGROUND_SCAN_INTERVAL sync, in addition to on first apply with generateExisting.
Conclusion
.Conclusion
There must be something fundamentally wrong with our current generate policy implementation (especially in newer versions, but also in 1.10.0). It doesn't make a lot of sense that hundreds of UpdateRequests and multiple minutes are required to create such a small number of resources. Other controllers regularly create/update hundreds of resources in less time and with minimal cluster load.
I have two main requests for the Kyverno project:
- In the short term, investigate why generate rules create so many UpdateRequests, and reduce this to at most one per actual create/update (to allow users to update past 1.10.0).
- In the longer term, consider a larger fix (possibly removing UpdateRequests) to make Kyverno more scalable.
Personal Notes
I think there is a lot of demand for a "generic controller" that can create arbitrary resources based on user-defined conditions. Kyverno is well-positioned to become the standard here IF we resolve these performance issues.
As a possible solution, I suggest we look at how other controllers generate resources. In my experience, most use "WATCH" requests to efficiently detect changes, so they only have to do a "full reconcile" each time they boot up, and none use anything like our UpdateRequest (except for slow external dependencies, like cert-manager's CertificateRequest).
Appendix
Target Namespaces
Generate Policy 1