jdef opened this issue 2 years ago
@shomron I think we discussed this at some point. Do you know if that discussion is documented anywhere?
Definitely came up in a community meeting once.
I think this was discussed https://github.com/open-policy-agent/gatekeeper/discussions/1451#discussioncomment-1062583 (as noted in community call notes) however it seems it was deleted.
Argh, migrated to the discussions project maybe?
We turned off discussions because of opa/feedback migration, however existing discussions couldn't be transferred. I took a screenshot of it now:
Thanks for digging up that thread. I've already added a couple of leader-election flags to my GK fork and will be testing soon. At some point, I'd like to nix the fork, though.
Depending on the complexity/behavioral cost, I see no reason not to add it in principle. Curious to see what the PR would look like. I think @shomron was the most skeptical of this.
Suggestions for a PR:

- a `leader-election` flag (bool, true to enable)
- a `leader-namespace` flag (namespace for leader leases, defaulting to "gatekeeper-system")
- with the `leader-election` flag enabled, use a lease named `gk-leader-12345` (or whatever) in the namespace given by the `leader-namespace` flag, using the "leases" resource lock

I'd love to submit a PR for this; sadly, there's a TON of corporate red tape in the way. The above changes yield a fairly small patch that's easy to test and backward compatible when `leader-election` is false.
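To make the suggestion concrete, here is a minimal stdlib sketch of how the two flags might be parsed (flag names and defaults come from the list above; `parseLeaderFlags` is a hypothetical helper, and in a real patch the parsed values would feed controller-runtime's `manager.Options` fields `LeaderElection`, `LeaderElectionNamespace`, and `LeaderElectionID`):

```go
package main

import (
	"flag"
	"fmt"
)

// leaderFlags mirrors the flags proposed above; in a real patch these values
// would be forwarded to controller-runtime's manager.Options
// (LeaderElection, LeaderElectionNamespace, LeaderElectionID).
type leaderFlags struct {
	enabled   bool
	namespace string
}

// parseLeaderFlags parses the two proposed flags from args, defaulting to
// disabled so existing deployments keep their current behavior.
func parseLeaderFlags(args []string) (leaderFlags, error) {
	fs := flag.NewFlagSet("gatekeeper", flag.ContinueOnError)
	cfg := leaderFlags{}
	fs.BoolVar(&cfg.enabled, "leader-election", false, "enable leader election")
	fs.StringVar(&cfg.namespace, "leader-namespace", "gatekeeper-system", "namespace for leader leases")
	if err := fs.Parse(args); err != nil {
		return leaderFlags{}, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseLeaderFlags([]string{"--leader-election=true"})
	if err != nil {
		panic(err)
	}
	fmt.Printf("leader-election=%v leader-namespace=%s\n", cfg.enabled, cfg.namespace)
}
```

Keeping the default at `false` preserves today's behavior, which is what makes the patch backward compatible.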
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
/reopen
> On Tue, Aug 30, 2022, 7:29 PM stale[bot] wrote:
> Closed #1794 as completed.
max-bot reopening.

@sozercan, what should users do if they want to re-open a bug?
Not stale
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
not stale
@jdef Would you like to start a design doc and then bring the discussion to one of the community calls to drive this?
this seems like a pretty trivial change, is a design doc really needed? otherwise i'm happy to join a call
-- James DeFelice
Not sure if a full-scale design doc is needed, but it would be good to know which controllers are in scope (it sounds like singleton controllers only, such as status and audit); the other controllers should keep running if we want hot standbys.
Also, it'd be good to avoid the "webhooks cannot run next to audit with leader election enabled" edge case. Is it possible to have leader election be selectively enforced?
Of course, this does mean that a pod will self-restart if it loses the leader position, which, if serving a webhook endpoint, might have some interesting impacts to endpoint behavior.
You mentioned this is for pods not running via deployments and in other bugs you mentioned pods running off-cluster. I'm curious what the story for GC-ing stale by-pod status objects is?
Lastly, is there a reason for the leader-election namespace to be different from G8r's namespace (by default, this is determined top-down and passed via an env variable)?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Adding some discussion at the behest of @ritazh , carrying over from a conversation in Slack.
We are dealing with very busy audit pods on our clusters. These pods chew up 2 CPUs continuously. It would be great if the audit process could somehow be broken into workers that could be scaled horizontally and use less CPU per-worker. This would allow more efficient bin packing and reduce cluster cost (gatekeeper virtually guarantees an additional node per cluster for us).
Splitting audit into workers would also ease node-replacement operations during maintenance and pod/node reorganization like Karpenter does.
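One possible shape for such a split, sketched minimally under the assumption that namespaces are the unit of batching (which this thread has not settled); `shardFor` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor assigns a unit of audit work (here, a namespace name) to one of
// n workers via a stable hash, so each worker audits a disjoint slice of
// the cluster without any coordination between workers.
func shardFor(key string, workers int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(workers))
}

func main() {
	for _, ns := range []string{"default", "kube-system", "payments"} {
		fmt.Printf("%s -> worker %d\n", ns, shardFor(ns, 3))
	}
}
```

A stable hash keeps each key's assignment consistent as long as the worker count is unchanged; rebalancing cleanly on scale-up would need something like consistent hashing.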
Thanks
Thanks for sharing your use case @chrissnell.
@maxsmythe @sozercan do we want to revisit this proposal?
If anyone is interested in this, please add 👍 and let us know if you want to help with the design and implementation.
Here's another issue with monolithic, singleton audit pods: they're labor-intensive to scale as a cluster scales. As a cluster gets bigger, you have to give this pod more resources. Unless you over-allocate it from the beginning, you're going to have to regularly check it to make sure that it's not resource-constrained. Breaking this into worker pods would make scaling effortless.
What is the scaling concern with singleton pods for large clusters? What resources become constrained and how does the process degrade once it becomes resource constrained?
Also curious about the CPU usage concern... audit should just take longer with less CPU allocation. Are you seeing some failure behavior with lower CPU allocations?
If it's the RAM usage for caching referential constraints, multiple pods won't help -- each pod will require the same RAM use as a singleton pod.
Otherwise, what is the unit of batching you are proposing?
**Describe the solution you'd like**
I'd like to deploy e.g. the "audit" component statically (not using k8s Deployments) in HA. It would be very convenient to leverage controller-runtime's leader-election capability for this. Enabling it via a CLI flag would be ideal.
**Anything else you would like to add:**
The program should probably crash if leader election AND webhooks were both enabled, since webhooks should not depend on the apiserver's leader election in an HA setting.
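A hedged sketch of that startup guard (`validateFlags` and the flag names are illustrative, not Gatekeeper's actual flags):

```go
package main

import (
	"errors"
	"fmt"
)

// validateFlags enforces the constraint above: refuse to start when leader
// election and webhook serving are both enabled, since webhook availability
// should not hinge on which replica holds the lease.
func validateFlags(leaderElection, serveWebhooks bool) error {
	if leaderElection && serveWebhooks {
		return errors.New("leader-election cannot be combined with webhook serving; run webhooks in a separate, non-elected deployment")
	}
	return nil
}

func main() {
	if err := validateFlags(true, true); err != nil {
		fmt.Println("startup refused:", err)
	}
}
```

Failing fast at startup makes the misconfiguration visible immediately, rather than surfacing later as webhook unavailability when a replica loses the lease.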
**Environment:**
- Kubernetes version (use `kubectl version`): 1.21.7