open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes
https://open-policy-agent.github.io/gatekeeper/
Apache License 2.0
3.61k stars 739 forks source link

default webhooks resources * cause tons of unnecessary traffic #545

Open grosser opened 4 years ago

grosser commented 4 years ago

we summed up the kinds our rules use and applied them as resources ... traffic and cpu usaged dropped substantially :)

Screen Shot 2020-04-08 at 3 31 08 PM

issue is here: https://github.com/open-policy-agent/gatekeeper/blob/176f5d531096e92f93ef63844c48f66bc7dde53d/chart/gatekeeper-operator/templates/gatekeeper.yaml#L575

maxsmythe commented 4 years ago

Definitely worth mentioning the '*' in the README as a potential optimization.

We wouldn't want to have anything other than '*' in the default VWH config, as requiring users to modify two configs to get a constraint working is confusing and poor UX. As an example, there was considerable user confusion around needing to set up syncing for audit to work.

Re: dynamically setting the kind matcher...

Requiring the user to set kind selectors is not desirable as kind should be no more or less special than any other sort of matching criteria.

It could be argued that templates should project what kind they are interested in. However, some templates are relevant to all kinds, K8sRequiredLabels being the classic example. This opens the user up to surprising consequences. If a user were to add a high-traffic kind, their QPS would spike, potentially breaking their cluster. Defaulting to checking everything avoids these spikes at the cost of higher baseline QPS.

Unless the default CPU/RAM usage required to handle peak traffic is lowered considerably in an absolute sense (i.e. relative to total cluster footprint rather than, say a 50% reduction from current GK resource usage levels), then the usability costs to the user, the QPS trap (which could cause cluster downtime once we go fail closed), and the extra programmatic complexity don't seem worth it.

1 CPU doesn't seem too expensive to my eyes. I could see a 512 MB of RAM turning heads, though that's mostly due to caching and not request load.

Because the current state of affairs doesn't preclude this VWH config optimization and because of the above cost/benefit analysis or trying to do something fancier coupled with the low current costs, I'd prefer we start adding a "tuning" section to the Gatekeeper README. There are other things we could add there too, such as reducing audit frequency.

grosser commented 4 years ago

Sounds good.

So far we always ran into trouble when not setting kinds because things like events/endpoints etc usually don't match the rules anyway.

On Thu, Apr 9, 2020 at 1:03 PM Max Smythe notifications@github.com wrote:

Definitely worth mentioning the '*' in the README as a potential optimization.

We wouldn't want to have anything other than '*' in the default VWH config, as requiring users to modify two configs to get a constraint working is confusing and poor UX. As an example, there was considerable user confusion around needing to set up syncing for audit to work.

Re: dynamically setting the kind matcher...

Requiring the user to set kind selectors is not desirable as kind should be no more or less special than any other sort of matching criteria.

It could be argued that templates should project what kind they are interested in. However, some templates are relevant to all kinds, K8sRequiredLabels being the classic example. This opens the user up to surprising consequences. If a user were to add a high-traffic kind, their QPS would spike, potentially breaking their cluster. Defaulting to checking everything avoids these spikes at the cost of higher baseline QPS.

Unless the default CPU/RAM usage required to handle peak traffic is lowered considerably in an absolute sense (i.e. relative to total cluster footprint rather than, say a 50% reduction from current GK resource usage levels), then the usability costs to the user, the QPS trap (which could cause cluster downtime once we go fail closed), and the extra programmatic complexity don't seem worth it.

1 CPU doesn't seem too expensive to my eyes. I could see a 512 MB of RAM turning heads, though that's mostly due to caching and not request load.

Because the current state of affairs doesn't preclude this VWH config optimization and because of the above cost/benefit analysis or trying to do something fancier coupled with the low current costs, I'd prefer we start adding a "tuning" section to the Gatekeeper README. There are other things we could add there too, such as reducing audit frequency.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/open-policy-agent/gatekeeper/issues/545#issuecomment-611726633, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAACYZ444HJZSRIFGXDRK3DRLYS23ANCNFSM4MEJB5ZQ .

maxsmythe commented 4 years ago

Yeah, I think we've discovered a best practice for templates where we perform a test to make sure we are executing against an appropriate group/kind before triggering a violation.

tspearconquest commented 2 years ago

Hi, would it be possible to get that tuning section added to the readme? Thanks in advance.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

tspearconquest commented 2 years ago

Please don't let this be closed. Please provide an update.

maxsmythe commented 2 years ago

@tspearconquest Is your concern outgoing request volume from the API server, or G8r scalability?

If the latter, have the newer releases made this less relevant?

tspearconquest commented 2 years ago

I'm afraid it's the former.

Get Outlook for Androidhttps://aka.ms/AAb9ysg


From: Max Smythe @.> Sent: Friday, July 29, 2022 10:11:56 PM To: open-policy-agent/gatekeeper @.> Cc: Thomas Spear @.>; Mention @.> Subject: Re: [open-policy-agent/gatekeeper] default webhooks resources * cause tons of unnecessary traffic (#545)

CAUTION: This email originated from outside the organization. Do not click links or open attachments unless you recognize the sender and know the content is safe.

@tspearconquesthttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ftspearconquest&data=05%7C01%7Ctspear%40conquestcyber.com%7C52882705caf74d11408708da71d94316%7C8c73de7821b84aa8885ecb22616eb322%7C0%7C0%7C637947475198144884%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=RJsxsbw8erD%2BpzbzaQKXMyy4ZnfSDdH38v9wddxWKvA%3D&reserved=0 Is your concern outgoing request volume from the API server, or G8r scalability?

If the latter, have the newer releases made this less relevant?

— Reply to this email directly, view it on GitHubhttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fopen-policy-agent%2Fgatekeeper%2Fissues%2F545%23issuecomment-1200076357&data=05%7C01%7Ctspear%40conquestcyber.com%7C52882705caf74d11408708da71d94316%7C8c73de7821b84aa8885ecb22616eb322%7C0%7C0%7C637947475198144884%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=xHFdL%2FNpzqvA%2F7ubbe2omGpgn1Zlzz%2F8Ye38bCVLjpU%3D&reserved=0, or unsubscribehttps://nam10.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FATRTFZ2GUXT2R37ZP4EZFXDVWSMPZANCNFSM4MEJB5ZQ&data=05%7C01%7Ctspear%40conquestcyber.com%7C52882705caf74d11408708da71d94316%7C8c73de7821b84aa8885ecb22616eb322%7C0%7C0%7C637947475198144884%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=hQS70%2BDIlrlwtsbrPCytKzTOCBz7yA6JOL3PWci9R%2Fk%3D&reserved=0. You are receiving this because you were mentioned.Message ID: @.***>

maxsmythe commented 2 years ago

ack, thanks for the context

grosser commented 1 month ago

FYI another gotcha: it still needs to validate all constraints or the constraint crds no longer get validated issue so needs to be all used resources +

  - apiGroups:
    - constraints.gatekeeper.sh
    apiVersions:
    - "*"
    operations:
    - CREATE
    - UPDATE
    resources:
    - "*"