open-policy-agent / gatekeeper

🐊 Gatekeeper - Policy Controller for Kubernetes
https://open-policy-agent.github.io/gatekeeper/
Apache License 2.0
3.71k stars 763 forks source link

Deprecate control-plane labels #1061

Open sozercan opened 3 years ago

sozercan commented 3 years ago

Describe the solution you'd like [A clear and concise description of what you want to happen.] control-plane labels are leftovers from kubebuilder 1.0. They can be deprecated but we should give users time to move off using these labels.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

shomron commented 3 years ago

+1 for deprecation, however since deployment spec.selector is immutable after creation, we need to consider documentation for the upgrade process.

maxsmythe commented 3 years ago

What is the normal upgrade path for something like this? Deleting the deployment and recreating?

cristiklein commented 3 years ago

Could someone clarify why the deprecation of the control-plane label is necessary?

It seems to me that the control-plane label is an established (although not fully "standardized") way of marking system namespace, for which admission controllers should be excluded. At least AKS advocates that approach.

We are about to contribute a PR to kubespray to add said label to all clusters created with kubespray.

Would it be possible to keep the control-plane label for webhook exclusion in Gatekeeper?

maxsmythe commented 3 years ago

TBH I have no strong feelings about the existence (or lack thereof) of a control-plane label on Gatekeeper resources.

I do think that the label being under-defined is a problem. What should the significance of control-plane be? What is an appropriate value for that label to have? I think that, without an authoritative answer to this question, the notion of using this label to make meaningful decisions is potentially brittle as different projects may use it in different ways.

From a security standpoint, I'm not in love with the idea of using labels to control which resources are subject to a webhook and which are not. This makes the ability to set labels equivalent to the ability to bypass policy checks. It can make sense, but needs to be done carefully to avoid unintended consequences.

Historically, Gatekeeper had that label because it was auto-generated by Kubebuilder v1. I think it was then applied to the gatekeeper-system namespace and used as a way to bypass the G8r webhook to avoid self-management until it was replaced by the (more locked-down and explicit) admission.gatekeeper.sh/ignore label. Now it exists as a kind of vestigial appendage (or so I thought).

As I said, I am kind of "meh" about the label with a lean toward removing it just to avoid bikeshedding over label values, etc. If there is an emerging standard that should be something to consider.

Curious to hear others' opinions.

shomron commented 3 years ago

What is the normal upgrade path for something like this? Deleting the deployment and recreating? Yes, I believe that is the only way. That implies a temporary lapse in enforcement as that environment rolls over to the new deployment.

Would it be possible to keep the control-plane label for webhook exclusion in Gatekeeper?

I think we're talking about several different changes all in one.

  1. Removal of the control-plane labels from the Deployment's pod selector. This is internal cleanup and reduces the chances of different kubebuilder-generated deployments in the same namespace from conflicting with each other (trying to steal each other's pods). Even if they were left in place, I think the generic value control-plane: controller-manager on the GK webhook deployment is risky.
  2. Removal of the control-plane labels from the gatekeeper-system namespace - again this is auto-generated and its significance is not defined anywhere. Gatekeeper already defines a new label with well-defined semantics for avoiding self-managing and having both set on the namespace does not provide any value I can see.
  3. Removal of the control-plane as a default exclusion in the GK reference ValidatingWebhookConfig - if we do this, we should call it out in our release notes and give users clear guidance on a safe transition and how to keep the old behavior if they want.

The control-plane label does not follow best-practices as it is not prefixed in any way. As a reference see some well-known labels documented by Kubernetes here and here. My opinion is that it is better to phase it out than to perpetuate it into other projects like kubespray.

My 2c.

ritazh commented 3 years ago

+1 on phased approach to deprecate this.

We can also introduce a new field in the helm chart to allow users to provide this label via helm values.yaml. It can initially default to control-plane then after few releases with warnings in our release notes, we can remove it from the default values.yaml but provide steps to help guide users to set it in their own values.yaml.

cristiklein commented 3 years ago

I see your point. The control-plane label is too generic and needs to be replaced with something more intuitive. I retracted the kubespray PR and we'll look for a better alternative for Compliant Kubernetes.

sozercan commented 3 years ago
  1. Removal of the control-plane as a default exclusion in the GK reference ValidatingWebhookConfig - if we do this, we should call it out in our release notes and give users clear guidance on a safe transition and how to keep the old behavior if they want.

This has been removed starting with v3.1.0: https://github.com/open-policy-agent/gatekeeper/pull/758

ritazh commented 3 years ago

Closed via #758

sozercan commented 3 years ago

@ritazh we still have control-plane references in our manifests. #758 only removed from vwh.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

tspearconquest commented 2 years ago

Hi, is it currently safe as an end user to remove control-plane from the gatekeeper-system namespace? I believe that label is contributing to https://github.com/open-policy-agent/gatekeeper/issues/2142 as per my comment on that issue. Please see comment for more detail

Edit: I believe this may also be causing https://issuetracker.google.com/issues/204835306?pli=1

ritazh commented 2 years ago

Is it safe to remove the control-plane: controller-manager label from the gatekeeper-system namespace currently, if we have already applied the admission.gatekeeper.sh/ignore: no-self-managing label?

If you already have the admission.gatekeeper.sh/ignore: no-self-managing label, then yes it's safe to remove the control-plane label.

I think we have waited long enough to remove the control-plane label everywhere. WDYT? @sozercan @maxsmythe I can open a PR to remove it.

srmars commented 2 years ago

I am having admission.gatekeeper.sh/ignore: no-self-managing and control-plane=controller-manager label in gatekeeper-system namespace. For testing, to fix the TLS handshake error, I tied to remove the control-plane label from gatekeeper-system namespace and its deployments. After the rolling restart of the gatekeeper deployments the control-plane labels got added automatically back to gatekeeper namespace and deployments. Do we need to update some where to make it permanent

gatekeeper version - v3.8.1

tspearconquest commented 2 years ago

Hi @sri53 how do you deploy Gatekeeper? Is it by Helm, or by K8S manifests, or by some other method? Also what version of Gatekeeper and Kubernetes do you deploy to?

srmars commented 2 years ago

Hi @tspearconquest how do you deploy Gatekeeper? Deployed through Azure Policy Add-on for AKS what version of Gatekeeper - 3.8.1 Kubernetes version - AKS 1.23.12

tspearconquest commented 2 years ago

I believe Azure Policy Add-on enforces the label in your case, so if that's correct then removing it for your gatekeeper instance won't be possible until it has been removed in upstream and Azure Policy upgrades to a version which has it removed.

tspearconquest commented 2 years ago

Hi @ritazh - It seems my suspicion was not correct, and removing the control-plane label did not help.

It's really interesting that this is only affecting Gatekeeper, as we do have other tools with MWH and VWH which do not see this problem, and the traffic causing the errors is 100% coming from the konnectivity-agent pods in kube-system

I also took a look in konnectivity configmap and deployment manifest in one of our clusters to see if I could find a log format option, but I'm afraid I couldn't find any. My main concern is that these are not coming in json format, so it causes a lot of spam for our fluentd instance to try to parse non-json log outputs as json.

ritazh commented 2 years ago

Thanks for the update @tspearconquest! I can dig into this a bit later and report back.