projectcalico / calico

Cloud native networking and network security
https://docs.tigera.io/calico/latest/about/
Apache License 2.0

tigera-operator operates with effective cluster-admin #3988

Open dghubble opened 4 years ago

dghubble commented 4 years ago

Expected Behavior

Without the Tigera Operator, Calico was deployed with a restrictive ClusterRole that could mostly just get/list pods (example).
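For reference, the operatorless calico-node ClusterRole was roughly of this shape (an illustrative sketch, not the exact manifest; the real role carries additional rules, e.g. for Calico's own CRDs):

```yaml
# Illustrative sketch only; the real calico-node ClusterRole
# includes further rules beyond these read-only ones.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: calico-node
rules:
  # Read-only access to workload and cluster state.
  - apiGroups: [""]
    resources: ["pods", "nodes", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources: ["networkpolicies"]
    verbs: ["list", "watch"]
```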

Current Behavior

Calico v3.16 docs and releases appear to be favoring Tigera Operator more.

I've seen some projects introduce an operator as a mechanism to create the manifests that a user would typically just create directly (rather than to manage custom APIs). I imagine that's the story behind the Tigera Operator. The cost is that the Tigera Operator uses a high level of access to the cluster (a broad ClusterRole).

Effectively, Tigera Operator is running as a cluster-admin.
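By contrast with the read-mostly role above, an installer-style operator needs write access to the very primitives it creates, which in RBAC terms looks something like the following (an illustrative sketch of why such an operator approaches cluster-admin, not the actual tigera-operator ClusterRole):

```yaml
# Illustrative sketch: an installer operator must create and mutate
# workload and RBAC objects cluster-wide. Granting create/update on
# ClusterRoles in particular is close to handing out cluster-admin.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: installer-operator-example   # hypothetical name
rules:
  - apiGroups: ["apps"]
    resources: ["daemonsets", "deployments"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
  - apiGroups: ["rbac.authorization.k8s.io"]
    resources: ["clusterroles", "clusterrolebindings"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
  - apiGroups: [""]
    resources: ["serviceaccounts", "configmaps", "secrets"]
    verbs: ["create", "get", "list", "watch", "update", "delete"]
```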

Possible Solution

Calico continues to maintain (operatorless) manifests and recommends them for production. Components are created directly (e.g. as a DaemonSet), and production users continue to use a limited ClusterRole.

Context

What are you trying to accomplish?

Maintaining direct control over what is applied to clusters and using limited RBAC.

Your Environment

fasaxc commented 4 years ago

Thanks for raising the concerns; would be great to see if we can tighten some of those permissions up.

I don't think we're planning to sunset the manifests in the near term, but there are many drivers for the move towards the operator and it is not intended to be "dev only".

  1. It's a hard requirement for OpenShift. It's the only way that they allow production components to be installed.
  2. As the number of types of k8s cluster goes up (kops, kubeadm, The Hard Way, AWS DIY, EKS with AWS VPC CNI, EKS with Calico CNI, GCP DIY, GKE, on-prem, IKS, AKS, Rancher, OpenShift, Docker EE, Typhoon, ...) it's getting harder and harder to install Calico and harder and harder for us to cover all the cases; there are lots of different moving parts that need to be adjusted for each environment. The operator helps to detect the right environment and apply the right tweaks.
  3. Avoiding foot-guns. With our CRDs in the main manifest, we've had people delete the main manifest as part of their upgrade and with it all the CRD resources for their running cluster :scream: . We also have people adjust the image version in the manifest without properly updating the rest of the manifest; the operator is intended to make sure all of that is handled correctly.
  4. Ease of use.
  5. Calico Enterprise has even more moving parts and extra initialisation steps that need to be done in order; the operator helps to keep that in check.

If the community would stop releasing new k8s distros, maybe we could backtrack on the operator; deal? :laughing:

fasaxc commented 4 years ago

FWIW, a lot of the permissions we have come from the setup on OpenShift, where the operator has to run in one namespace and the Calico components in another (and still more for the Calico Enterprise components). Since Secrets are namespaced and multiple components need the secrets, we need the operator to copy various secrets from its namespace to others. Would be great if we could lock that down to specific named namespaces and secrets.
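The kind of lock-down described here is expressible in RBAC today: a namespaced Role can name the exact secrets the operator may read rather than granting access to all secrets. A hypothetical sketch (the secret names are made up for illustration):

```yaml
# Hypothetical sketch: restrict the operator to reading two named
# secrets in its own namespace, instead of secrets cluster-wide.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: secret-copier        # hypothetical name
  namespace: tigera-operator
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["typha-certs", "node-certs"]   # hypothetical names
    verbs: ["get"]
```

One practical limit: `resourceNames` restrictions apply to verbs like `get`, `update`, and `delete`, but not to `list` or `watch`, so an operator that watches secrets still needs broader read access in that namespace.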

caseydavenport commented 4 years ago

We should be able to lock down to specific named resources using RBAC as well. That was the intention from the beginning, but we haven't gotten that far yet!

caseydavenport commented 4 years ago

Will you continue to maintain (operatorless) manifests?

There is no intention to remove support for these in the near to medium term. However, as we continue to make strides with the operator our intention is to more and more strongly recommend that approach. That's of course all subject to change based on the unknowable future!

Feedback like this issue is really important to receive and will help us make sure the operator approach is meeting all of the right community needs, as well as the purposes that it originally set out to do.

Do you intend to make Tigera Operator a required component?

Similar to the above, definitely not in the short to medium term. Maybe some day, but it's pretty hard to see that happening.

A similar story - it wasn't too long ago that installing Calico using a DaemonSet was new and exciting, and worrying to some! Most users installed Calico underneath Kubernetes rather than on top of it. We switched, got lots of feedback, incorporated it, and nowadays very very few users are not using a DaemonSet to install Calico, but it is still possible to do so. My hope is that we go through the same process with the operator.

dghubble commented 4 years ago

Thanks for the details and thought around this.

I definitely understand you want to provide ease of use and support many different platforms/distros/users. Relevant to this is trying to design for both end-users (humans following Calico tutorials who may find value in an installer doing things for them) and designing a component for use in other clusters/systems.

As a distro, my end-users won't see whether the cluster came with the right manifests or an operator added them. They will see that the RBAC profiles increased in scope. I can't speak to OpenShift and their needs. But at the moment, in my distro, the Tigera Operator would be an unprecedented level of access, though I can stick to loosely basing off the operatorless manifests as I've done historically.

I know operators are pretty open-ended. Some focus on managing their own defined custom resources, and that seems pretty reasonable (foo operator manages Foo resources, sure). Some also take on being installers, and that worries me some. It's hard for me to justify handing out cluster-wide access to ClusterRoles, Deployments, and DaemonSets when I could just as well have those made at cluster bootstrap, leaving the steady-state RBAC strictly limited.

I do wonder if there are other ways you could aim to reduce complexity and ease maintainability, perhaps complementary to having an operator installer. While I don't have the expectation that Calico provides a tutorial for every case (part of the distro's job), I'm sure your issue tracker feels that burden regardless. Another angle might be borrowing ideas from other CNI providers (central ConfigMap vs env vars, MTU detection, delegation of vendor examples, clamping down on CRD creep). I don't presume to know the solutions, and I'm sure it's something you're always thinking about. But I do see the problem too. Calico is the default in my distro, but it does admittedly have the most moving parts.

dghubble commented 4 years ago

In fairness, I don't want to leave a broad issue that isn't tracking a concrete thing. You've both provided insights into the operator/installer's direction and goals (which answers my initial questions, thx), and I've shared some counterpoint concerns for you to maybe consider.

I can close if you want to track potential RBAC scope reductions somewhere else.