operator-framework / operator-sdk

SDK for building Kubernetes applications. Provides high level APIs, useful abstractions, and project scaffolding.
https://sdk.operatorframework.io
Apache License 2.0

Ansible operators usage of WATCH_NAMESPACE env variable may lead to unnecessary privilege escalations #5989

Open ivandov opened 2 years ago

ivandov commented 2 years ago

Bug Report

What did you do?

This is a follow-on and related to #2461.

I deployed an Ansible Operator with no WATCH_NAMESPACE environment variable defined. This operator needs RBACs to work with some resources in the current namespace as well as resources in other namespaces.

Hypothetically, to focus on the potential security exposure or unnecessary privilege escalation here... let's say I'm creating a secret-sprayer Ansible Operator. Its purpose is to read Secrets from the current namespace and create a Custom Resource Foo in another namespace with some of the data from that Secret.

The design discussed in #2461, and the official operator-sdk documentation on Golang manager client scopes (here), seems to indicate that when the WATCH_NAMESPACE env variable is omitted, the manager client that is created will expect cluster-scoped authority.

It also seems that the guidance is then to move all permissions needed by your operator in the CSV into clusterPermissions, in order to work with the now cluster-scoped manager.

However, in this case, the service account that is created is granted permission to get Secrets from ALL namespaces, even though I only need it to read Secrets from the namespace in which my secret-sprayer Operator is running. I only need clusterPermissions to create my Foo CRs in the other namespaces!

What did you expect to see?

Ansible Operators that could correctly run with clients that interact at both the namespace scope and cluster scope as necessary.

What did you see instead? Under which circumstances?

Golang manager/client errors indicating failures to work with Secrets at the cluster scope, even though the operator's CSV had proper namespace-scoped permissions defined for Secrets.

Environment

Operator type: language ansible

Kubernetes cluster type: OpenShift 4.10

$ operator-sdk version v1.8

Possible Solution

Create multiple manager clients, one namespace-scoped for the RBAC rules defined in the CSV's permissions block, and another for RBAC rules that are defined in the CSV's clusterPermissions block.
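A minimal sketch of that routing idea in plain Go (all names here are hypothetical; the real clients would be controller-runtime clients built from the two RBAC sets, which is elided):

```go
package main

import "fmt"

// reader is a toy stand-in for a kube client; the real one would be
// a controller-runtime client.Client.
type reader interface {
	Get(namespace, kind, name string) (string, error)
}

// scopedClients routes each request to the client whose RBAC matches:
// namespaced rules from the CSV's `permissions` block, cluster rules
// from its `clusterPermissions` block.
type scopedClients struct {
	operatorNamespace string
	namespaced        reader // backed by a Role/RoleBinding
	cluster           reader // backed by a ClusterRole/ClusterRoleBinding
}

func (c *scopedClients) Get(namespace, kind, name string) (string, error) {
	if namespace == c.operatorNamespace {
		return c.namespaced.Get(namespace, kind, name)
	}
	return c.cluster.Get(namespace, kind, name)
}

// fake is a trivial reader used only to show the routing.
type fake string

func (f fake) Get(ns, kind, name string) (string, error) {
	return fmt.Sprintf("%s client: %s/%s/%s", f, ns, kind, name), nil
}

func main() {
	c := &scopedClients{
		operatorNamespace: "op-ns",
		namespaced:        fake("namespaced"),
		cluster:           fake("cluster"),
	}
	out, _ := c.Get("op-ns", "Secret", "creds") // served by the namespaced client
	fmt.Println(out)
	out, _ = c.Get("other-ns", "Foo", "example") // served by the cluster client
	fmt.Println(out)
}
```

With this shape, reads of Secrets never leave the operator's own namespace unless a clusterPermissions rule explicitly covers them.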

Additional context

I used Secrets in the hypothetical example because you could see how this could be an unnecessary privilege escalation. My operator only needs to read Secrets from the current namespace, but, I'm forced to grant it permissions to read Secrets from all namespaces.

In terms of threat modeling and ensuring privilege escalations are blocked as much as possible, this is an unintended side-effect for the way I'd like my workload to run.

camilamacedo86 commented 2 years ago

Hi @ivandov,

If I understand correctly, what you are looking for is this:

All projects are built with Golang and controller-runtime (Ansible/Helm operators are actually Golang ones under the hood). The options to configure the manager (see https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/manager) do not allow: "I want an operator that is cluster-scoped (with access to the whole cluster for whatever the RBAC config allows), but for one specific kind (Secret) I want to limit access and restrict it to a namespace X."

What you are looking for can be done via Predicates (which are nothing more than conditions: if condition X holds, return nil to stop the reconciliation). That means in the Reconcile you could check whether the Secret is from the namespace where the Operator is installed, and if not, stop.
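The predicate idea can be sketched with a plain function (a simplified stand-in for controller-runtime's predicate.Funcs; the `object` type here is a minimal placeholder for real object metadata):

```go
package main

import "fmt"

// object is a minimal stand-in for a Kubernetes object's metadata.
type object struct {
	Namespace string
	Name      string
}

// namespacePredicate returns a filter that admits events only for
// objects in the given namespace, mimicking what a controller-runtime
// predicate does before Reconcile is ever called.
func namespacePredicate(operatorNamespace string) func(object) bool {
	return func(o object) bool {
		return o.Namespace == operatorNamespace
	}
}

func main() {
	allow := namespacePredicate("op-ns")
	fmt.Println(allow(object{Namespace: "op-ns", Name: "my-secret"}))    // true
	fmt.Println(allow(object{Namespace: "other-ns", Name: "my-secret"})) // false
}
```

Note that this only filters reconciliation; it does not shrink the RBAC grant itself, which is the crux of the issue.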

However, as you are using Ansible, it seems more complicated. You would need your playbook to check whether the Secret comes from the same namespace where the operator is running and, if not, ignore it.

Just to add: you might also want to take a look at (though it is for Golang): https://sdk.operatorframework.io/docs/building-operators/golang/references/event-filtering/

c/c @asmacdo @everettraven

everettraven commented 2 years ago

Hi @ivandov,

My understanding of the core problem here is the inability to "scope" down operator permissions to only have the permissions they need to operate. In your particular case that would be that your operator should only have the RBAC permissions to read secrets from namespace X, while having the permissions to create the Foo CR in all namespaces.

This is a problem that we have identified in the current state of the operator ecosystem and are currently looking into different solutions. I don't have an ETA on when this functionality may be available (if at all), but there is work being done in this problem space.

For now I think the best workaround would be what @camilamacedo86 suggested.

ivandov commented 2 years ago

Thanks @camilamacedo86 and @everettraven.

So my team has had experience building both Golang and Ansible operators, as well as forking the operator-sdk and making our own extensions to the ansible operator logic for specific scenarios we wanted handled differently.

This issue was written up - not because of the need for handling this form of logic in the ansible layer - but more about the security exposures of this form of operator.

It is a significant broadening of scope to make all Ansible operators run with cluster permissions for a resource when they only need to work with that resource in the current namespace.

From a security perspective, if my operator's service account token leaks... a user would be able to read all secrets from all namespaces! That should not be the case if the functional need of my workload is to read secrets from the same namespace in which the operator is installed.

It had seemed to me that the ClusterServiceVersion spec handled this form of permission requirements well with its clusterPermissions and permission blocks, and, I should be able to mix/match RBAC values in these two sections to hand-craft the required permissions for my workload at both the cluster and namespace scope, respectively.

I'm not intimately familiar with the considerations for handling multiple controller-runtime manager instances, but my overly simplistic view is that this could be accomplished with two clients in the ansible-operator Golang logic: one configured with the cluster-scoped RBAC from the clusterPermissions block, and another configured with the namespace-scoped RBAC from the permissions block.

In both cases, these clients would use the same service account token, which would have the properly scoped RBACs set with ClusterRoleBindings or RoleBindings.

From an "overly-simplified" perspective, I should be able to hit the kube APIs with the proper cluster or namespace scoped URIs based on the permissions that were granted.
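That "overly-simplified" view maps directly onto the Kubernetes REST API path layout, sketched here with plain string building (real clients construct these paths for you; `secretsPath` is an illustrative helper, not an SDK function):

```go
package main

import "fmt"

// secretsPath returns the REST path a client hits for Secrets.
// The namespaced path only requires a Role in that namespace,
// while the all-namespaces path requires a ClusterRole.
func secretsPath(namespace string) string {
	if namespace == "" {
		return "/api/v1/secrets" // cluster scope: list across all namespaces
	}
	return fmt.Sprintf("/api/v1/namespaces/%s/secrets", namespace)
}

func main() {
	fmt.Println(secretsPath("op-ns")) // /api/v1/namespaces/op-ns/secrets
	fmt.Println(secretsPath(""))      // /api/v1/secrets
}
```

In other words, the API surface already distinguishes the two scopes; the question is whether the operator's client machinery is willing to use the namespaced form when that is all the RBAC allows.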

In its current state, it should be made extremely clear that Ansible operators that need to work with potentially sensitive resources, like Secrets, should not be granted cluster-scoped permissions, as the potential for abuse of the service account is a serious threat-modeling concern.

camilamacedo86 commented 2 years ago

Hi @ivandov,

Currently, controller-runtime itself does not allow us to specify that one kind, or a list of kinds, should be watched only in a specific namespace, as you are looking for. Therefore, I think we should close this one based on the above explanations.

c/c @everettraven

ivandov commented 2 years ago

A limitation in controller-runtime should not be a valid reason to close this issue - as this is a valid concern regardless of library limitations.

everettraven commented 2 years ago

I agree with @ivandov that the current limitation in controller-runtime shouldn't warrant closing this issue, but I think it would be better suited to the backlog, given that there is work being done in this problem space.

@ivandov if you are interested in reading more about the work and potential solutions we are exploring I would recommend taking a look at these HackMD documents:

As another part of this, I am also working on a PoC for a new type of Cache that will be dynamic and able to handle RBAC scoping in the caching layer.

I think that the problem that you describe is the exact type of problem we are wanting to solve with this new concept of Descoped Operators and RBAC templating/scoping (the problem being Operators not being able to have permissions scoped via RBAC and instead requiring all or nothing access to resources).

As I mentioned in my previous comment, I don't have an ETA as to when any of this functionality will be available. I do agree that it would be a good idea to make this limitation clearly stated in the documentation. @ivandov would you be interested in creating a PR to update the docs to mention this?

I am happy to keep this issue open for now and use it as a forum to post updates with the work that we have done regarding this problem. I hope this helps!

ivandov commented 2 years ago

Thanks @everettraven - I reviewed the Descoped Operators doc and spoke with our internal OLM Guild lead, who sounds like he's been staying in sync with the work your team is doing.

I like the approach the descoped operator pattern will be taking; it makes sense. My only concern is that, even with manual definition of RBAC for the Operator's service account, the limitation in controller-runtime sounds like it would still require elevated permissions to handle scenarios where RBAC scope needs to fluctuate between namespace and cluster.

Or - are you saying that the PoC caching solution that you are working on will address this somehow? If so, how?

Before anything is available in a cache, some kube client will have to query the proper cluster or namespace scoped resource APIs to populate the cache... so, won't controller-runtime changes still be needed to understand if a query needs to be performed against the namespace scope or cluster scope in certain cases?

I'm out of my depth here, maybe the cache is pre-populated with all resources that are able to be queried from the available RBAC scopes? Sounds like an area where a logic/data-flow disconnect can occur.

everettraven commented 2 years ago

@ivandov You are correct that there are currently some limitations in controller-runtime that make it difficult for an operator to handle this scenario.

Or - are you saying that the PoC caching solution that you are working on will address this somehow? If so, how?

The idea is that the PoC caching solution I am currently working on will address this. As for the how: controller-runtime has a cache.Cache interface that is used for the caching layer in a controller. For now, I am working on a PoC that is essentially a Go library (Ansible operators are backed by a controller written in Go, so I think it could be added to that controller) implementing this cache.Cache interface while exposing a few more functions to help operator authors handle dynamically changing permissions.
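As an illustration only (this is not the PoC code, and the interfaces below are drastically reduced stand-ins for controller-runtime's cache.Cache), the shape of such an RBAC-aware cache wrapper might look like:

```go
package main

import (
	"errors"
	"fmt"
)

// getter is a toy stand-in for the read side of cache.Cache.
type getter interface {
	Get(namespace, name string) (string, error)
}

// mapCache is a trivial backing cache keyed by "namespace/name".
type mapCache map[string]string

func (m mapCache) Get(namespace, name string) (string, error) {
	if v, ok := m[namespace+"/"+name]; ok {
		return v, nil
	}
	return "", errors.New("not found")
}

// scopedCache wraps an inner cache and refuses reads outside the
// namespaces the operator's RBAC currently grants, so a permissions
// change can tighten or widen access at the caching layer.
type scopedCache struct {
	inner   getter
	allowed map[string]bool // namespaces permitted by current RBAC
}

func (s *scopedCache) Get(namespace, name string) (string, error) {
	if !s.allowed[namespace] {
		return "", fmt.Errorf("namespace %q not permitted by current RBAC", namespace)
	}
	return s.inner.Get(namespace, name)
}

func main() {
	inner := mapCache{"op-ns/creds": "s3cr3t", "other-ns/creds": "hidden"}
	c := &scopedCache{inner: inner, allowed: map[string]bool{"op-ns": true}}
	v, err := c.Get("op-ns", "creds")
	fmt.Println(v, err) // s3cr3t <nil>
	_, err = c.Get("other-ns", "creds")
	fmt.Println(err != nil) // true
}
```

The real PoC additionally has to populate the cache from the API server, which is where the scoped-URI question from earlier in the thread comes back in.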

That's all the details I'll share here for now, but if you are interested in looking into it some more here are the GitHub repositories I have been working on to develop and test this PoC:

One thing I want to note is that this is an early stage PoC and it is NOT a solution sanctioned by the Operator SDK project or the Operator Framework organization.

I will continue updating this issue as more information is available.

openshift-bot commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten /remove-lifecycle stale

everettraven commented 1 year ago

/lifecycle frozen