open-gitops / project

Repository for top-level information about the OpenGitOps project
https://opengitops.dev/

Security focused group? #32

Closed: mikecali closed this issue 2 years ago

mikecali commented 3 years ago

GitOps Security Considerations

Securing GitOps is very important, especially in an automated environment. If we don’t take security into consideration upfront, we are opening Pandora’s box: a multitude of things can go wrong if security is an afterthought.

  1. Security must be part of every design and should be continuous.
  2. GitOps RBAC (at the controller level?).
  3. Embedded code-validation steps.
  4. Git repository governance.
  5. Levels of access for different personas.
  6. Enforcing Git best practices at all times.
  7. Access control for systems and environments.
  8. Who can access which tooling.
  9. Making sure access is logged (see the sketch at the end of this comment).

This is just a thought that might need consideration for this project.
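
To make item 9 a little more concrete, here is a minimal sketch (names and endpoints are hypothetical, not tied to any particular GitOps tool) of audit-logging middleware placed in front of a tooling endpoint so that every access is recorded:

```go
package main

import (
	"log"
	"net/http"
)

// auditLog wraps any handler and records who called which endpoint,
// so that access to the tooling behind it is always logged.
func auditLog(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		user := r.Header.Get("X-Remote-User") // hypothetical identity header
		if user == "" {
			user = "anonymous"
		}
		log.Printf("access user=%s method=%s path=%s from=%s",
			user, r.Method, r.URL.Path, r.RemoteAddr)
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/deploy", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok\n"))
	})
	log.Fatal(http.ListenAndServe(":8080", auditLog(mux)))
}
```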

mikecali commented 3 years ago

Thanks, @lloydchang - the reason I suggested this is that in real-life implementations we always see that "security considerations are left behind".

I have seen some of those videos and they are indeed good. My thinking is that upfront guidance on security is needed.

roberthstrand commented 3 years ago

I agree that security is something we should try to cover. We haven't really talked about it so far, but it's a topic that can get very advanced, and we can help by creating some guidance on it. It doesn't have to be a very big effort either, and as long as we have clear goals on what we want to achieve, I'm all for helping to facilitate it. I'll bring it up in the meeting today.

todaywasawesome commented 3 years ago

I agree 💯 @mikecali

Would you be interested in starting a greenpaper/whitepaper on this topic?

mikecali commented 3 years ago

I will be happy to contribute to creating a whitepaper. Let me start putting together some ideas on the content and will share them back here.

@todaywasawesome @lloydchang @roberthstrand - thanks for your support.

roberthstrand commented 3 years ago

@mikecali - Since these documents are going to live in the documents repository, I have suggested enabling discussions to make work on specific white/green papers or pattern descriptions easier. The issue can be found here.

At that point, brainstorming specific ideas should be easier, or at least I hope so.

yebyen commented 3 years ago

I would be interested in contributing further to the discussion with @lloydchang about how webhooks of different forms can compromise the security posture, and what guarantees the possible mitigations can offer to repair or obviate those weaknesses. re: https://github.com/open-gitops/project/issues/32#issuecomment-964950127

From my perspective, adding a new timing source that carries no information we actually use, other than the timing of when a specific Source reconciliation should be attempted early, does not compromise much. But through the lens of Simple Is Hard, I can see how those caveats could introduce a lot of complexity, including some implicit promises and conditions we might not want to entertain, so I agree this is a good topic with a lot of specific nuance in it.
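
To illustrate what I mean by a timing source that carries no utilized information, here is a rough sketch (not the Flux implementation; names are illustrative) of a receiver that discards the webhook body entirely and only coalesces deliveries into a single "reconcile early" signal:

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// triggerEarlyReconcile stands in for whatever mechanism the controller
// uses to requeue a Source object ahead of its normal interval.
func triggerEarlyReconcile() {
	log.Println("reconciling Source ahead of schedule")
}

func main() {
	// A buffered channel of size 1 coalesces bursts of webhook deliveries
	// into at most one pending "reconcile early" signal.
	kick := make(chan struct{}, 1)

	go func() {
		for range kick {
			triggerEarlyReconcile()
		}
	}()

	http.HandleFunc("/hook", func(w http.ResponseWriter, r *http.Request) {
		// The payload carries no information we act on: drain and drop it.
		io.Copy(io.Discard, r.Body)
		r.Body.Close()

		select {
		case kick <- struct{}{}: // schedule one early reconciliation
		default: // one is already pending; nothing more to do
		}
		w.WriteHeader(http.StatusNoContent)
	})

	log.Fatal(http.ListenAndServe(":9292", nil))
}
```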

yebyen commented 3 years ago

I view this as fear mongering. Simply put, the Flux webhook attack surface is very tight, though there is always room for improvement. I'm interested in a benchmark that shows how even a generic webhook with no signature verification can be exploited in the Flux notification controller; based on what actually happens on receipt, I am skeptical that it breaks at any threshold of load.

I would be more skeptical of signed, verified webhooks, since checking the signature and validating the payload invokes more code paths than simply triggering a reconcile of Flux's GitRepository resource, which is what happens when a Receiver is properly implemented. If this works how I expect, Flux should only pull the head commit from the Git repository branch, a lightweight operation that can abort as soon as two pre-computed SHAs (the cached HEAD and the remote) are compared and found unchanged; in that case the operation should take practically no computation or bandwidth at all. (Verifying a signature, on the other hand, depending on the algorithms used, could well be susceptible to DDoS in cases where this approach would not have been.)
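
To make the cost contrast concrete, here is a rough sketch (ordinary Go, not Flux code) of the two operations being compared: verifying an HMAC-SHA256 webhook signature over the whole payload, versus checking whether the advertised remote HEAD matches the SHA already cached, which can short-circuit before any fetch happens:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifySignature is the work a signed webhook requires: hash the full
// payload with the shared secret and compare against the sender's signature.
func verifySignature(secret, payload []byte, signatureHex string) bool {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	expected := hex.EncodeToString(mac.Sum(nil))
	return hmac.Equal([]byte(expected), []byte(signatureHex))
}

// headUnchanged is the cheap path described above: two pre-computed SHAs are
// compared and the reconciliation can abort immediately if they match.
func headUnchanged(cachedHEAD, remoteHEAD string) bool {
	return cachedHEAD == remoteHEAD
}

func main() {
	secret := []byte("shared-webhook-secret") // hypothetical
	payload := []byte(`{"ref": "refs/heads/main"}`)

	mac := hmac.New(sha256.New, secret)
	mac.Write(payload)
	signature := hex.EncodeToString(mac.Sum(nil))

	fmt.Println(verifySignature(secret, payload, signature)) // true
	fmt.Println(headUnchanged("3f786850e3", "3f786850e3"))   // true: nothing to pull
}
```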

Not every deployment will be implemented properly, but the software cannot take responsibility for improper implementations that fall outside its scope of control. (You can misconfigure a firewall, and that doesn't make Flux's approach to webhooks any worse by design.)

If there is a security issue in Flux's notification controller, then let's expose it and offer steps to mitigate the issue. If there are no provably insecure paths then we can at least allay some concerns and perhaps put the issue to rest, with respect to Flux at least.

kingdonb commented 3 years ago

fluxcd/notification-controller#256 reported a potential issue; I simply mean that I would like to understand how it can be exploited. Let's have benchmarks that show how, given certain resources (compute nodes of size X), an attacker who exploits the webhook (signed, or unsigned/unverified) can currently drive it hard enough to saturate cluster resources.

If there is a need for rate limiting here, then let's show the risk exposure and if possible impose the necessary rate limiting.
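
If benchmarks do show a need, a minimal sketch of what rate limiting in front of a receive hook could look like (using golang.org/x/time/rate; the limits are illustrative, not recommendations):

```go
package main

import (
	"log"
	"net/http"

	"golang.org/x/time/rate"
)

func main() {
	// Allow on average 2 deliveries per second with a burst of 10; both
	// numbers are placeholders that real benchmarks would have to justify.
	limiter := rate.NewLimiter(rate.Limit(2), 10)

	http.HandleFunc("/hook", func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		// ...hand off to the normal receive path here...
		w.WriteHeader(http.StatusNoContent)
	})

	log.Fatal(http.ListenAndServe(":9292", nil))
}
```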

Simply adding a note that "here, a door can be configured as closed/locked, or it can be left open" is not adding any serious mitigation. Webhooks can be signed or unsigned, and their payloads can be validated or not. Either way, according to the design of Flux's notification controller, the payload is a nonce and can be disposed of without processing. This seems objectively safe to me, but I don't have benchmarks to prove it.

If this is an attack surface that can be exploited, let's show the impact of the exploit (with benchmarks). Then we can work to mitigate the impact, if it is detectable; but until data shows this is an exploitable entryway, we cannot turn it off. The performance impact of reconciling too often is "the devil we know": as the number of reconciler resources grows, it has a very real, measurable impact on control-plane performance. In that sense, the notification controller's Receiver feature allows us to mitigate the devil we know.

I believe @yebyen is merely suggesting that we hear more about this devil we don't know: the attack vector of hitting the receive hook with unnecessarily high, abusive frequency. How fast exactly do you have to trigger it before you can detect any observable impact on performance? (And, for documentation that closes the loop, it would be good to include further notes explaining how one can mitigate this impact further with a correctly configured WAF, restricting sources to trusted networks that remain inside our explicit control.)
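
As a starting point for that question, a rough load-generation sketch: post to the receive hook at a fixed rate for a while, record delivery latency, then repeat at higher rates while watching control-plane metrics. The endpoint and rate below are placeholders:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

func main() {
	const endpoint = "http://localhost:9292/hook" // hypothetical receiver URL
	const rps = 50                                // attempted deliveries per second
	const duration = 30 * time.Second

	ticker := time.NewTicker(time.Second / rps)
	defer ticker.Stop()
	deadline := time.Now().Add(duration)

	var sent int
	var total time.Duration
	for time.Now().Before(deadline) {
		<-ticker.C
		start := time.Now()
		resp, err := http.Post(endpoint, "application/json", bytes.NewReader([]byte(`{}`)))
		if err != nil {
			fmt.Println("request failed:", err)
			continue
		}
		resp.Body.Close()
		total += time.Since(start)
		sent++
	}
	if sent > 0 {
		fmt.Printf("sent %d requests, mean latency %v\n", sent, total/time.Duration(sent))
	}
}
```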

yebyen commented 3 years ago

Adding more reconcilers (GitRepository, Kustomization) is the mechanism by which we scale GitOps to work effectively with many apps (and even many tenants) inside a cluster. So to clarify what I mean: while the performance impact of adding many reconcilers is real, the performance impact of adding more manifests to a single reconciler until scale is reached is also real, and it creates a single point of failure, which is worse.

It can be in scope for individual GitOps projects to deliver these benchmarks; I agree it wouldn't be in scope for OpenGitOps to do it all, but perhaps it could be in scope for the project to provide a prescriptive framework for such benchmarks. I think the upstream-downstream model, where a Kustomization subscribes to a GitRepository to intake and apply changes, will perform impressively even under the threat model presented here; I just don't have benchmarks to prove it. (Nor are my clusters "at scale", for now at least.)

Either way, I meant to point out that we don't have great benchmarks for this threat model, or for at-scale deployments in general. Users will want to know at what scale their control planes will start to feel overwhelmed (inside or outside of any threat model we may add to the conversation about performance).

The threat model of course has more to consider than only performance and DDoS resilience. There could be risk of compromises that are not even related to performance.

pjbgf commented 2 years ago

Folks on this thread may also be keen to take a look at: https://github.com/open-gitops/documents/issues/57

roberthstrand commented 2 years ago

I am closing this issue in favor of #128. If anyone wants to join forces and create content around GitOps security, please volunteer in that issue.