ory / keto

Open Source (Go) implementation of "Zanzibar: Google's Consistent, Global Authorization System". Ships gRPC, REST APIs, newSQL, and an easy and granular permission language. Supports ACL, RBAC, and other access models.

Moving forward with ORY Keto #47

Closed aeneasr closed 5 years ago

aeneasr commented 5 years ago

I recently (re-)discovered the OPA project. This issue is about deprecating the ORY Ladon engine and aligning ORY Keto with OPA. The decision is not yet made and we are looking for valuable input regarding this.

OPA allows you to write authorization logic in a language specifically designed for that purpose, called rego. The syntax is very Go-like. Because of this, OPA is capable of providing all sorts of authorization mechanisms, like RBAC, ABAC, ACL, AWS IAM Policies, and more. In fact, I believe that ORY Ladon's logic is implementable in rego. I'm not sure if that holds true for conditions, which still needs verification on my side.
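
For instance, a flat RBAC check fits in a few lines of rego. This is a minimal sketch I put together, not taken from the OPA docs; the data layout (user_roles, role_permissions) is assumed for illustration:

package example.rbac

default allow = false

# Assumed data layout (illustration only):
#   data.user_roles:       {"alice": ["admin"]}
#   data.role_permissions: {"admin": [{"action": "view", "resource": "articles"}]}
allow {
    role := data.user_roles[input.subject][_]
    perm := data.role_permissions[role][_]
    perm.action == input.action
    perm.resource == input.resource
}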

Let's take a look at the current downsides of each project.

I believe that policy documents as implemented by ORY Ladon are very powerful, but also very complicated. Many developers struggle with proper resource & action naming. I think that regular expressions have their place here, but many developers struggle with writing and testing them, and variable substitution is very flaky from a UX perspective (currently it's only used in the ORY Oathkeeper adapter for ORY Keto, iirc). Regular expressions also do not scale well, especially when read from the database. We can mitigate this with caching, but that treats the symptom, not the problem itself.

On the other hand, OPA is limited. I think rego is great for developers who really want to jump into this (like me). But it's a new syntax, a new language, and new tools. I think the language is not incredibly intuitive and not always readable:

I believe that policy documents are, in general, quite complicated. In my opinion, rego has a steep learning curve as well. Can you tell me immediately what this does?

sod_roles = [
    ["create-payment", "approve-payment"],
    ["create-vendor", "pay-vendor"]
]

sod_violation[user] {
    role1 = user_role[user][_]
    role2 = user_role[user][_]
    sod_roles[_] = [role1, role2]
}

At least I cannot. The point I'm trying to make is: you have to learn this.

OPA comes with a REST API but it's really a parser and execution engine. It parses rego files and executes the logic based on data you provide. The result is always true or false, depending on the authorization result.
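
For illustration, this is roughly what querying such a server looks like from Go, assuming OPA was started with opa run --server (default port 8181) and a module in package example.authz defines an allow rule:

package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask OPA's Data API for the decision of example.authz.allow.
	// The {"input": ...} wrapper is the shape OPA's REST API expects.
	input := []byte(`{"input": {"subject": "peter", "action": "view", "resource": "articles:1234"}}`)
	resp, err := http.Post(
		"http://localhost:8181/v1/data/example/authz/allow",
		"application/json",
		bytes.NewReader(input),
	)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // e.g. {"result": true}
}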

The server is limited. It stores everything in-memory, so pushing logic to the server is not realistic with more than one server running. Instead, you'll probably have to write a CI pipeline that builds a Docker image containing all your policy definitions. IMO that can be very nice, especially if you have rego tests as part of that pipeline. But, from my experience, most people do not want to, or don't know how to, do these things.

Coming back to policies for a moment: most developers do not need AWS IAM Policies; simple role management is enough. By the way, the Google Cloud Platform recently migrated completely to RBAC/ACL as well (at least in the UI). Very few people want to deal with complicated JSON documents, and many make mistakes, which is evident from the many S3 leaks we've seen recently (caused by misconfigured buckets, really misconfigured AWS IAM Policies). I think the same would have happened if AWS used rego, or it would probably have caused fewer people to use the feature at all.

My vision for ORY Keto is a "policy decision point" (something that says whether an action is allowed or not) that just works. I also believe that it should be possible to use several well-known patterns out of the box, including RBAC, ACL, ABAC, and ORY Ladon Policies (for backwards compatibility), and, well, maybe your own rego definitions? I will experiment this month with different concepts and try to migrate Ladon Policies on top of rego. My preliminary tests showed that we can get a 10x performance improvement for simple use cases. We'll see how well this holds for advanced ones.

We write this software for you, so please participate in the discussion and leave your ideas and comments below.

MOZGIII commented 5 years ago

From the community standpoint, I value diversity. Ladon and OPA are alternative solutions to the same problem, and that's great. I considered both for use in my latest project (though I used another solution in the end). It's great that we have a choice. So that's the first point. Second, if you think Ladon is working, why bother swapping it out? I didn't use it, but it seems not bad. OPA has different drawbacks, but that doesn't mean it's better. I'd rather stick with the tool I built myself than use something else, especially when in doubt whether the candidate solution is any better. Of course, you have to try it to really understand such things, so if you really want to evaluate OPA - why not support both? That shouldn't be very difficult, and it would let advanced users do the testing. I think the decision on what to use will take time either way.

MOZGIII commented 5 years ago

Btw, I really like what you described about the rego server (a stateless service that's intended to be configurable via code, as policies). I feel like it's so much better than a database-backed solution for all my use cases. Are there really people who live without CI pipelines nowadays? It's slightly off-topic, but it's an important estimate for the overall design.

fredbi commented 5 years ago

Here is my 2 cents.

In our view, the most appealing part of ORY was the proxy AND the authorizer, that is keto.

What we liked was that these building blocks were more or less independent.

After experimenting with keto, I liked the powerful yet very simple layout. Further, Ladon is designed to be extensible (possibly with open-source contribs).

I think we should value and maintain such simple solutions, even though a more complex setup with a DSL etc. is also interesting.

Bottom line, my piece of advice would be:

So this issue should rather move to oathkeeper, after pluggable features are made available.

aeneasr commented 5 years ago

Hi there! Thank you for your input. I feel you when it comes down to complexity. Here is my initial analysis of OPA:

Stability

It's tagged as 0.x, so I guess this is to be expected? Although, IMO, panics shouldn't happen in 0.x releases either.

Panics

The characters . and - cause panics:

$ opa run
> .
panic: runtime error: index out of range

I'll probably run a fuzzer on this library to see what else we can do with this.

Tabs vs Spaces

Using copy & paste in opa run REPL with tabs causes weird paste behaviour:

> x := [{
    "bla": "bla"
}]

x = [{
| data.repl.version"bla": "bla"
| }]
| 
1 error occurred: 2:18: rego_parse_error: no match found
    data.repl.version"bla": "bla"
                     ^

Replacing tabs with spaces fixes that:

x := [{
    "bla": "blubb"
}]

Functionality

CIDR

As far as I can tell, CIDR matching is not supported. I'm not sure if we can implement CIDR matching in pure rego.

Assignments, Equality, Matching

This is a very special topic. There is a bunch of stuff going on here which can cause significant issues. I think this is a decisive downside of the OPA rego language, maybe even a fatal one. Let's examine:

Assignment

Rego treats = differently depending on context:

> a = 3 # This is an assignment
> a
3
> a = 4 # This is an assertion which looks for a value in a which is 4 (none is found -> undefined)
undefined
> a = 3 # This is an assertion looking for value 3 in a which is defined -> true
true
> a == 3 # This is an equality check
true
> a # a is still 3
3

I think this is a serious mistake in the concept of OPA which will probably never be fixed, as fixing it would most likely break a ton of existing code. Here's how they explain it:

The equality operator (=) is used to define expressions that assert that two values are the same. If the expression is defined in terms of one or more variables, then the expression will evaluate to true if one of the variables is unbound. If neither operand is an unbound variable, the expression is evaluated by comparing the values referenced by the operands.

OPA attempts to bind variables to values when it encounters unbound variables in equality expressions. Binding a variable affects subsequent evaluation of expressions such that the variable will be treated as a constant (with the bound value) instead of a variable.

This can lead to significant issues. Let's take the following sites definition (everything taken from OPA docs):

sites = [
    {
        "region": "east",
        "name": "prod",
        "servers": [
            {
                "name": "web-0",
                "hostname": "hydrogen"
            },
            {
                "name": "web-1",
                "hostname": "helium"
            },
            {
                "name": "db-0",
                "hostname": "lithium"
            }
        ]
    },
    {
        "region": "west",
        "name": "smoke",
        "servers": [
            {
                "name": "web-1000",
                "hostname": "beryllium"
            },
            {
                "name": "web-1001",
                "hostname": "boron"
            },
            {
                "name": "db-1000",
                "hostname": "carbon"
            }
        ]
    },
    {
        "region": "west",
        "name": "dev",
        "servers": [
            {
                "name": "web-dev",
                "hostname": "nitrogen"
            },
            {
                "name": "db-dev",
                "hostname": "oxygen"
            }
        ]
    }
]

Now we define something OPA calls a "Rule"; here we generalize a reference:

> hostnames[name] { sites[_].servers[_].hostname = name }
> hostnames[x]
+-------------+--------------+
|      x      | hostnames[x] |
+-------------+--------------+
| "hydrogen"  | "hydrogen"   |
| "helium"    | "helium"     |
| "lithium"   | "lithium"    |
| "beryllium" | "beryllium"  |
| "boron"     | "boron"      |
| "carbon"    | "carbon"     |
| "nitrogen"  | "nitrogen"   |
| "oxygen"    | "oxygen"     |
+-------------+--------------+

I think this is already confusing enough. So x is an unbound variable which is returned as a value by hostnames[x]. Above, we define hostnames as (basically) a set of values, where each value is set to a hostname. Note that switching the positions of name and hostname yields the same result:

> hostnames[name] { name = sites[_].servers[_].hostname }
> hostnames[x]
+-------------+--------------+
|      x      | hostnames[x] |
+-------------+--------------+
| "hydrogen"  | "hydrogen"   |
| "helium"    | "helium"     |
| "lithium"   | "lithium"    |
| "beryllium" | "beryllium"  |
| "boron"     | "boron"      |
| "carbon"    | "carbon"     |
| "nitrogen"  | "nitrogen"   |
| "oxygen"    | "oxygen"     |
+-------------+--------------+

So far so good. Let's now really mess with this thing and bind the variable name:

> name = "oxygen"
> hostnames[x]
+----------+--------------+
|    x     | hostnames[x] |
+----------+--------------+
| "oxygen" | "oxygen"     |
+----------+--------------+

name was shadowed, and name = sites[_].servers[_].hostname is no longer an assignment but an assertion. This will cause serious programming issues if you're not extremely careful. Especially in complex rules, this can cause significant problems which are very difficult to trace. Unfortunately, this appears to be a key "feature" of OPA/rego; citing the docs:

When a comprehension refers to a variable in an outer body, OPA will reorder expressions in the outer body so that variables referred to in the comprehension are bound by the time the comprehension is evaluated.

Fortunately, rego knows how to scope variables too, with :=. This (obviously?) won't work when the assignee is the existing value:

> hostnames[name] { sites[_].servers[_].hostname := name }
1 error occurred: 1:19: rego_compile_error: cannot assign to ref

But it does work the other way around:

> hostnames[name] { name := sites[_].servers[_].hostname }
> hostnames
[
  "oxygen",
  "hydrogen",
  "helium",
  "lithium",
  "beryllium",
  "boron",
  "carbon",
  "nitrogen"
]

Of course it's possible to make sure unintentional shadowing doesn't happen (using :=), but you have to know about it. The examples and docs overwhelmingly use =, which hints at the authors preferring implicit binding. I honestly think that this is a terrible design decision.

Here's another example for this from the docs:

> region = "west"; names = [name | sites[i].region = region; sites[i].name = name]

sites[i].region = region is an assertion (because region = "west") while sites[i].name = name is an assignment (for name). I think this is neither readable nor a good design decision for a programming language.

Assigning _

I don't know why this happens:

> 3 = _
true

The docs say that _ is resolved to a random variable name internally. But that doesn't seem to be the case, because:

> 3 = asdf78zuhijk
+--------------+
| asdf78zuhijk |
+--------------+
| 3            |
+--------------+

This evaluates to nothing:

> _ = 3

Readability

In my personal opinion, rego is neither very readable nor beginner-friendly. The next example is again taken from the docs:

app_to_hostnames[app_name] = hostnames {
    apps[_] = app
    app_name = app.name
    hostnames = [hostname | name = app.servers[_]
                            sites[_].servers[_] = s
                            s.name = name
                            hostname = s.hostname]
}

I can't tell you what's going on here, at least not without looking very hard and sort of guessing what's an assignment and what's an assertion. Maybe I need to take some courses on the Datalog query language ;)

Ladon as an OPA Policy

So here is a strategy that uses exact string matching (case-sensitive):

package keto.ladon.exact

import input as request

policies := [
    {
        "resources": [`articles:1234`], # articles:<[0-9]+>
        "subjects": [`peter`], # zac|peter
        "actions": [`view`],
        "effect": "allow"
    },
    {
        "resources": [`articles:whatever`], # articles:<[0-9]+>
        "subjects": [`zac`], # zac|peter
        "actions": [`get`],
        "effect": "deny"
    }
]

default allow = false

allow {
  policies[i].resources[_] == request.resource
  policies[i].subjects[_] == request.subject
  policies[i].actions[_] == request.action
  policies[i].effect == "allow"

  # deny-override: a matching deny policy always wins
  not deny

  # request.condition
}

# rego cannot define a rule named `not allow`, so the deny case
# becomes its own rule, which the allow rule negates above
deny {
  policies[i].resources[_] == request.resource
  policies[i].subjects[_] == request.subject
  policies[i].actions[_] == request.action
  policies[i].effect == "deny"

  # request.condition
}
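
For reference, the input document that the first policy would match looks roughly like this (the field names follow the rule bodies above; the "input" wrapper is the shape OPA's REST API expects):

{
    "input": {
        "resource": "articles:1234",
        "subject": "peter",
        "action": "view"
    }
}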

So far I'm still trying to figure out how we could implement conditions using OPA. I'm not very hopeful that it will be possible. However, maybe we can keep conditions as they are and offer them as an add-on. I'm not sure it makes a lot of sense to do so, as rego is quite expressive, but maybe it's an opportunity to make this easier to use.

Conclusion

This is it for my first analysis of OPA. Going from the initial hyped-up state to actually working with it, I have to say that I see serious design mistakes in OPA and rego.

Regardless of whether we actually use OPA in this project or not, I think there's a great opportunity to improve the ORY Keto project, primarily by adding RBAC, ABAC, and ACL engines. What we have to figure out is whether these components are isolated or somehow work together.

RomanMinkin commented 5 years ago

To me, the beauty of Ladon's (AWS-like) policy syntax is its readability for non-developer users. That was one of the reasons we chose Hydra+Keto.

At first glance, OPA's rego DSL looks more powerful, but harder to read.

There is also casbin (https://github.com/casbin/casbin), an access control library. It's very sweet, but the policy DSL has a high bar to entry.

MOZGIII commented 5 years ago

@aeneasr good review, I didn't know that much about OPA. When you refer to RBAC/ABAC/ACL - is there a place where those are defined? As far as I know, these aren't exact formats/implementations or even patterns, but rather general ideas on how to design an access control solution. Is there anything more specific to them? What I mean is: if I understand correctly, there can be multiple very different approaches to implementing ACL. Is that so?

aeneasr commented 5 years ago

When you refer to RBAC/ABAC/ACL - is there a place where those are defined? As far as I know, these aren't exact formats/implementations or even patterns, but rather general ideas on how to design an access control solution. Is there anything more specific to them? What I mean is: if I understand correctly, there can be multiple very different approaches to implementing ACL. Is that so?

Yes, these concepts are frameworks, specifically ABAC. ACL and RBAC are more fixed but also have different variants: RBAC comes as hierarchical RBAC and plain RBAC. Depending on what you need, there are different levels of granularity. But there is usually a standard RBAC and ACL case which is generalizable.

dkushner commented 5 years ago

@aeneasr First off, the fact that these explorations are taking place in the open and with an active effort to involve the community is tremendous. This is exactly why we opted to use Ory products in our stack and why we've come to trust (but verify) them in security-critical roles.

The analysis you've done is excellent and I agree on all points. Having a fuzzing run would be good practice in general but what you've already provided seems to spell out the reasons why OPA should not be a valid target at the moment. That the parser is so easily defeated by what may be considered run-of-the-mill programming/structural mistakes is a non-starter. However, stability can be improved over time so, for the sake of discussion, we won't count this as an immediate disqualification.

The issue here that stands out to me most is the scope of what OPA intends to do. From their own example set, the policies are intended to govern everything from deployment targets to infrastructure provisioning to application access control. As someone else has mentioned here already, is this not the role of CI/CD pipelines and organizational deployment policies? If you are in a position where you need an automated system (outside of existing access controls on internal CI/CD pipelines, cloud service APIs and in the case of Kubernetes, the clusters themselves) to step in and tell you you're trying to deploy code to the wrong place, you have screwed up somewhere. Sorry for the digression into personal opinions, but I would be remiss if I didn't address how much OPA just smacks of "answering a question nobody asked."

What I enjoy about Ladon is that it is a focused "protocol" (for lack of a better word) implemented in a human-readable and well-traveled transport format. By targeting GCP/AWS IAM-like features, Ladon has laid out for itself a very attainable and tightly-scoped roadmap. OPA seems to be aiming at a language which can be used to describe these other systems and so, naturally, seems more expressive, but to what significant advantage?

aeneasr commented 5 years ago

Thank you for your insights @dkushner . I generally agree that OPA tries to solve too much. On the other hand, though, I think it could become the driver behind ORY Keto. We (the maintainers) would learn and understand OPA by the book and add abstractions on top, called ORY Keto. I'm not sure about this path yet, just some ideas. I'll definitely think about this more.

And thank you all for participating! It's great to see that this product is useful to many!

aeneasr commented 5 years ago

I thought about this a bit more. I will explore whether we can embed OPA in keto/ladon and use it as an (extensible) backend for an easy-to-consume frontend (as it is today). I will definitely forward these findings to the OPA maintainers and see what they have to say about it.

What bothers me at the moment in ladon is performance, usability, storage, and testing. I think that regex is hard to write for most people and even harder to test. We also lack a way of testing policies in general.

The storage adapter (SQL) has some serious issues wrt performance and scalability. I also think that keto should work for RBAC and ACL. All of these "issues" have led me to think about OPA and the urge to improve keto and/or ladon. I think the next step is to talk to the OPA maintainers and experiment further with the technology. Personally, I like config-as-code and the general trajectory OPA has, even if I strongly disagree with some of the design decisions they made. I think we can learn from them and improve our products.

And as always, I'm open for input & ideas. I'll keep you posted on mine!

hsluoyz commented 5 years ago

As @RomanMinkin mentioned, you can also consider Casbin (https://github.com/casbin/casbin). It is the most starred authorization library in Golang. There are several differences between Casbin and OPA.

| Feature | Casbin | OPA |
|---|---|---|
| Library or service? | Library/Service | Library/Service |
| How to write policy? | Two parts: model and policy. The model is the general authorization logic; the policy is the concrete policy rule. | A single part: Rego |
| RBAC hierarchy | Supports role hierarchy (a role can have a sub-role) | Not supported |
| RBAC separation of duties | Not supported | Supported: two roles cannot be assigned together |
| ABAC | Supports directly retrieving a Golang struct's members as attributes | Needs to be provided with an attribute list or Golang struct |
| Built-in functions | RESTful match, IP match, and regex are supported. You can also write your own Golang function and let Casbin use it | Functions like regex, max, min, count, and type conversion are supported. You can write your own built-in functions |
| Policy storage | All common databases are supported via dozens of middlewares: SQL, NoSQL, key-value, AWS S3, etc. | Not supported; you need to write your own code if you want to use a DB like MySQL |
| Conflict resolution | Allow-override, deny-override, allow-and-no-deny, and priority are supported out of the box. You can also write your own Effector logic (in code) for custom conflict resolution | Allow-override, deny-override, priority (but the grammar is a little long). You can also resolve conflicts inside Rego itself |
| Distributed authorization | You can use multiple Casbin instances together; sharding and policy change notification are supported | One single OPA service |
| Other programming languages | Golang, Java, PHP, Node.js, Python, .NET, Delphi, Rust | Golang |
| Adopters | Intel, VMware, Docker, Cisco, Banzai Cloud, Orange, Tencent Cloud, Microsoft | Netflix, Chef, SolarWinds, Cisco, Cloudflare, Pinterest, State Street Corporation |

(let me know if the above table is not accurate)

In conclusion, if you need an authorization service right now, you probably should use OPA. Otherwise, maybe Casbin is a better choice.

aeneasr commented 5 years ago

Hm, interesting, thank you for the comparison! I will definitely check out casbin too, it looks promising. Especially the "You can also write your own Golang function and let Casbin use it" part is something we need in order to properly implement a BC engine for ladon.

aeneasr commented 5 years ago

I checked in with the OPA maintainers and also took a swift look at casbin. Here are some of my findings:

OPA

Casbin

The DSL seems to me very close to the actual implementation of casbin. It feels like casbin implements some well-known patterns and allows a bit of customization around them. But once you want to do something exotic, I'm not sure that would work with casbin, as the project (casbin) itself may have to be modified. Personally, I find the DSL a bit easier to read than rego, but that comes at the cost of flexibility.

I am quite sure that we can't implement conditions with casbin; the DSL is too simple for that.

Next steps

I will experiment some more with rego for now; I think the project is very promising despite some of the disadvantages I noticed. Another possibility is to improve the ladon project, trying to reduce runtime complexity and adding * matching.

aeneasr commented 5 years ago

Here's an example of how evaluation of conditions could work: https://gist.github.com/tsandall/fd95373554653afff943b3b3efcc1509 - this is quite exciting :)

aeneasr commented 5 years ago

So, I was able to implement the ladon logic (without regex for now) using only rego. For now I added only the StringEqualCondition condition type (adding more is quite easy):

default allow = false

allow {
    # collect the effects of all policies matching the request
    effects := [effect | effect := policies[i].effect
            policies[i].resources[_] == input.resource
            policies[i].subjects[_] == input.subject
            policies[i].actions[_] == input.action
            all_conditions_true(policies[i])
        ]

    effect_allow(effects)
}

# allow only if at least one matching policy allows and none denies
effect_allow(effects) {
    effects[_] == "allow"
    not any_effect_deny(effects)
}

any_effect_deny(effects) {
    effects[_] == "deny"
}

# a policy's conditions hold if no single condition evaluates to false
all_conditions_true(policy) {
    not any_condition_false(policy)
}

any_condition_false(policy) {
    c := policy.conditions[condition_key]
    not eval_condition(c.type, c.options, condition_key)
}

eval_condition("StringEqualCondition", options, key) {
    input.context[key] == options.equals
}

A complete definition including some test cases can be found here: https://gist.github.com/aeneasr/1f6b9fbe60c9047f6127f9a26cc8d980
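
To make the condition part concrete, here is an illustrative policy document following Ladon's condition layout (type and options per condition key); with it, the eval_condition rule above compares input.context.clientIP against options.equals:

{
    "id": "example-policy",
    "subjects": ["peter"],
    "resources": ["articles:1234"],
    "actions": ["view"],
    "effect": "allow",
    "conditions": {
        "clientIP": {
            "type": "StringEqualCondition",
            "options": { "equals": "192.168.1.1" }
        }
    }
}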

This is generally very exciting because the code in ladon can be reduced significantly (well, it would be moved to the third-party library that implements rego) while making it possible to add more access control concepts on the fly and without much hassle. I think this would also allow us to easily extend and/or modify keto/ladon behaviour, maybe even at runtime?

OvermindDL1 commented 5 years ago

Just a note about the = operator in rego: it's not an assignment operator. In fact, it looks identical to the match operator in the Erlang/Elixir/etc. languages, including even the _ usage, assertions, etc. Perhaps thinking of it and documenting it in that way would be much clearer, as it was for those languages?

aeneasr commented 5 years ago

Ah, thank you for the clarification @OvermindDL1 - while I know some of the syntax (like _) from SML/NJ, the = was definitely new to me. I still think it's confusing, but the rego specification itself is very powerful.

Now that I had the chance to take some time off and think about the future of ORY Keto, I think there is a plan.

Vision

  1. ORY Keto will stay backwards compatible with ORY Ladon policies.
  2. ORY Keto will run an embedded OPA server (or rather rego parser).
  3. ORY Keto will no longer be "ORY Ladon" as a service, but instead a permission server that supports different common access control patterns, like RBAC, ACL, Ladon Policies, ...
  4. ORY Keto may be extendable or modifiable with your own rego modules.

Benefits

First of all, we will deprecate the ORY Ladon SDK in ORY Keto and instead move to a rego-backed decision engine. This will not impact backwards compatibility negatively: everything ladon was capable of doing before, ORY Keto will still be able to do.

This has various benefits:

  1. Preliminary benchmarks show that rego is a faster decision engine due to various optimizations (mostly caching)
  2. It is possible to get in-depth insights into the decision-making and even set up watchers that notify you if something out of the ordinary happens. We will have to see how easy this is to expose in Keto, but I think it's possible with a bit of work.
  3. We can easily add new access control patterns (like RBAC, ACL) without having to re-write decision engines, decision logs, ... every time.
  4. We can allow developers to add their own rego definitions during runtime

OPA itself is a really good piece of software. One issue I see is that they work on clustering, better storage systems, and so on as part of their start-up (which is totally ok). This means that there's tons of room for Keto to add value on top of OPA.

One important thing I learned from this discussion is that all of you value simplicity and just want to have something that works. Learning rego has been a bit challenging for me personally, and I completely understand that you don't want to learn a new language "just to do rbac". This in turn leads me to believe that we can provide off-the-shelf mechanisms (backed by whatever decision engine, for now rego) that solve this problem in an easy-to-consume way.

Architecture & API Design

So first of all, we'll obviously deprecate a large portion of the code base, as most of it is primarily there to integrate with ladon. Next, I think it would make sense to separate the API into access control modules, as sketched below.
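
For illustration, such a module-scoped layout could look roughly like this (a sketch only; the /engine prefix idea is discussed further down in this thread):

POST /engine/ladon/regex/allowed   # warden-style access check
GET  /engine/ladon/regex/policies  # manage policy documents
GET  /engine/ladon/regex/roles     # manage roles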

Similarly, there would be more endpoints like hrbac for hierarchical rbac, acl for access control lists, and so on.

I also think there could be some type of endpoint where you can expose your own rego logic. I'm not sure yet what that would look like, but that's also the reason why I think prefixing those patterns with ac, pattern, engine, or whatever, makes sense.

This of course means that there will be a breaking change in the next keto release (or whenever this is addressed) in terms of the HTTP API, but it will be possible to migrate older systems to the new version without much hassle.

This also means that the storage implementation will change. Storage has been a huge pain since the inception of ladon. The SQL statements are very complex, partly because we're using regular expressions that have to be matched, which is very, very slow in most databases. By moving to an "unstructured" storage approach (saving the rego files to a disk/memory/SQL store) and treating, for example, policies as unstructured documents (with some validation in place, of course), we can further increase the speed of the project.
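
As a sketch of that idea (the interface and names are mine, for illustration, not an actual keto API), the storage layer could shrink to something like:

package storage

// PolicyStore sketches the "unstructured document" storage approach
// described above: each policy is kept as one validated JSON document
// keyed by ID, instead of being normalized across many relational tables.
type PolicyStore interface {
	// Upsert stores a policy document; the document is expected
	// to be JSON that passed validation beforehand.
	Upsert(id string, document []byte) error
	Get(id string) ([]byte, error)
	List() (map[string][]byte, error)
}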

Anyways, that's it from me for now. I will experiment with setting up OPA as an embedded server today. I'd love to get your take on this. For me personally, this makes a ton of sense!

kfox1111 commented 5 years ago

Very interesting thread.

Just going to touch on the last comment. You mention using an unstructured storage approach, and I was going to ask if something similar could be implemented. These days we're using a lot of Kubernetes. Now that it supports CRDs (https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/), I've seen a lot of projects go stateless by storing their state in Kubernetes objects (you're already paying to maintain stateful storage for Kubernetes, so why not reuse it). Have you thought about using this mechanism for policy storage? It could make scaling Keto up/down and deployment easier.

kfox1111 commented 5 years ago

Oh, and perhaps a raw FS mode for storage if you have static policies. You could tie a prebaked Docker image for keto with either the Kubernetes git-checkout volume, or ship your policies in a Docker image added at runtime, so you can come up with a CI/CD-driven pipeline that produces an auto-scalable policy server.

dkushner commented 5 years ago

@kfox1111: We use Kubernetes almost exclusively for a large scale, multiple provider cloud-native stack as well. We have a few applications that use CRDs and annotations to allow applications to expose certain pieces of metadata on deployment. Most notably our gateway (Ambassador) and our service mesh (Istio). These are pretty unique cases in that they are projections into Kubernetes of what would normally be provider-hosted pieces of infrastructure. These components bridge the Kubernetes control plane with the application/data plane in a tightly controlled context. They use the Kubernetes API extensions and CRDs to accomplish this because that is the idiomatic way to configure this type of interaction. I'm not strictly against this sort of integration, but can you clarify what exactly you are hoping to gain from such a feature?

Kubernetes already has the concept of RBAC in the form of ClusterRole and ClusterRoleBinding in concert with ServiceAccount resources. Are you proposing that Keto be implemented as an RBAC agent in the Kubernetes control plane? As I understand your comment, you're looking to pre-configure Keto with a built-in set of rules as part of a deployment, but I fail to see how this helps accomplish what a normal pod/deployment lifecycle script could not. Keep in mind that CRD storage is not exactly a high-performance database. Resources are generally accessed at most once during a normal deployment, scaling event, etc.

MOZGIII commented 5 years ago

I'm also very interested in static policies and support for a stateless operation mode. Also, I've recently implemented a custom access control system for GraphQL. Maybe there's a way I could make use of Keto for managing policies? With OPA under the hood, it looks more possible than before...

aeneasr commented 5 years ago

Yes, I think tight integration with kubernetes is a big plus. It alleviates the pain of managing SQL and is definitely possible due to the simple data model of the ORY ecosystem. I'm not sure if we'll add a k8s connector right away, but it's on the list!

aeneasr commented 5 years ago

I was able to implement ladon's policies using pure rego. Check out the PR: https://github.com/ory/keto/pull/48

Please leave your comments, ideas, suggestions, feedback there. I will track the progress of the refactoring there.

kfox1111 commented 5 years ago

@dkushner I'm thinking, for example, of a helm chart from a user that launches an ORY Keto instance, an instance of its web service with a proxy configured for keto, and a set of CRDs for the policies it needs to implement security properly.

The whole thing would then be testable as a single artefact (helm package) and immutable as it moves through the dev/test/prod lifecycle. Policy would be just another object in the chart, so it fits very naturally into the k8s development cycle.

aeneasr commented 5 years ago

How do you all feel about deprecating the authenticators currently implemented in this project? An authenticator basically allows you to send some type of credentials (e.g. OAuth 2.0 Access Tokens, JSON Web Tokens, ...). The result of the authenticator is a string (the subject / "user-id") which is used as the "subject" value when calling ladon.
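
Conceptually, an authenticator is roughly this (a sketch of the idea, not keto's actual interface):

package authn

import "net/http"

// Authenticator sketches the concept described above: it inspects the
// request's credentials (OAuth 2.0 access token, JWT, ...) and returns
// the subject ("user-id") that is then used in the policy check.
type Authenticator interface {
	Authenticate(r *http.Request) (subject string, err error)
}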

I think this has two sides:

  1. Good:
    • It makes it very easy to just push credentials to this server and get the allowed-or-denied answer. No middleware madness.
  2. Bad:
    • It's sort of shitty to configure using environment variables. Especially when we talk about JWTs, we might need specific ways of validating (or ignoring) certain claims. This is something we learned at Oathkeeper, where people have custom JWT claims or need to set other claims in their tokens. The same goes for e.g. token scope, or probably also something like SAML assertions, which are much more complex to validate iirc.
    • It solves more problems than we really set out to solve here - are we actually looking to solve the whole authN + authZ stack? I think the answer is no. If you want the full solution, you can use oathkeeper or another proxy capable of doing authN!
    • It increases code complexity and maintenance burden. I think we could implement Keto with very little code, as most things will be delegated to OPA/rego. This is great for stability, maintainability, and probably also ease of use.

Happy to hear your thoughts!

MOZGIII commented 5 years ago

I'm not sure what this is about, but I'm pretty sure that if Keto has some built-in logic to parse a JWT and extract sub from it, it wouldn't be enough for my use case unless it also allows extracting some additional claims. Overall, I don't see the full picture, but my guess is that the scope of Keto should be bound to doing permission "calculations" as a service, and whoever calls it should prepare the data for the call in a readable form. In particular, I'd make it so JWTs are parsed by the "Gateway Service", and have that Gateway Service then call Keto to validate the permissions. However, there would be too many hops with that approach, so I'd probably consolidate permission validation and some other stuff at the Gateway Service itself (everything AuthN/AuthZ related). But that's off-topic already, so I'll stop myself here.

aeneasr commented 5 years ago

You analyzed it just right and summarized a solution similar to what I'm thinking of as part of the bigger picture (and what Oathkeeper does). Glad you confirmed my bias ;)

MOZGIII commented 5 years ago

In my view, the Gateway should also be merged with the API service - ideally something GraphQL-based - to allow hitting multiple services in a single query. This is beneficial since latency from the Gateway to the services is typically lower than latency between the User Agent and the Gateway. In fact, I implemented this design in the past and can recommend it, although it works best for relatively simple AuthZ models where access decisions can be made entirely at the Gateway level. The downside is that if an AuthZ decision has to occur deep inside a service (not the Gateway), you can't make a call to the permissions service to test the permissions. However, there are ways around it: first, most of the time it's possible to structure your API in such a way that all AuthZ decisions can be abstracted away to the Gateway; second, if you really stumble upon a case that's tricky and you absolutely need to do AuthZ inside the service (i.e. in the middle of execution), your Gateway can inject additional context into the underlying call to the service (at the Gateway<->service API layer) in whatever form suits the edge case - from passing an IP address to serializing and passing the currently applied RBAC rules in some form. By the way, the idea of passing a serialized subset of RBAC (or other kinds of) rules alongside the request from the Gateway to a service looks interesting. Has anyone seen it done before?

dkushner commented 5 years ago

@MOZGIII @aeneasr Man this conversation is moving fast. I'll try to catch up.

Kubernetes Integration

I know this isn't quite on-topic but the issue was raised and responded to, so I figure I'm okay. I think it would serve to elucidate the exact mechanism of integration here and what value it would bring to the project. Kubernetes is a tremendously powerful orchestration platform, to be sure, but it is not an application development platform. It is not intended as a data store for mission-critical data which, I think we can all agree, is exactly the kind of data that Ory products deal with. As I mentioned in my previous comment, there are a number of applications in the wild that integrate with Kubernetes to improve the ergonomics of deploying and maintaining their software. To me, the most salient and relevant examples are the Ambassador API gateway (see previous post for this) and the use of Vault as a Kubernetes secret store. Here's a brief overview:

  1. In the first case, Ambassador uses CRDs to eliminate the need to maintain an independent routing table or perform multi-step deployments into Kubernetes. This is useful to developers because it means they don't have to redeploy their gateway when they introduce a new application or modify an existing one. These CRDs contain basic configuration documents and are used as a way for applications to signal infrastructure about configuration changes at deployment. Emphasis mine. It does not offer an enhancement of the Kubernetes control plane, it merely uses the extensibility of the Kubernetes API to configure the product in a way that is familiar and convenient for developers.
  2. In the second case, Vault may be used to enhance existing functionality in the Kubernetes control plane. Here, Vault takes the place of Kubernetes' normally quite rudimentary secret storage mechanism and integrates with existing security/RBAC features to allow developers to store and retrieve secrets in a way that circumvents the original functionality (Secrets API) but leverages the rest of the security infrastructure (ClusterRoles tied to AppIDs, ServiceAccounts to encapsulate token exchange, etc) to make the developer's life easier.

I don't really see a valid use case for custom CRDs as it relates to Ory products. The only thing I can think of off-hand is automating the creation of new client IDs for deployed applications and associating roles with them? What problem are we looking to solve here? Perhaps people are using different stacks that have different challenges and I'd love to hear about them.

Alternatively, is the intent here to replace existing Kubernetes authentication/access control mechanisms with Hydra and Keto? In this case, I strongly disagree with the claim that this would offer the developer an improved deployment experience. It would make supporting deployed applications more difficult and expose control-plane services to serious vulnerabilities.

For context, our stack uses a deployment of Hydra pods behind a headless service. We also have a deployment of Keto pods with a similar service setup. Both of these services register themselves via CRDs/annotations with the deployment of Ambassador pods that acts as our API gateway and is exposed to the internet via provider-backed L7 load balancers. We use Vault for our secrets management in a method similar to the one described above. These secrets include the keys and configuration used to deploy Hydra and Keto. Ambassador performs the initial authentication checks on inbound requests since it has support for delegating request authentication to an arbitrary service (we're basically just validating the access token here). The services themselves merely validate the presence of the required access token and scope, while resource access is validated through Keto in parallel with the logic to actually retrieve the resources from the relevant datastore/downstream service.

Keto Authenticators

It solves more problems than we really set out to solve here - are we actually looking to solve the whole authN + authZ stack? I think the answer is no. If you want the full solution, you can use oathkeeper or another proxy capable of doing authN!

Yep, wholeheartedly agree. Software works best when it does one thing and does it well. This is also why I hesitate to support the move to OPA, but I am coming around to the idea after having played with the agent in a sandbox cluster for a bit. The stack I described in the previous section is simplistic but I feel represents the strategy most developers/architects apply in every-day use: pick the tool that best fits the job and reduce friction wherever integration is necessary. Hydra/Keto were the best fit for our requirement of having a standards-compliant OAuth 2.0/OID service that allowed us full ownership over identity management. Ambassador was our best-fit for a cloud-native API gateway and it reduced integration friction by supporting configuration via CRDs (we actually moved to this from Kong which had no/poor support for this).

MOZGIII commented 5 years ago

@dkushner, regarding integration with k8s - I'd say it has its own AuthN/AuthZ system that's great, and there's no need/use for Keto for AuthN/AuthZ at the ops layer (especially if you use k8s/istio). Keto is still useful, though, if you're building your own software and need permission management at a different level - for example, to authorize actions in a webapp you're building. I suppose you won't (although, technically, you can) use k8s authz and RBAC to manage your business users and actions - it'd be a really bad idea. But this is where you would use Keto.

dkushner commented 5 years ago

@MOZGIII Perhaps I was unclear, or just long-winded, but that's exactly how I describe using Keto in my current stack. The question still remains, what did you mean about Kubernetes integration?

MOZGIII commented 5 years ago

@dkushner, nvm. I don't think any k8s integration is needed, except for support of a stateless operation mode, where you'd just use a ConfigMap/Secret or env vars to provide the policies instead of storing them in the database.

MOZGIII commented 5 years ago

Btw, I'd like that for hydra too - currently we need this "init" container hanging around in the deployment just to create clients in every environment. A much better way would be to allow bootstrapping clients via env vars.

aeneasr commented 5 years ago

I know this isn't quite on-topic but the issue was raised and responded to, so I figure I'm okay. I think it would serve to elucidate the exact mechanism of integration here and what value it would bring to the project. Kubernetes is a tremendously powerful orchestration platform, to be sure, but it is not an application development platform.

This is actually on-topic. I am pushing an internal strategy towards kubernetes-native software!

It is not intended as a data store for mission-critical data which, I think we can all agree, is exactly the kind of data that Ory products deal with. As I mentioned in my previous comment, there are a number of applications in the wild that integrate with Kubernetes to improve the ergonomics of deploying and maintaining their software. To me, the most salient and relevant examples are the Ambassador API gateway (see previous post for this) and the use of Vault as a Kubernetes secret store.

I agree. Kubernetes is not intended to be a mature datastore with sophisticated transactions, atomicity, and durability. Application data belongs in a database, not in Kubernetes. However, I think there are areas where CRDs make sense for ORY products. Let's take a look:

Alternatively, is the intent here to replace existing Kubernetes authentication/access control mechanisms with Hydra and Keto? In this case, I strongly disagree with the claim that this would offer the developer an improved deployment experience. It would make supporting deployed applications more difficult and expose control-plane services to serious vulnerabilities.

No, at least I'm not talking about this. We're trying to solve Thermosphere L7 problems - problems that are specific to application logic as opposed to the orchestration of services.

Yep, wholeheartedly agree.

Great! Let's throw away that stuff!

@MOZGIII

A much better way would be to allow bootstrapping clients via env vars.

I was thinking for a while about importing clients on boot from e.g. /init.d/ if configured, but that's another discussion.

MOZGIII commented 5 years ago

kubernetes-native

Keep in mind that not everyone uses Kubernetes - there are lots of cases where it does more harm than good... And I'd still want to use your stuff without k8s :)

aeneasr commented 5 years ago

Yeah, absolutely! What I meant to say is that we are making a push towards CNCF membership and easier integration with the k8s ecosystem, and we are looking for ways to work together with istio architects. There is a ton of exciting stuff going on right now, but most of it still focuses on the platform aspect and gives little help on the application developer's side of things. This does not mean that we don't want people to be able to easily deploy any of our products to VMs, bare metal, Heroku, Cloud Foundry, ... :)

kfox1111 commented 5 years ago

@aeneasr I agree CRDs could be used as an optional alternative to a DB in certain subsets of Keto/Oathkeeper instantiations.

This statement though:

Kubernetes is not intended to be a mature datastore with sophisticated transactions, atomicity, and durability.

is semi-correct, but it is also more functional than that makes it out to be. There is atomicity and there are limited transactions: all updates to a document's fields happen at the same time (a transaction), and there are atomic primitives in the API that allow updating only if unchanged. This provides a fair amount of guarantees when leveraged.

One feature that relational databases don't usually have is the watch feature. With a CRD-based backend, each pod can directly cache all of its policy CRD data and then watch for CRD changes. So it won't put any remote load on a policy database unless the kube-apiserver tells it a change happened. This should help scalability and reliability too.
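
For readers unfamiliar with the mechanism, this is roughly what such a watch looks like with client-go (a sketch; the CRD group, version, and resource names are invented):

package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/dynamic/dynamicinformer"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the pod runs in-cluster with RBAC permission to watch the CRD.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Hypothetical policy CRD; group/version/resource are invented for illustration.
	gvr := schema.GroupVersionResource{Group: "keto.example.org", Version: "v1alpha1", Resource: "policies"}

	factory := dynamicinformer.NewDynamicSharedInformerFactory(client, 10*time.Minute)
	informer := factory.ForResource(gvr).Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    func(obj interface{}) { fmt.Println("policy added, refresh local cache") },
		UpdateFunc: func(_, obj interface{}) { fmt.Println("policy changed, refresh local cache") },
		DeleteFunc: func(obj interface{}) { fmt.Println("policy removed, refresh local cache") },
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	select {} // keep watching
}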

It's not a solution for everyone, but it does have some unique benefits for certain use cases.

MOZGIII commented 5 years ago

It's worth mentioning that CRDs are meant for configuration updates, not OTP. In that sense, they can be used for something like configuring clients in hydra, but can't be used for storing refresh tokens.

aeneasr commented 5 years ago

is semi-correct, but it is also more functional than that makes it out to be.

Oh man, after 10 (?) years of OSS maintainership I thought I had learned to never make unsubstantiated claims, but sometimes one slips out by accident. What I meant, of course, is that I don't see Kubernetes CRDs as a way to store business logic. This doesn't imply that CRDs aren't atomic or don't support transactions, just that they're not the right tool for this job. They definitely are the right tool for other jobs (like the ones mentioned earlier).

However, I can now actually give one back :)

One feature that relational databases don't usually have is the watch feature.

This was true a decade ago but is no longer the case. PostgreSQL and MySQL support it (hacky, but possible via triggers IIRC), and I'm sure others, like CockroachDB, support it too or are at least seriously considering adding it to the project. Modern architectures demand it :)

aeneasr commented 5 years ago

@dkushner you said you're actively using ORY Keto in prod at the moment. How much of an impact would it have for you if endpoints move around? I'm assuming you're using the warden/subject/authorized endpoint for policy validation? I don't think we can support the old URL structure once this patch lands, but I want to minimize the impact or at least provide good upgrade paths.

kfox1111 commented 5 years ago

Hehe. no worries.

I stand corrected on the db watching. Didn't realize that was a thing. /me wanders off to figure out what he can use his newfound db-watching powers for... :)

aeneasr commented 5 years ago

Build the next real-time decentralized (blockchain, ofc) social network (pets.com), of course!!! Oh wait, SQL.

Haha, back to topic :D


dkushner commented 5 years ago

@kfox1111: Self-auditing services or, in our case, CQRS/ES services.

@aeneasr: Yep! And we wrote support libs for the languages we use that implement the access-control-checking logic in an idiomatic way. All of them source their configuration from the environment/application config, so changing the endpoint should be a one-line change. Is the structure of the actual request changing, or just the endpoint route?

Also, we're in "production" such as it is. We're dogfooding it internally; it's not yet exposed to the outside world pending a security review, but it has been working as documented so far, which is all I really ask for. :+1:

aeneasr commented 5 years ago

Yeah, the actual payloads won't change. The warden endpoint will still expect a subject, resource, action, and context, whereas the policy definition itself is also left completely unchanged. Just the endpoint naming will shift to better prepare keto for more access control patterns. I'm currently thinking of prefixing everything with /engine and letting the concrete implementations (rbac, acl, ...) deal with the specifics of setting up the handlers. So, for example, /engine/ladon/regex/allowed, /engine/ladon/regex/policies, /engine/ladon/regex/roles. This would be the current way ladon's policies work. Then we would have /engine/ladon/equals/* which would be case-sensitive string matching (as opposed to regex), and /engine/ladon/glob/* for glob matching.

dkushner commented 5 years ago

@aeneasr, that is an interesting concept. In my mind, the difference between a Ladon and an OPA policy is merely content type, no? You mentioned earlier that the Ladon specification may be implemented as a subset of OPA; I think that is probably the most ergonomic way forward. Provide the ability to create policies using OPA documents for experienced users, and for users with simpler use cases, allow them to create policies via Ladon documents that are rendered to OPA assertions. You could even use the same endpoints, simply indicating the type of document with Content-Type header values. This maintains a RESTful route structure, allows seamless migration from Ladon-based policies to OPA, and complies with standards (i.e. content hinting over HTTP).

aeneasr commented 5 years ago

Yup, though I'm not sure about Content-Type because it's more about business logic. OPA as a decision engine is already available as a code preview in #48

dkushner commented 5 years ago

@aeneasr: Ah, I misunderstood the intent. I was assuming that Keto would support both Ladon and Rego documents as policy definitions and then simply render any submitted Ladon documents to Rego, given that it has the more expressive syntax. Would this not be a viable solution to allow for the gradual migration of existing users from Ladon to Rego? I'll take a look at the pending PR.

aeneasr commented 5 years ago

ORY Ladon implements a concept: access control policies. The implementation is written in Go. The same concept can be implemented with rego/OPA. #48 implements the ORY Ladon concept with rego instead of the ORY Ladon SDK. The only thing you need to migrate is some URLs, which will change - not the concept of access control policies.

aeneasr commented 5 years ago

This truly went better than expected. There is a full proof of concept which implements the full ORY Keto capabilities (Access Control Policies, Roles, Warden) but uses OPA as the decision engine/evaluator. The new code isn't even ~2000 LoC, despite already implementing equals and regex matching. Adding new modules such as RBAC, ABAC, and ACL should be a breeze. All we're doing is adding some input validation on top of rego and offering easier clustering and clear APIs without the need to learn rego. This is amazing; I'm really happy with the result :)

Check it out: https://github.com/ory/keto/pull/48

aeneasr commented 5 years ago

#48 is now merged!