ory / keto


Evaluate queries need the entire database in cache to work #187

Closed SomaticIT closed 3 years ago

SomaticIT commented 4 years ago

Is your feature request related to a problem? Please describe.

Keto will run into a lot of issues when integrated into big applications. To evaluate an IsAllowed request, it loads the entire database into cache and then applies rego policies.

Imagine an application used by 100k users, with 100 policies and 5 roles per user on average. Each evaluate request will need to fetch 10M policies and 500k roles into cache and apply rego to 10.5M entities.

I think that's not a good option.

Describe the solution you'd like

I don't have the optimal solution right now.

I can imagine multiple solutions:

  1. Having a way to apply rego directly in the database
  2. Having a way to pre-filter entities before applying rego
  3. Providing an argument in the Evaluate request to filter policy and role IDs by flavor (e.g. policies:tenants:1:** using the solution proposed in #186) for multi-tenant systems
  4. Moving to a CQRS approach where a rego_data_request table is rebuilt whenever a new policy is inserted

I think you may have other ideas. What do you think?

Describe alternatives you've considered

The only alternative I see is to have an instance of Keto per tenant, which is very hard to maintain.

Additional context

I saw that it is possible to run Go code directly in Postgres: https://github.com/dbudworth/gopgfuncs

aeneasr commented 4 years ago

Yeah, we have been through this over multiple cycles. In the first iteration, we used another business logic framework called Ladon. We used regexp matching in SQL there, but SQL obviously can't use indices in that case, which makes queries quite slow when many policies are used.

The next iteration (current master) uses rego on top of a cache. Unfortunately, rego turned out to be much heavier and slower than expected, and we ended up with several performance-related issues because of that. Rego/OPA requires something like 35GB of memory and 6 CPUs if you're really stress-testing the system with complex policies (#104).

The internal decision is to move away from OPA and back to Ladon (for AWS-style policies). However, we're not sure yet how to solve the querying part and/or caching part.

In general, policies are better suited for use cases where you have few of them. Using 100 policies per user sounds more like access control lists to me which we also want to support but don't do right now!

I think anything that runs in the database (e.g. pg funcs) will eventually run into the same problems as we do right now.

To be honest, I think that Keto will evolve into something that works better with RBAC (+ attributes) / ACL than in the AWS IAM policy direction. While it was a pretty good idea to define a common language for access rights, AWS IAM policies have caused a lot of headaches, for example around accidental data leaks. I think the move by Google and others away from policy DSLs and towards RBAC + attributes (e.g. role "VM Owner" on GCP project "my-project") illustrates that.

One major problem we have discovered, though, is that data filtering is very hard. Let's take lists as an example. Assume you have a list of items but want to show only those items that you created (so basically an ACL) - how do you enforce that with something like Keto? It's not feasible to query all rows in your SQL database and then make one REST call to Keto per row to figure out whether you can show each item or not. Is fetching that list from Keto first the answer? Well, if you want synchronization issues then probably yes :D
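
To make that concrete, a naive per-row check looks roughly like the Go sketch below. The /engines/acp/ory/{flavor}/allowed endpoint and the {"allowed": ...} response follow the 0.x ACP API as far as I remember; the items table and Item type are made up purely for illustration. The point is that it costs one Keto round trip per row, which is exactly the pattern that doesn't scale:

package listfilter

import (
	"bytes"
	"database/sql"
	"encoding/json"
	"net/http"
)

// Item stands in for one row of the list we want to display.
type Item struct {
	ID    string
	Owner string
}

// filterVisible loads every row and then asks Keto, row by row, whether the
// subject may view it - one HTTP round trip per row.
func filterVisible(db *sql.DB, ketoURL, subject string) ([]Item, error) {
	rows, err := db.Query(`SELECT id, owner FROM items`)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var visible []Item
	for rows.Next() {
		var it Item
		if err := rows.Scan(&it.ID, &it.Owner); err != nil {
			return nil, err
		}

		payload, _ := json.Marshal(map[string]string{
			"subject":  subject,
			"action":   "view",
			"resource": "items:" + it.ID,
		})
		// Keto 0.x exposes policy decisions under /engines/acp/ory/{flavor}/allowed.
		resp, err := http.Post(ketoURL+"/engines/acp/ory/exact/allowed",
			"application/json", bytes.NewReader(payload))
		if err != nil {
			return nil, err
		}
		var decision struct {
			Allowed bool `json:"allowed"`
		}
		err = json.NewDecoder(resp.Body).Decode(&decision)
		resp.Body.Close()
		if err != nil {
			return nil, err
		}
		if decision.Allowed {
			visible = append(visible, it)
		}
	}
	return visible, rows.Err()
}

For a list of any real size that means thousands of round trips per page load, which is why filtering needs to happen closer to the data.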

This is pretty well explained in this blog post where Torin (OpenPolicyAgent maintainer) explains the problem. I am unfortunately not sure if the answer provided there (given that it e.g. doesn't work with JOINs) is what we're looking for - it looks quite complex and prone to errors during operation, for example when running migrations. Here's an example project though.

In conclusion, there are two major problems haunting this project at the moment - performance and filtering. Unfortunately, resources are quite sparse right now and other projects take precedence, which is why we haven't been able to work on this. But we'd really like to solve these issues (or explicitly declare e.g. data filtering a non-goal).

Any input is welcomed!

alexander-alvarez commented 4 years ago

Hi,

Thanks for all your work across the ORY stack. We're currently evaluating different components of it, and I was testing out Keto to see its limits before deciding whether we could adopt it for our use case.

In my test I had 1 policy that would hypothetically be used by many users who have different roles. I tested with one policy to get an idea of how Keto would perform with just a handful of policies, a handful of roles, but lots of users:

#   9,000 role-mappings, 1 policy -- 0.063 sec
#  90,000 role-mappings, 1 policy -- 0.272 sec
# 900,000 role-mappings, 1 policy -- 2.600 sec

It seems related to what you mention above about loading all the data into OPA before it can be evaluated (these lines: https://github.com/ory/keto/blob/master/engine/ladon/handler.go#L487). This makes me wonder whether it would be better if the role information were stored as some sort of many-to-many mapping, so that we could narrow the data down to only the roles applicable to the user when querying by a user subject, and then provide all of that user's roles as the OPA "input" instead of having OPA load the entire corpus of user mappings just to evaluate one subject's access.

So for example:

k = number of policies
n = number of user-role mappings
m = number of roles for a single user
# current solution
k * n
# proposed idea
k * m

I think it's safe to say that for most systems m is smaller than n by orders of magnitude.

I think as a temporary workaround we can keep the user-role mappings out of Keto and use the role as the subject in the initial request. If a user has multiple roles, we'd unfortunately have to query once per role, or simply not allow a user to have multiple roles and design the policies etc. accordingly. A rough sketch of that workaround is below.
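
This assumes Keto's /engines/acp/ory/{flavor}/allowed decision endpoint and a roles slice that the application resolves from its own user-role table; the helper name is made up:

package authz

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// isAllowedForAnyRole asks Keto once per role, using the role (not the user) as
// the subject of the check, and allows the request if any single role is allowed.
func isAllowedForAnyRole(ketoURL, action, resource string, roles []string) (bool, error) {
	for _, role := range roles {
		payload, _ := json.Marshal(map[string]string{
			"subject":  role, // the role is the subject; the user->role mapping lives outside Keto
			"action":   action,
			"resource": resource,
		})
		resp, err := http.Post(ketoURL+"/engines/acp/ory/exact/allowed",
			"application/json", bytes.NewReader(payload))
		if err != nil {
			return false, err
		}
		var decision struct {
			Allowed bool `json:"allowed"`
		}
		err = json.NewDecoder(resp.Body).Decode(&decision)
		resp.Body.Close()
		if err != nil {
			return false, err
		}
		if decision.Allowed {
			return true, nil // one allowing role is enough
		}
	}
	return false, nil
}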

Does this analysis make sense? Any suggestions?

aeneasr commented 4 years ago

I don't quite remember the exact internals, but small optimizations like that might well be possible. As said above, there's still a plan to move away from OPA, but we're currently lacking developers who could take the lead on this as we are working on Kratos.

alexander-alvarez commented 4 years ago

I totally understand. Is there anything I can do to help out that doesn't involve swapping out the entire engine?

We're bound to build out these user -> role mapping tables like I mentioned above, so it would be great if it could be done in a way that contributes back and fits with your architecture.

aeneasr commented 4 years ago

I totally understand. Is there anything I can do to help out that doesn't involve swapping out the entire engine?

😅 I think that to resolve the issue mentioned in the OP, that is the only possibility right now. I feel uncomfortable merging anything database-specific (e.g. a solution that only works on Postgres) because all databases should work more or less equally well.

alexander-alvarez commented 4 years ago

I feel uncomfortable merging anything database-specific (e.g. a solution that only works on Postgres) because all databases should work more or less equally well.

Agreed.

What I had in mind was introducing a few new tables instead of persisting the mappings in the rego data, which I guess could be considered a step towards what the setup might look like without rego?

role_mapping

role_id       | user_id
--------------+--------
car.admin     | user.0
payment.admin | user.1
car.viewer    | user.1

subject_policy_mapping

policy_id         | subject_id
------------------+-----------
car.admin.edit    | car.admin
car.view.products | car.viewer
car.view.products | car.admin
car.misc          | user.1

WITH role_mappings AS (
    -- all roles this subject is tied to (if the subject is a user)
    SELECT role_id
    FROM role_mapping
    WHERE user_id = <subject_id>
),
policies AS (
    -- all policies that map to any of the roles tied to this subject, or to the
    -- subject itself (when the policy is assigned directly to a user subject, or
    -- the subject provided is actually a role already)
    SELECT policy_id, subject_id
    FROM subject_policy_mapping spm
    WHERE collection = <collection>
      AND (
        spm.subject_id IN (SELECT role_id FROM role_mappings)
        OR spm.subject_id = <subject_id>
      )
)
SELECT r.*, p.subject_id
FROM rego_data r
JOIN policies p ON r.document->>'id' = p.policy_id

Does that make sense?

aeneasr commented 4 years ago

Oh I see, so you want to filter the policies based on the role already in the database!

The problem with this approach is the matching strategy: neither glob nor regex matching can be executed in the database using an index. If we run a regex comparison on e.g. the subject_ids, we still end up with O(n) * O(m), where O(n) is the regex complexity and O(m) is the number of policies. It's actually possible that Go will perform better here, because Go's regexp is guaranteed to run in linear time and we have all the data in memory as opposed to on disk.
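
For reference, the in-memory variant looks roughly like this (trimmed-down types for illustration, not actual Keto code): every check still scans all policies, but each individual regex match is linear in the subject length.

package match

import "regexp"

// policy is a trimmed-down stand-in for an AWS-IAM-style policy whose subjects
// are patterns, so a check has to be matched against every policy.
type policy struct {
	ID       string
	Subjects []*regexp.Regexp
}

// matchingPolicyIDs returns the IDs of all policies whose subject pattern
// matches the request subject. The scan is O(number of policies); Go's regexp
// engine (RE2) guarantees that each MatchString runs in linear time, which is
// why doing this in memory can beat a regex comparison inside the database.
func matchingPolicyIDs(policies []policy, subject string) []string {
	var ids []string
	for _, p := range policies {
		for _, re := range p.Subjects {
			if re.MatchString(subject) {
				ids = append(ids, p.ID)
				break
			}
		}
	}
	return ids
}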

So while this would work for the exact strategy, it would fail for regex and glob. This was the problem that haunted Ladon / ORY Keto before OPA, and one of the reasons we wanted to give OPA a try. Unfortunately, OPA performed even worse for use cases with many decisions in parallel.

Honestly, I think the best approach would be to figure out a good cache / in-memory structure that can filter data with regex/glob patterns. One example would be caching the policy IDs found for a subject and invalidating those matches if any of the returned policy IDs change. I'm not sure we would get a ton of advantage from using e.g. groupcache (or etcd or any other distributed k/v store), but it could be an optimization on top.
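
As a very rough sketch of that idea (illustrative only, none of this exists in Keto today): cache the matched policy IDs per subject and drop the entry whenever one of those policies changes.

package match

import "sync"

// subjectCache remembers which policy IDs matched a given subject so that the
// expensive regex/glob scan only runs on a cache miss.
type subjectCache struct {
	mu      sync.RWMutex
	matches map[string][]string // subject -> matching policy IDs
}

func newSubjectCache() *subjectCache {
	return &subjectCache{matches: make(map[string][]string)}
}

func (c *subjectCache) get(subject string) ([]string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	ids, ok := c.matches[subject]
	return ids, ok
}

func (c *subjectCache) put(subject string, policyIDs []string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.matches[subject] = policyIDs
}

// invalidate drops every cached entry that references the changed policy, so the
// next check for those subjects falls back to a full pattern match.
func (c *subjectCache) invalidate(changedPolicyID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for subject, ids := range c.matches {
		for _, id := range ids {
			if id == changedPolicyID {
				delete(c.matches, subject)
				break
			}
		}
	}
}

The obvious gap is a newly created policy that starts matching a subject which already has a cache entry - that would need a broader invalidation (or a short TTL), which this sketch doesn't cover.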

If we have that, we already solved the matching part and can actually rip out OPA.

aeneasr commented 3 years ago

I am closing this issue as our Google Zanzibar-based refactoring is scheduled to be released soon. Ory Keto up to version 0.5 will go into hibernation mode and receive only critical security patches.