Consider supporting IoT scale policies

AceHack commented 6 years ago

We have 10s of millions of devices that in some cases need individual policies applied at the device level. From my understanding OPA caches data for speed, In the IoT case, this could easily blow out and use all memory, how will this project work in this scenario?

Thanks.

timothyhinrichs commented 6 years ago

OPA wasn't designed explicitly for IoT, but it was designed to be run on the edge--close to the software that needs policy decisions. I think we'd need to understand more to answer your question.

Do the 10M devices all have different policies, or are there relatively few different policies?
OPA probably won't run on every IoT device, but is there a place close to the devices where OPA could run? Then each OPA instance could have just the policies relevant for the nearby devices.

Happy to chat on Slack or here on Github or even on a google hangout to understand more.

AceHack commented 6 years ago

@timothyhinrichs

Best case there are few policies and they are applied to groups not devices themselves. Worse case there are multiple policies per device each one unique to the device

We have both scenarios

Let me give a few scenarios: 1) It's possible the devices might run OPA, it depends, not all IoT devices could as you stated. In the case where devices run OPA, this should be no problem as the number of policies applicable to a device would be low. I am curious how OPA would know to "cache" a group level policy on the device itself. 2) There are field gateways that could possibly run OPA, each field gateway usually handles around 2000-5000 devices each. Also, it's dynamic what devices are handled by which gateway, the devices dynamically swap back and forth between field gateways based on wireless RF/powerline mesh network performance. In this case, I'm again curious how only the policies applicable to devices under the field gateway would get caches but not others that would overload the memory on the gateway. 3) There is a cloud gateway that handles all devices, some connect directly, some connect though field gateways. In this use case, I'm very curious how the in-memory caching would not overload the memory of the cloud gateway. Today we cache "hot" data based on usage so "cold" data is evicted out of memory with memory pressure.

Thanks.

timothyhinrichs commented 6 years ago

That's helpful. Thanks.

First a few general remarks.

OPA is designed as a cache--not the source of truth--for policies so it can make fast policy decisions in close proximity to the software needing the decisions. It does not include a runtime dependency on say a database so that if your project/service already has a storage system (and most do), you don't need to add another one just for OPA.

What that means is that policy storage and distribution is up to you. Couple of thoughts here...

For Kubernetes, we used the API server (backed by etcd) for storage and built a side-car next to OPA to pull policy out of the API server and push it into OPA.
Netflix built their own storage/distribution layer when using OPA for microservice authorization. Talk from Kubecon
You could use something like AWS's S3 to store policies, combined with a service that knows which OPAs need which policies. You could imagine a wrapper around OPA that keeps a cache of whatever policies have been most recently used and dynamically fetches policies as necessary. Maybe you'd want a local database (or file system) to cache policies on disk so you don't need to hit S3 on every cache miss, giving you a caching hierarchy like: S3 > Database > In-memory.
OPA's policies can be arranged hierarchically (like a file system). So if you can structure your policies properly, your wrapper around OPA could be configured at each location to pull the right policies.
There is a pluggable storage layer in OPA. We'd love to have people add plugins for different storage systems like etcd, postgres, etc.
OPA includes algorithms to identify dependencies among policies so if you know you need decision from one policy, you can ask OPA to tell you about all the other policies you need. That only helps if you're running OPA as part of the management layer to make decisions about what policies should be sent where.

Now for each of your 3 scenarios, let's assume you have an OPA wrapper/sidecar that's responsible for fetching policies.

You could either configure the OPA wrapper with the names of policies it needs or have the OPA wrapper ask a service which policies it needs. (You could imagine building that service with OPA.)
I'd expect the S3 > database > memory caching scheme with dynamic fetching would work here
For the cloud gateway, I see some options
- the wrapper could store the policies in a local DB and keep the most-recently used ones in OPA's memory.
- you could horizontally scale the cloud gateway, sharding the policies, and directing traffic to the appropriate shard.
- you could enforce less granular policies at the gateway, and enforce more fine-grained policies on traffic that makes its deeper into the system at different enforcement points

tsandall commented 6 years ago

Closing this for now.

manifest commented 6 years ago

We have a similar use case. We're providing Massive Online Open Cources (MOOC) for our customers. On the one hand we have a subset of our users (for instance regular users, dataset of 10k items) and on another a subset of our webinars (for instance webinars of previous year, dataset of 100k items). What we want to perform (in most of our cases) is to allow our regular users a read access to webinars of previous year.

It looks like we would face degradation in performance and increasing memory consumption while dataset grows.

It would be nice having horizontal scalability for OPA. Maybe we should reopen the issue? Just to understand a number of use cases similar to ours.

mjugger commented 5 years ago

We have a similar use case. We're providing Massive Online Open Cources (MOOC) for our customers. On the one hand we have a subset of our users (for instance regular users, dataset of 10k items) and on another a subset of our webinars (for instance webinars of previous year, dataset of 100k items). What we want to perform (in most of our cases) is to allow our regular users a read access to webinars of previous year.

It looks like we would face degradation in performance and increasing memory consumption while dataset grows.

It would be nice having horizontal scalability for OPA. Maybe we should reopen the issue? Just to understand a number of use cases similar to ours.

I second that. I'm trying to create a UI that will allow end-users to create policies around the resources they created that they want to share with other users/organizations within my app. The UI would abstract the coding aspect of it and instead give users a UI with field and dropdown elements. Can you provide an idea of how many cached policies is too many?

tsandall commented 5 years ago

@mjugger I recommend you do some benchmarking of your own with representative policies because really depends on the policy/data in question. To give you a rough idea, I was recently doing some benchmarking on the tip of master and found that with ~8K rules loaded, OPA consumed ~150MB memory. With ~80K rules loaded, OPA consumed ~1.5GB (i.e., approximately 10x more, which was expected.) The rules in question were fairly simple, i.e., they performed a match on three input attributes that specified the subject, action, and resource to authorize. Note, in this scenario, the partial evaluation and compile time for ~80K rules is quite high (e.g., 30-40 seconds) so it's assumed that could be done offline/out-of-band of incoming requests. However, since the rules were understood by OPA's indexer, the evaluation itself was very fast (< 100 microseconds). Hope this helps.

open-policy-agent / opa

Consider supporting IoT scale policies #534