open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0

Consider supporting IoT scale policies #534

Closed AceHack closed 6 years ago

AceHack commented 6 years ago

We have tens of millions of devices that in some cases need individual policies applied at the device level. From my understanding, OPA caches data for speed. In the IoT case, this cache could easily grow unbounded and use all available memory. How will this project work in this scenario?

Thanks.

timothyhinrichs commented 6 years ago

OPA wasn't designed explicitly for IoT, but it was designed to be run on the edge--close to the software that needs policy decisions. I think we'd need to understand more to answer your question.

Happy to chat on Slack, here on GitHub, or even on a Google Hangout to understand more.

AceHack commented 6 years ago

@timothyhinrichs

Best case, there are a few policies and they are applied to groups, not to individual devices. Worst case, there are multiple policies per device, each one unique to that device.

We have both scenarios.

Let me give a few scenarios:

  1. It's possible the devices themselves might run OPA; it depends, since not all IoT devices could, as you stated. In the case where devices run OPA, this should be no problem, as the number of policies applicable to a single device would be low. I am curious how OPA would know to "cache" a group-level policy on the device itself.
  2. There are field gateways that could possibly run OPA; each field gateway usually handles around 2,000-5,000 devices. Which devices are handled by which gateway is dynamic: devices swap back and forth between field gateways based on wireless RF/powerline mesh network performance. In this case, I'm again curious how only the policies applicable to the devices currently under a field gateway would get cached, without others overloading the memory on the gateway.
  3. There is a cloud gateway that handles all devices; some connect directly, some connect through field gateways. In this case, I'm very curious how the in-memory caching would not overload the memory of the cloud gateway. Today we cache "hot" data based on usage, so "cold" data is evicted from memory under memory pressure.

Thanks.

timothyhinrichs commented 6 years ago

That's helpful. Thanks.

First a few general remarks.

OPA is designed as a cache--not the source of truth--for policies so it can make fast policy decisions in close proximity to the software needing the decisions. It does not have a runtime dependency on, say, a database, so if your project/service already has a storage system (and most do), you don't need to add another one just for OPA.

What that means is that policy storage and distribution are up to you. A couple of thoughts here...

Now for each of your 3 scenarios, let's assume you have an OPA wrapper/sidecar that's responsible for fetching policies.

  1. You could either configure the OPA wrapper with the names of the policies it needs, or have the OPA wrapper ask a service which policies it needs. (You could imagine building that service with OPA; see the sketch after this list.)
  2. I'd expect the S3 > database > memory caching scheme with dynamic fetching would work here.
  3. For the cloud gateway, I see some options:
    • the wrapper could store the policies in a local DB and keep the most-recently used ones in OPA's memory.
    • you could horizontally scale the cloud gateway, sharding the policies and directing traffic to the appropriate shard.
    • you could enforce less granular policies at the gateway, and enforce more fine-grained policies at different enforcement points on traffic that makes it deeper into the system.
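
To make option 1 above concrete, here is a minimal, hypothetical sketch (in classic Rego syntax, contemporary with this thread) of a "which policies does this node need" policy. The package name, the data.device_policies document, and the input shape are all assumptions for illustration, not anything OPA ships with.

```rego
package distribution

# Group-level policy bundles every node should load (illustrative names).
base_policies = {"common/firmware", "common/telemetry"}

# Set of policy ids a node needs: the common bundles...
needed_policies[p] {
    base_policies[p]
}

# ...plus per-device policies for the devices the node currently handles.
# data.device_policies is assumed to map a device id to its policy ids.
needed_policies[p] {
    device := input.devices[_]
    p := data.device_policies[device][_]
}
```

The OPA wrapper on a field gateway could query needed_policies with its current device list and fetch or evict policies accordingly.
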
tsandall commented 6 years ago

Closing this for now.

manifest commented 6 years ago

We have a similar use case. We're providing Massive Open Online Courses (MOOCs) to our customers. On one hand we have a subset of our users (for instance, regular users, a dataset of 10k items), and on the other a subset of our webinars (for instance, webinars of the previous year, a dataset of 100k items). What we want to express (in most of our cases) is to allow our regular users read access to the webinars of the previous year.

It looks like we would face degraded performance and increasing memory consumption as the dataset grows.

It would be nice to have horizontal scalability for OPA. Maybe we should reopen the issue, just to understand the number of use cases similar to ours?
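
For illustration, a rule like the one described might look roughly like the following sketch, assuming hypothetical data.users and data.webinars documents loaded into OPA as base data and a hypothetical input shape; it is the size of that base data that drives the memory concern.

```rego
package mooc.authz

default allow = false

# Regular users may read webinars from the previous year.
# data.users and data.webinars are assumed to be objects keyed by id.
allow {
    input.action == "read"
    user := data.users[input.user_id]
    user.role == "regular"
    webinar := data.webinars[input.webinar_id]
    webinar.year == 2017  # "previous year" hardcoded only for illustration
}
```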

mjugger commented 5 years ago

I second that. I'm trying to build a UI that will allow end users to create policies around the resources they have created and want to share with other users/organizations within my app. The UI would abstract away the coding aspect and instead give users form fields and dropdowns. Can you provide an idea of how many cached policies is too many?
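
One way to keep the number of generated rules (and memory) down in a case like this is to translate UI selections into data rather than into new rules, so a single generic rule covers all grants. A minimal sketch, assuming a hypothetical data.grants document that the UI backend would maintain:

```rego
package sharing

default allow = false

# One generic rule; each UI "share" action adds an entry to data.grants
# instead of generating a new rule.
allow {
    grant := data.grants[_]
    grant.resource == input.resource
    grant.user == input.user
    grant.action == input.action
}
```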

tsandall commented 5 years ago

@mjugger I recommend you do some benchmarking of your own with representative policies, because it really depends on the policy/data in question. To give you a rough idea, I was recently doing some benchmarking on the tip of master and found that with ~8K rules loaded, OPA consumed ~150MB of memory. With ~80K rules loaded, OPA consumed ~1.5GB (i.e., approximately 10x more, which was expected). The rules in question were fairly simple, i.e., they performed a match on three input attributes that specified the subject, action, and resource to authorize. Note that, in this scenario, the partial evaluation and compile time for ~80K rules is quite high (e.g., 30-40 seconds), so it's assumed that could be done offline/out-of-band of incoming requests. However, since the rules were understood by OPA's indexer, the evaluation itself was very fast (< 100 microseconds). Hope this helps.
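
For readers unfamiliar with the shape of the rules described above, a hedged sketch follows (all names and values are made up). Because every condition is an equality against an input attribute, OPA's rule indexer can locate the matching rules without evaluating all ~80K of them.

```rego
package authz

default allow = false

# One of many generated rules; each matches exact subject/action/resource values.
allow {
    input.subject == "user:alice"
    input.action == "read"
    input.resource == "webinar:1234"
}

allow {
    input.subject == "user:bob"
    input.action == "write"
    input.resource == "webinar:5678"
}
```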