open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0

Compile API performance degradation with dynamic policy composition #5216


srlk commented 2 years ago

Hello

We are using dynamic policy composition to evaluate policies. We have over 1000 policies, and a subset of them is evaluated to reach a final decision (allow/deny).

We have noticed a significant slowdown in the Compile API as the number of policies increases, even though the majority of the policies are not evaluated, thanks to dynamic policy composition.

Short description

OPA version: 0.43.0

I have created a gist to replicate the behavior with a simplified version of our policies. More details in the next section.

Steps To Reproduce

1. First, create the policies by running createpolicies.sh with 10 policies.

This will create main.rego:

package main
denies[x] {
    x := data.policies[input.type][input.subtype][_].denies[_]
}
any_denies {
    denies[_]
}
allow {
    not any_denies
}

and 10 policies with different package names:

package policies["1"]["1"].policy1

denies[x] {
    input.attribute == "1"
    x := "1"
}
2. Start OPA as a server and run a test against the Compile endpoint with ab, using this request body:
{
  "query": "data.main.allow == true",
  "unknowns": [
    "input.attribute"
  ],
  "input": {
    "type": "1",
    "subtype": "1"
  }
}
3. Results with the 10-policy run are quite good:
Percentage of the requests served within a certain time (ms)
  95%     18
  98%     23
  99%    113
 100%    434 (longest request)
4. Create 10000 policies by running createpolicies.sh and re-run the tests. The results show a big degradation:
Percentage of the requests served within a certain time (ms)
  95%    591
  98%    711
  99%    820
 100%   1674 (longest request)
5. As a final test, I modified main.rego and hard-coded the fields (normally received via input) that are used for dynamically loading policies:
x := data.policies[input.type][input.subtype][_].denies[_] --> x := data.policies["1"]["1"][_].denies[_]
6. Re-running the tests with the modified main.rego, the results are promising again:
Percentage of the requests served within a certain time (ms)
  95%     19
  98%     21
  99%    115
 100%    121 (longest request)
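The hard-coding experiment in steps 5 and 6 can be illustrated with a toy model. This is not OPA's actual implementation, just a sketch of why constant keys keep the work proportional to one bucket while unknown keys force the compiler to keep every policy as a candidate (the function names and sizes are made up for illustration):

```python
# Toy model (not OPA internals): with input.type/input.subtype known,
# data.policies[t][s] is a pair of hash lookups; during partial evaluation
# with those keys unknown, no branch can be pruned, so the amount of work
# grows with the total number of policies.

def build_policies(n_types, n_subtypes, per_bucket):
    return {
        str(t): {
            str(s): {f"policy{t}_{s}_{i}": ["deny"] for i in range(per_bucket)}
            for s in range(n_subtypes)
        }
        for t in range(n_types)
    }

def visited_with_constants(policies, t, s):
    # Known keys: direct lookup, touches one bucket regardless of total size.
    return list(policies[t][s])

def visited_with_unknowns(policies):
    # Unknown keys: every package is a candidate that must be considered.
    return [pkg for sub in policies.values()
                for bucket in sub.values()
                for pkg in bucket]

policies = build_policies(n_types=10, n_subtypes=10, per_bucket=10)
print(len(visited_with_constants(policies, "1", "1")))  # 10
print(len(visited_with_unknowns(policies)))             # 1000
```

The same gap explains the benchmark numbers: the hard-coded variant stays flat as the policy count grows, while the dynamic variant scales with it.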

Expected behavior

The expected behavior is no performance penalty on the Compile API when dynamic composition is used.

Additional context

| Policy | Compile p95 latency | ab results | Compile API output with metrics |
|---|---|---|---|
| 10 policies, dynamic composition based on input | 18 ms | link | link |
| 10000 policies, dynamic composition based on input | 591 ms | link | link |
| 10000 policies, dynamic composition (hard-coded in policy) | 19 ms | link | link |
srenatus commented 2 years ago

Thanks for the detailed report. I'm not sure there is much we can do here, though. There's complexity that we can't make disappear when using dynamic composition, so your expectation perhaps just can't be met.

That said, an increase like that, given the increase of policies, doesn't seem that bad. The step from 10 policies to 10k is huge.

anderseknert commented 2 years ago

The problem here isn't dynamic policy composition though, but the combination of that and the Compile API, no? I loaded the test data provided by @srlk and queries over the data API are still ~1 ms or so, which makes sense given how the query is essentially just a hash lookup. What makes the Compile API different in that regard? πŸ€”

srlk commented 2 years ago

Hey @anderseknert, you are absolutely right: the issue happens when dynamic composition is used together with the Compile API. It's not only latency that increases; memory usage also grows with the Compile API. When a regular query is made, we observe no slowdown. As a workaround, instead of a single main.rego entry point, we have created multiple entry points: main.type1.rego, main.type2.rego, main.type3.rego ...

package main.type1
denies[x] {
    x := data.policies["type1"][input.subtype][_].denies[_]
}
any_denies {
    denies[_]
}
allow {
    not any_denies
}

This keeps Compile API performance manageable, until the remaining dynamic part of the policy (input.subtype in this example) grows in the number of objects.
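The sharded-entry-point workaround described above could be scripted. A hedged sketch follows; the template mirrors the main.type1 example in this thread, but the generator function, file naming, and the list of types are assumptions for illustration, not part of the reporter's actual setup:

```python
# Sketch: generate one entry-point package per type so that only
# input.subtype remains dynamic in each main.<type>.rego file.
from pathlib import Path

TEMPLATE = """package main.{t}
denies[x] {{
    x := data.policies["{t}"][input.subtype][_].denies[_]
}}
any_denies {{
    denies[_]
}}
allow {{
    not any_denies
}}
"""

def write_entrypoints(types, out_dir="."):
    # Writes main.<type>.rego for each type and returns the created paths.
    paths = []
    for t in types:
        path = Path(out_dir) / f"main.{t}.rego"
        path.write_text(TEMPLATE.format(t=t))
        paths.append(path)
    return paths

# Example: write_entrypoints(["type1", "type2", "type3"])
```

Each generated file pins the first composition key to a constant, which is exactly the transformation that restored performance in the earlier hard-coding experiment.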

anderseknert commented 2 years ago

Given that the inputs used for policy composition are known, which they are in this case, and replacing them with constants renders the expected performance characteristics... yes, this looks like a bug to me.

anderseknert commented 2 years ago

Any updates here, @philipaconrad? πŸ™‚

philipaconrad commented 2 years ago

@anderseknert The situation is better overall, now that #5307 is merged. It sped up type checking during Compile operations, and pushed out the "degradation zone" for policy compilation to around 2500+ policies, instead of ~1000 policies.

However, I can't say the issue is entirely resolved; there are still issues where the Golang slice allocator blows up during type checking, and that will require some serious refactoring work to resolve. It's why I didn't mark #5307 as resolving this issue. :sweat_smile:

anderseknert commented 2 years ago

Thanks @philipaconrad! @srlk would be interesting to see what your numbers look like testing this with OPA v0.46.1 πŸ˜ƒ

ashutosh-narkar commented 1 year ago

Closing this given the updates in https://github.com/open-policy-agent/opa/pull/5307. Feel free to re-open if the issue still exists.

philipaconrad commented 1 year ago

@ashutosh-narkar The allocation explosion in the typechecker still exists at larger policy sizes, based on the benchmarks in #5757, and #5307. We should reopen this, until we address the underlying problem.

AdrianArnautu commented 1 year ago

@philipaconrad / @ashutosh-narkar any plans to restart the work on this task? We're very keen to reduce the compile API evaluation time even by small margins.

ashutosh-narkar commented 1 year ago

@AdrianArnautu we kept this issue in the backlog and intend to investigate further. I don't have a timeline atm but we'll look to address this in the next few releases. If you're interested feel free to work on this and we're happy to help.