open-policy-agent / opa

Open Policy Agent (OPA) is an open source, general-purpose policy engine.
https://www.openpolicyagent.org
Apache License 2.0

Performance issue when using optimize level 2 #6909

Open amirsalarsafaei opened 3 months ago

amirsalarsafaei commented 3 months ago

story time (you can skip this)

So, long story short, our company has been building an API gateway authorization solution for our APIs using OPA. Last week the project finally started handling some real-world requests, and we saw ~100ms latency... it's a CNCF project, so we figured we must have configured it wrong. We played with the configuration, turned off the logs, etc. As a last resort I updated to v0.67.0; we were using glob.match and things got much worse (#6908). So I ditched glob.match and used code generation to generate the rules, which got the response times down, but there has been an itch in my mind ever since. Over the last two weekends I've been playing with the codebase and the commands, trying to understand the problem, and I finally have some results.

Short description

When using opa build --optimize 2 to build a bundle that contains a rule like:

package smth.authz

import rego.v1

import data.jwks
import data.users

default allow := false

allow if {
    token.valid
    user.is_admin
}

user := users[token.payload.user_id]
token := {"valid": valid, "payload": payload} if {
    constraints := {
        "cert": jwks,
        "iss": "somewhere",
        "aud": "opa",
    }

    [valid, _, payload] := io.jwt.decode_verify(input.auth_header, constraints)
}

this causes the optimized output to look something like this:

package smth.authz

import rego.v1

default allow := false

allow if {
    [_term_3_13, __local5__2695, __local5__2695] = io.jwt.decode_verify(input.auth_header, {
        "cert": data.jwks,
        "iss": "somewhere",
        "aud": "opa",
    })

        _term_3_13
    [__local5__2698, __local5__2698, _ref_1922694] = io.jwt.decode_verify(input.auth_header, {
        "cert": data.jwks,
        "iss": "somewhere",
        "aud": "opa",
    })
    data.users[_ref_1922694.user_id].is_admin
}

As you can see, this results in decoding the JWT twice and consequently doubling the latency. I dug further to find the exact location of the problem: it is this section. I put some breakpoints there and found out that it treats the io package as a headref, assuming that dereferencing is always cheap rather than a potentially expensive call.

In our case the optimization generated three sequential JWT decode calls, causing a 3x time penalty compared to no optimization.
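
For comparison, this is roughly what I would have expected the inlined rule to look like (a hand-written sketch, not actual opa build output): a single io.jwt.decode_verify call whose results feed both the validity check and the user lookup.

package smth.authz

import rego.v1

default allow := false

allow if {
    # one io.jwt.decode_verify call; both results are reused below
    [valid, _, payload] := io.jwt.decode_verify(input.auth_header, {
        "cert": data.jwks,
        "iss": "somewhere",
        "aud": "opa",
    })
    valid
    data.users[payload.user_id].is_admin
}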

Possible Solutions

I'm enthusiastic and would like to contribute and learn more :D I'm open to ideas and could open a PR.

Thanks for the great product, and sorry about the rant :p

anderseknert commented 3 months ago

Hi there! And thanks for filing another issue :)

The result of function calls should be cached by OPA for the scope of a single evaluation, and as long as the input arguments are the same, repeating the call should be cheap. So at least to the extent that is true (and perhaps there are exceptions I'm not aware of), you're not decoding the token twice even if it looks like that's the case. If you're seeing evaluation time increase linearly with the number of identical decode calls made... yeah, that sounds like a bug.

Either way, at least in this case the second call is redundant, and it would be good if the optimizer recognized that.

amirsalarsafaei commented 3 months ago

Hi! But in my benchmarks the time is increasing linearly :D and sometimes even worse: in the case where we were using glob.match there was a 10x to 100x increase in latency. I will investigate further :) but it correlates exactly with the number of repetitions, so it could be related to the caching mechanism.
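
For reference, the kind of minimal policy I've been benchmarking looks like this (a simplified, hypothetical repro; repro.authz, input.expected_sub, and data.jwks are placeholders). The idea is to build it once plainly and once with opa build --optimize 2, then compare eval times with opa bench or opa eval --profile and see whether they track the number of token references:

package repro.authz

import rego.v1

default allow := false

allow if {
    # each reference to token below shows up as its own inlined
    # io.jwt.decode_verify call in the --optimize 2 output
    token.valid
    token.payload.aud == "opa"
    token.payload.sub == input.expected_sub
}

token := {"valid": valid, "payload": payload} if {
    [valid, _, payload] := io.jwt.decode_verify(input.auth_header, {"cert": data.jwks})
}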

anderseknert commented 3 months ago

Sounds good. Let us know what you find!

amirsalarsafaei commented 3 months ago

@anderseknert I finally found time to take another look at it. The cache you are referring to is nd_caching, which is disabled by default, is hard to find in the documentation, and isn't mentioned in the performance section. So it's safe to assume that someone using optimize level 2 won't know about this configuration and will end up with poor performance. I think improving the documentation would be a great step, but if nd_caching is intentionally disabled by default and there is a reason for it, we can talk about fixing the optimizer instead.

anderseknert commented 3 months ago

No, that's something else :) The nd (for "non-deterministic") cache caches the results of non-deterministic built-in functions, like time.now_ns(), for the purpose of recording that along with decision logs. This is needed to be able to "back track" or replay decisions later when such functions have been called, as otherwise you'd likely get a different result when you do.

Perhaps I'm wrong about caching for built-in function calls, I honestly don't remember. I know that evaluation of rules is cached for sure, but you've made me uncertain about built-in functions. @ashutosh-narkar might be able to provide some insights here as I don't do a whole lot of work on OPA internals myself.

amirsalarsafaei commented 3 months ago

I actually read the code and ran the benchmark again. There is a caching mechanism, but it sits right before the call and it's the nd caching mechanism; enabling it fixes the performance issue, so I think it covers io.jwt.decode as well 😅 I'll try again and share my results.

amirsalarsafaei commented 3 months ago

Yep, I tested it. The cache on built-in function calls (other than http.send, which has its own separate caching) is the nd cache; the JWT decode built-in counts as non-deterministic, and enabling the cache boosts performance significantly.

ashutosh-narkar commented 3 months ago

If you take a look at the code here you'll see the non-deterministic built-ins are marked with the Nondeterministic: true property. As you mentioned, nd_builtin_cache is disabled by default and users need to opt in. The contents of the cache are made part of the decision logs that OPA generates. The cache could potentially get large (e.g. caching a large HTTP response), and if it were made part of the decision log it could exceed the max size configured for the log, thereby requiring OPA to do extra work to re-process the log.

> I think improving the documentation would be a great step

Feel free to submit a PR if you think the docs can be improved. Thanks!

amirsalarsafaei commented 3 months ago

@ashutosh-narkar what do you think about the optimizer creating duplicate function calls? Is that intentional?

ashutosh-narkar commented 3 months ago

@srenatus @johanfylling any ideas about the repeated function calls in the optimized policy? If that's contributing to an increased eval latency, then we should investigate this.

johanfylling commented 3 months ago

I think this is worth investigating. When the output from a non-deterministic built-in is reused, the optimization is definitely not optimal if it repeats the call rather than reusing its output.

amirsalarsafaei commented 3 months ago

Hi, I took some time to investigate and these are my findings: I had it all wrong thinking it was the copy propagator. After tweaking the code in my local setup and building countless bundles, it turned out this line was the problem. When the optimization level is higher than 1, shallow inlining is deactivated, which causes partialEval to be invoked instead of partialEvalSupport, and this leads to the repeated inlined calls to io.jwt.decode.

I would like to solve the problem with some help. Do you by any chance know why we use partialEval at optimize level 2 instead of the support version? What is the difference? I'd appreciate it :D @johanfylling

amirsalarsafaei commented 3 months ago

Update: I'm currently working on a fix. The problem is at optimization level 2: the compiler tries to inline as much as it can, but this results in repeated inlinings because of Rego's document-based format. I'm trying to cache the inlinings throughout the eval and fetch existing entries from there instead of re-evaluating them.

amirsalarsafaei commented 2 months ago

Hi again, so I'm a bit stuck. I'll try to explain the situation: as I mentioned in previous comments, the problem is the optimizer's in-lining. The implementation evaluates each document reference and rule completely (meaning each time a headref is referenced, it evaluates its whole body and saves it inline), so I tried fixing this by saving the refs (rules) that have already been evaluated in the corresponding scope's evaluator, checking them on subsequent calls, and reusing that binding. But this solution caused issues with some policies. Take a look at this policy:

package envoy.authz

import data.users

import input.attributes.request.http as http_request
import input.parsed_path as input_parsed_path
import rego.v1

default allow := false
default result := {}
# METADATA
# entrypoint: true
result := {
    "allowed": allow,
    "request_headers_to_remove": request_headers_to_remove,
    "response_headers_to_add": response_headers_to_add,
}

allow if {
    is_permitted
    is_whitelisted
}

is_permitted if {
    route_allowed
    user.role in ["ADMIN", "OPERATOR"]
}

is_whitelisted if {
    user.white_listed
}

request_headers_to_remove := ["x-user-id"]

response_headers_to_add := {
    "x-user-id": user_id,
    "x-user-status": user.status,
    "x-route-permissions": route_allowed.permissions
}

user_id := http_request.headers["x-user-id"]
user := users[user_id]

route_allowed := { "permissions": ["GET_SOME"] } if {
    http_request.method == "POST"
    input_parsed_path == [input_parsed_path[0], "get", "smth", ]    
    object.subset(user.permissions, ["GET_SOME"])
}

route_allowed := { "permissions": ["SOME_OTHER"] } if {
    http_request.method == "PUT"
    input_parsed_path == [input_parsed_path[0], "some", "other", input_parsed_path[3], ]    
    object.subset(user.permissions, ["SOME_OTHER"])
}

route_allowed := { "permissions": ["FOO", "BAR"] } if {
    http_request.method == "PUT"
    input_parsed_path == [input_parsed_path[0], "foo", "bar", input_parsed_path[3], ]   
    object.subset(user.permissions, ["FOO", "BAR"])
}

Note that, because of the in-lining and route_allowed having three definitions, there should be three result terms generated.
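
As a stripped-down analogue of what that means (hand-written shape, not actual compiler output; the example package, input.method, and input.path are made-up names), inlining a rule with three bodies into an entrypoint leaves the entrypoint itself with three specialized definitions:

package example

import rego.v1

# original: the entrypoint references route_allowed, which has three bodies;
# after full inlining the entrypoint ends up with three specialized copies
result := {"permissions": ["GET_SOME"]} if {
    input.method == "POST"
    input.path == [input.path[0], "get", "smth"]
}

result := {"permissions": ["SOME_OTHER"]} if {
    input.method == "PUT"
    input.path == [input.path[0], "some", "other", input.path[3]]
}

result := {"permissions": ["FOO", "BAR"]} if {
    input.method == "PUT"
    input.path == [input.path[0], "foo", "bar", input.path[3]]
}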

In our approach, result tries to inline allow but can't, because we need the default value. So at first it evaluates allow and generates a new headref for it, but since it saves the term binding and reuses it, allow is never evaluated again: allow ends up generated once, with only one of the possible route_allowed values, while result itself still gets generated correctly. So the problem arises from the fact that I save the in-lining and allow doesn't get evaluated again, but I think there may be another flaw as well: the generation of the multiple inlined allow values currently depends on result, whereas all possible refs should be generated on the first call, independently. Another hunch is that because I'm saving term bindings, and they change, that might contribute to the issue too.

Taking a step back, though, the whole in-lining approach at build time is flawed: at runtime we cache the virtual refs, whereas at build time that can't be done if, every time we reference something, we inline its entire rule body (if it's a virtual doc).

I might need some help. Up until now I've been trying to fix the in-lining design, but I may be missing something about that design, which is why I decided to share this and ask for some pointers or insight (a little push would suffice). This is the draft: https://github.com/open-policy-agent/opa/pull/6981

amirsalarsafaei commented 2 months ago

@johanfylling do you have any hints or ideas?

johanfylling commented 2 months ago

I would need to put some time to familiarize myself a bit more with this issue to be able to give informed pointers. Let me get back to you on that.

Some initial thought, though:

  1. For optimizations to be effective here, I would only expect any form of new caching to be needed for cases where non-deterministic built-in functions are called. The Virtual Cache would otherwise apply when a rule/function is called multiple times during eval.
  2. Caching would need to consider when the with keyword is used, as this might change the arguments going into the non-deterministic call (see the sketch after this list).
  3. How would we deal with non-deterministic built-ins where we don't expect the value to be re-used? Can we differentiate? Or scope the cache in some intelligent way? Seems like a hard problem.
  4. For your initial example, it looks like a solution would perhaps not be to figure out a way to cache the result of an in-lined value for later re-use; but rather to have a mechanism to realize that the token rule cannot be inlined, and should remain as a rule that is then referenced multiple times.
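
A hypothetical example for point 2 (claims, input.jwks, input.token, and data.alternate_jwks are made-up names, just to show the shape of the problem):

package example

import rego.v1

claims(tok) := payload if {
    # non-deterministic built-in; its result depends on input.jwks
    [_, _, payload] := io.jwt.decode_verify(tok, {"cert": input.jwks})
}

deny if {
    a := claims(input.token)
    a.iss == "issuer-a"

    # textually the same call, but `with` replaces input.jwks, so a cache
    # keyed only on the call site would wrongly reuse the first result
    b := claims(input.token) with input.jwks as data.alternate_jwks
    b.iss == "issuer-b"
}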

No. 4 is probably the most salient point/question here.