Closed: curtisallen closed this issue 5 years ago
@curtisallen thanks for filing issue and supplying such a great write-up. Sorry for the long delay on the response.
A few thoughts:
A 50% decrease is not great, but it's also relative to how much work the service is doing. For example, I tested a modified version of this example and, after making a few improvements to the OPA integration, I observed the service was able to serve ~2,000 requests/sec with OPA (HTTP over localhost) versus ~4,800 requests/sec without OPA. The service needs about 200 microseconds to serve each request and OPA adds another ~200-300 microseconds. Given what's happening with OPA in the loop, this isn't too surprising (e.g., open TCP connection to OPA, serialize input attributes and send over connection, deserialize input attributes for policy evaluation, execute policy, serialize answer, send output over connection, etc.) Note, with Decision Logging disabled the numbers improve as well (up to ~2,500 requests/sec). Since you're in Go, embedding OPA as a library and avoiding the context switch and serialization will also help.
Regarding the timeouts, it would be useful to know (i) how long the queries are taking, (ii) how long each concurrent user waits before sending a new request, and (iii) how utilized the nodes are (i.e., whether OPA can use more than 200m CPU resources). With an m5.large instance there are 2 vCPUs (1 physical CPU), so OPA can serve at most 2 requests at a time. If each request takes 2.5ms (400 RPS = 2.5ms per request), then I wouldn't expect any user to be blocked for more than 125ms; however, since OPA only has 200m CPU resources to work with, this could be more like ~600ms. This back-of-the-envelope calculation gets us close to the 1s timeout (although not quite).
One thing we could do is improve the OPA server (and related integrations) to use prepared queries. We have two issues for this: #1553 (for the OPA-Envoy integration) and #1567 (for the OPA HTTP server). I think this would help in your case because, as you can see, most of the time is spent on query parsing and compiling (even when ?partial is provided). With prepared queries those stages are cached:
```console
$ curl 'localhost:8181/v1/data?partial&metrics&pretty'
{
  "metrics": {
    "timer_rego_input_parse_ns": 736,
    "timer_rego_module_compile_ns": 289,
    "timer_rego_module_parse_ns": 332,
    "timer_rego_query_compile_ns": 113906,
    "timer_rego_query_eval_ns": 18087,
    "timer_rego_query_parse_ns": 267391,
    "timer_server_handler_ns": 1379409
  },
  "result": {}
}
```
Beyond that, we could also try to make recommendations for how many CPU resources to allocate to OPA to fit certain budgets. E.g., if your OPA latency budget is 1ms and you run on m5.large, give OPA X CPU resources.
I've filed #1601 for that last thought. I'm going to close this because I think we have follow-up issues for everything we can do at this time. Please feel free to comment/make other suggestions!
@tsandall @curtisallen @ashutosh-narkar what is the recommended amount of CPU we should allocate to OPA to make the most of the server? OPA being CPU-intensive, won't 200m CPU suffer from a lot of context switching? To increase throughput, can we run multiple OPA nodes and load balance among them?
Per my comment on #1601 OPA will use as many cores as possible (by default).
> To increase throughput, can we run multiple OPA nodes and load balance among them?
Yes, but this is your responsibility. OPA does not provide any kind of clustering/sharding/load balancing capabilities today. You would need to distribute policies (and any required data) to each OPA instance (e.g., using the OPA Bundle API: https://www.openpolicyagent.org/docs/latest/management/#bundles). Keep in mind, this model would be eventually consistent. Hope this helps.
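For illustration, each replica in such a fleet could pull the same policies via a bundle config file. This is a hypothetical sketch: the service name, URL, bundle path, and polling intervals are placeholders, not values from this thread.

```yaml
# opa-config.yaml -- passed to each replica as: opa run --server --config-file opa-config.yaml
services:
  - name: bundle-registry
    url: https://bundles.example.com   # placeholder bundle server
bundles:
  authz:
    service: bundle-registry
    resource: bundles/authz.tar.gz     # placeholder bundle path
    polling:
      min_delay_seconds: 30
      max_delay_seconds: 60
```

A plain Kubernetes Service in front of several such replicas then provides the load balancing; since each replica polls independently, policy updates roll out eventually rather than atomically.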
On Fri, Jan 10, 2020 at 8:42 AM Gautam Jain wrote: (comment quoted above)
-Torin
OPA running as a sidecar service in Kubernetes is pretty CPU hungry. I'm doing a stress test, and OPA seems to require more CPU resources than my application. I've been fiddling with the policies I have loaded and turned on partial queries (which did help significantly), but overall I'm pretty disappointed with OPA's performance. To rule out inefficient policies I ran the same test with the most straightforward policy:
```rego
default allow = true
```
Max queries (over a 2-minute time span) dip about 50% with this simple policy. I'm not using any advanced features like bundles and only leverage the Data API.

On the Y axis we have the total number of requests served over a 2-minute time span. The red represents the error rate with a 1-second timeout on OPA responses. With OPA in the loop our load tester sent 400 requests/second with 100 concurrent users (go-routines). Each test configuration was run 3 times and the graph above shows the average of the three runs. Without OPA in the loop, I was able to run 2000 requests per second with 100 concurrent users.

Here's the raw data that I visualized in the graph above.
Test Bed
My application is a Golang REST API service; OPA is invoked as authorization middleware for every API request to my application. The only policy loaded into OPA is the `default allow = true` rule shown above.
Here's an example JSON payload that my application will send to OPA (`/v1/data`):

Payload
```json
{
  "Input": {
    "Path": "cars.update",
    "Resources": {
      "Car": {
        "ID": "85d4961e5854"
      }
    },
    "User": {
      "UserSession": {
        "user": "someusername",
        "user_id": "123userid"
      }
    }
  }
}
```
To which OPA will return something like this response:

```console
HTTP/1.1 200 OK
Content-Length: 342
Content-Type: application/json
Date: Wed, 09 Jan 2019 18:18:44 GMT

{
  "result": {
    "Path": "cars.update",
    "allow": true
  }
}
```
My application ensures OPA returns `"allow": true` before performing the API request. My application will time out waiting for a response from OPA after 1 second (if I increase this timeout I get a better success rate).
The Stress Test
OPA is deployed as a sidecar to my application in Kubernetes; here are the resource claims:
Kubernetes Deployment
```yaml
spec:
  containers:
  - name: myapp
    ...
    resources:
      limits:
        cpu: 256m
        memory: 256Mi
      requests:
        cpu: 256m
        memory: 256Mi
  - name: opa
    image: "openpolicyagent/opa:0.10.7"
    ports:
    - containerPort: 8181
    args:
    - run
    - --server
    - --log-format
    - json
    resources:
      limits:
        cpu: 200m
        memory: 64Mi
      requests:
        cpu: 200m
        memory: 64Mi
```
Here's the code our application uses to call OPA. Our middleware calls `allowed` below:

```golang
func (omw *OPAAuthClient) postJSON(ctx context.Context, body []byte) ([]byte, error) {
	var logger = log.FromContext(ctx)
	var sURL, err = url.Parse(omw.host + "/v1/data")
	if err != nil {
		return nil, err
	}
	var req *http.Request
	req, err = http.NewRequest("POST", sURL.String(), bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Add("Content-Type", "application/json; charset=utf-8")
	var resp *http.Response
	resp, err = omw.httpClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer func() {
		var err = resp.Body.Close()
		if err != nil {
			logger.Errorw("Error closing body reader.", zap.Error(err))
		}
	}()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("OPA API unexpected status code %d returned from url %s", resp.StatusCode, sURL)
	}
	var bytes []byte
	bytes, err = ioutil.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return bytes, nil
}

// allowed sends the given request to the OPA process
func (omw *OPAAuthClient) allowed(ctx context.Context, input Request) (Response, error) {
	var request, err = json.Marshal(input)
	if err != nil {
		return Response{}, err
	}
	var logger = log.FromContext(ctx)
	logger.Debugf("Sending to OPA:%s", request)
	var respBytes []byte
	respBytes, err = omw.postJSON(ctx, request)
	if err != nil {
		return Response{}, err
	}
	if len(respBytes) == 0 {
		return Response{}, fmt.Errorf("Allowed error: empty body")
	}
	var resp Response
	err = json.Unmarshal(respBytes, &resp)
	if err != nil {
		return Response{}, fmt.Errorf("Allowed unmarshal error: %s", err)
	}
	return resp, nil
}
```
During a stress test, we see a 50% decrease in RPS with OPA in the loop compared to OPA out of the loop, with an error rate (due to OPA timeouts) of 8.2%.
If I look at the OPA container's CPU throttle seconds (a Kubernetes metric), we see OPA demand an additional 750ms of CPU (on an m5.large) beyond its claim to satisfy the load. The yellow line in this graph is OPA, while the green line is my application.

Given these results, we've removed OPA from our application, but I wanted to open this issue to hopefully help the maintainers improve performance.