I briefly checked your code. Throwing in some random idea: it seems the cache usage does not count key size at all (https://github.com/open-policy-agent/opa/blob/7b6a04eac255b214c86ad57c9c5644d7be0a75fe/topdown/cache/cache.go#L123-L123), so maybe you should also add there something like c.usage += k.SizeInBytes()? Not sure what exactly gets stored with every http.send request as a key, but maybe it is of significant size, so the 100M limit for values might not even have a chance to take effect?

not sure what exactly gets stored with every http.send request as a key but maybe it is significant size so the 100M limit for values might not have even a chance to take effect?
The entire input object to http.send is the cache key. Looking at your input, it does not seem it would have a significant impact on the cache size.
so maybe you should also add there something like c.usage += k.SizeInBytes()
Yes. We could do that. It would have to be added in here.
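For concreteness, here is a minimal, self-contained Go sketch of the idea being discussed: charging keys as well as values against a cache's size limit. All names here (sizedCache, Entry, itemSize) are hypothetical, keys are plain strings for simplicity, and this is not OPA's actual cache code, just an illustration of the accounting.

```go
package cache

import "container/list"

// Entry is anything that can report its own size. The interface and
// names below are illustrative only; this is not OPA's actual API.
type Entry interface {
	SizeInBytes() int64
}

type item struct {
	key string
	val Entry
}

// sizedCache keeps memory usage bounded by charging both keys and
// values against the limit, evicting the oldest entries when needed.
type sizedCache struct {
	maxSizeBytes int64
	usage        int64
	index        map[string]*list.Element
	order        *list.List // oldest entries at the front
}

func newSizedCache(maxSizeBytes int64) *sizedCache {
	return &sizedCache{
		maxSizeBytes: maxSizeBytes,
		index:        map[string]*list.Element{},
		order:        list.New(),
	}
}

func itemSize(it item) int64 {
	// Count the key as well as the value; counting only the value is
	// what allows key storage to grow past the configured limit.
	return int64(len(it.key)) + it.val.SizeInBytes()
}

// Insert stores v under key k, replacing any previous entry for k,
// and evicts old entries until usage fits under maxSizeBytes.
func (c *sizedCache) Insert(k string, v Entry) (dropped int) {
	if el, ok := c.index[k]; ok { // avoid duplicate keys in the list
		c.usage -= itemSize(el.Value.(item))
		c.order.Remove(el)
		delete(c.index, k)
	}
	newIt := item{key: k, val: v}
	for c.usage+itemSize(newIt) > c.maxSizeBytes && c.order.Len() > 0 {
		front := c.order.Front()
		old := front.Value.(item)
		c.usage -= itemSize(old)
		c.order.Remove(front)
		delete(c.index, old.key)
		dropped++
	}
	c.index[k] = c.order.PushBack(newIt)
	c.usage += itemSize(newIt)
	return dropped
}
```

The important detail is that eviction compares usage, which now includes key bytes, against the limit, so a workload with large keys and small values can no longer grow far past the configured size.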
You mentioned:
It doesn't look like the max cache size has any effect and eventually the OPA pod will crash on OOM.
We added this fix in 0.45.0 to drop items from the cache when the limit is hit. Not sure at this point if this is something related to Go not releasing memory back to the OS, or whether we should start removing items from the cache when it is, say, 80% full, or some other issue. Btw, how much memory does the OPA container have?
Is the https://github.com/open-policy-agent/opa/issues/5359 issue valid for your setup?
There are two parts to the problem:

1. Keys in c.l are only ever removed here, which only ever happens when the cache is full
2. c.l allows for duplicate elements

This means c.l will grow and grow until the cache becomes full or OPA reaches its memory limit and becomes oomkilled.

It is easier than one might think to be affected by this issue. Consider the following setup:

- http.send has a 10 second cache lifetime
- http.send uses bearer authentication; the JWT is approx 1KB in size and is renewed every 5 minutes
- http.send fetches data for individual users, so each user has a unique http.send request that is cached
- each http.send response is approx 2KB

If OPA in this scenario processes 100 active users, OPA would insert 100KB worth of keys into c.l every 10 seconds, while the actual cache would insert 200KB of data every 5 minutes. 100KB every 10 seconds means the size of c.l increases by 36 MB every hour! Meanwhile, c.usage increases by only 2.4 MB per hour. In this case the cache should peak after approx. 4 hours at a c.l size of 144 MB, where one expected the cache to use only 10 MB.

It would be great if OPA considered the size of cache keys 😄
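To make the arithmetic of this scenario explicit, here is a tiny Go calculation of the assumed growth rates. The constants mirror the numbers above and are assumptions of the example, not measurements:

```go
package main

import "fmt"

func main() {
	const (
		users       = 100
		keyKB       = 1.0   // approx. size of one http.send cache key (input incl. ~1KB JWT)
		valueKB     = 2.0   // approx. size of one cached response
		keyEverySec = 10.0  // entries expire after 10s, so a key is (re-)inserted every 10s
		valEverySec = 300.0 // the JWT rotates every 5 min, so genuinely new entries appear every 300s
		hour        = 3600.0
	)

	// Per the scenario above: every re-insert appends another key to the
	// internal list, while the value store only grows when the key changes.
	keyGrowthMB := users * keyKB * (hour / keyEverySec) / 1000
	valGrowthMB := users * valueKB * (hour / valEverySec) / 1000

	fmt.Printf("key list growth:    %.1f MB/hour\n", keyGrowthMB) // 36.0 MB/hour
	fmt.Printf("value usage growth: %.1f MB/hour\n", valGrowthMB) // 2.4 MB/hour
}
```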
The entire input object to http.send is the cache key
Byte-wise it might easily be 50% overhead in our case. The values we load over http.send are just around 500B JSONs.
if we should start removing items from the cache when it is, say, 80% full
It makes sense to me. Removing just enough items to free the cache space for this particular item seems to be a fragile solution. Anyway, ultimately this would help a lot in our case: https://github.com/open-policy-agent/opa/issues/5320 (because we use a small TTL of only around 10s)
Btw how much memory does the OPA container have?
Request 500M, limit 1000M. Having it bigger or smaller only affects how long it takes to hit the issue.
Is the https://github.com/open-policy-agent/opa/issues/5359 issue valid for your setup?
Very probably it is. I also noticed that multiple concurrent requests are sent when there is a cache miss, etc. I will check this issue in more detail later.
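On the duplicate-concurrent-requests point: the usual Go pattern for collapsing simultaneous fetches of the same key on a cache miss is request coalescing, e.g. with golang.org/x/sync/singleflight. A minimal sketch follows; it is illustrative only and not how OPA implements http.send, and the fetch function and key are made up for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetch stands in for the backend call behind http.send; on a cache
// miss, concurrent callers with the same key share a single request.
func fetch(key string) (string, error) {
	v, err, shared := group.Do(key, func() (interface{}, error) {
		time.Sleep(100 * time.Millisecond) // simulate the remote call
		return "response for " + key, nil
	})
	_ = shared // true for callers that piggybacked on another caller's request
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			res, _ := fetch("user-42")
			fmt.Println(res) // all five goroutines get the result of one request
		}()
	}
	wg.Wait()
}
```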
@asleire Thanks for the example. I think it is actually pretty much our situation, which causes the memory problems and unnecessarily high memory usage. I think the following things should be solved:

- take cache keys size into consideration
- proactively remove cache items that are already stale (max-age has been reached)
- when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"

What do you think?
take cache keys size into consideration
I think this alone would solve the problem entirely
proactively remove cache items that are already stale (max-age has been reached)
This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration
when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"
I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with
I think this alone would solve the problem entirely
It should solve memory size predictability and OOMs, but we would still run into the situation where OPA memory is maxed out and constant (never getting any lower because of the missing (periodic) cache cleanup).
This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration
I guess it always matters if you pay for, e.g., 1G of memory for instances holding stale http.send responses :)
I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with
Yeah, probably overengineering. The other two improvements should be enough 👍
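For illustration, here is a rough Go sketch of what the "proactively remove stale items" idea could look like: a background sweeper that drops entries whose max-age has passed, so memory is released even when the cache never reaches its size limit. The types and names are hypothetical; this is not OPA's implementation.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     []byte
	expiresAt time.Time // derived from max-age at insert time
}

type staleSweepingCache struct {
	mu    sync.Mutex
	items map[string]entry
}

// StartSweeper periodically drops entries whose max-age has passed.
func (c *staleSweepingCache) StartSweeper(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				c.removeExpired(time.Now())
			case <-stop:
				return
			}
		}
	}()
}

func (c *staleSweepingCache) removeExpired(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, e := range c.items {
		if now.After(e.expiresAt) {
			delete(c.items, k)
		}
	}
}
```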
I've created this issue for including the cache key in the cache size calculation.
@ashutosh-narkar Hi, any updates on this one? Even after the v0.47.3 bump (visible on the graph below), which fixes some cache-related problems, we are still experiencing ever-rising memory usage that does not respect any cache limits (a 100M cache limit is configured here):
Thanks!
💭 Do we have a more specific cache-size metric? The one here includes all heap allocations. Just wondering.
Not likely, judging by https://www.openpolicyagent.org/docs/latest/monitoring/#prometheus. However, it would be nice to have one.
I hope this fix will help us see memory usage stop rising at around 130MB.
Any updates on this one? Is nobody else experiencing this issue? 🤔
@lukyer I made a fix for the issue described here; it was merged a while ago and is part of the 0.48.0 release from yesterday. Have you tested it?
Ah, I missed that fix. I will bump OPA and keep it running for a few days so we will know if it helped! Thanks a lot.
@asleire so unfortunately, 0.48.0 didn't solve our issue. Cache size: 100M, growing way over that until OOM kill.
I think the key size would have to be taken into account in the cache size limit to fix it. Thoughts?
I think the key size would have to be taken into account in the cache size limit to fix it. Thoughts?
Yep, that is the fix. The workaround, however, is to simply lower your cache limit. If, for instance, your average key size is 1kB and your average response size is 200B, a cache limit of 100MB would mean 100MB of responses along with 500MB worth of keys, for a total of 600MB of data.
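Spelling that estimate out as a quick Go calculation (the 1 kB key, 200 B response, and 100 MB value-only limit are the assumed averages from the comment above):

```go
package main

import "fmt"

func main() {
	const (
		limitMB   = 100.0 // configured cache limit, counted against response bytes only
		avgKeyKB  = 1.0   // assumed average cache key size
		avgValueB = 200.0 // assumed average response size
	)

	// The limit only constrains values, so it determines how many entries fit.
	entries := limitMB * 1000 * 1000 / avgValueB        // 500,000 entries
	keysMB := entries * avgKeyKB * 1000 / (1000 * 1000) // 500 MB of keys
	fmt.Printf("entries at the limit: %.0f\n", entries)
	fmt.Printf("key overhead:         %.0f MB\n", keysMB)
	fmt.Printf("total memory:         %.0f MB\n", limitMB+keysMB) // ~600 MB
}
```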
Using https://github.com/open-policy-agent/opa/issues/5385 to keep track of the changes needed in http.send.
Short description
Our policy sends many cached http.send requests with a low max-age of 10s. The OPA server is configured with a 100M max cache size:

I noticed that memory usage reported by OPA (go_memstats_alloc_bytes) looks like this:

It is similar to what k8s reports to us. The gc reports from GODEBUG are similar as well. It doesn't look like the max cache size has any effect, and eventually the OPA pod will crash on OOM. We do not experience any errors; everything works as expected, only the memory keeps growing until the pod gets killed.
The http.send call looks like:
Version:
Expected behavior
Memory usage should stay consistent at around 100M-150M.
Any ideas what could be wrong here? Thanks!
Possibly related: https://github.com/open-policy-agent/opa/issues/5320