I briefly checked your code. Throwing in some random idea: it seems the cache usage does not count key size at all (https://github.com/open-policy-agent/opa/blob/7b6a04eac255b214c86ad57c9c5644d7be0a75fe/topdown/cache/cache.go#L123-L123), so maybe you should also add there something like c.usage += k.SizeInBytes()? Not sure what exactly gets stored with every http.send request as a key, but maybe it is of significant size, so the 100M limit for values might not even have a chance to take effect?

not sure what exactly gets stored with every http.send request as a key but maybe it is significant size so the 100M limit for values might not have even a chance to take effect?
The entire input object to http.send is the cache key. Looking at your input, it does not seem it would have a significant impact on the cache size.
so maybe you should also add there something like c.usage += k.SizeInBytes()
Yes. We could do that. It would have to be added in here.
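For concreteness, here is a minimal, self-contained Go sketch of the idea being discussed: charging keys as well as values against a cache's size limit. All names here (sizedCache, Entry, itemSize) are hypothetical, keys are plain strings for simplicity, and this is not OPA's actual cache code, just an illustration of the accounting.

```go
package cache

import "container/list"

// Entry is anything that can report its own size. The interface and
// names below are illustrative only; this is not OPA's actual API.
type Entry interface {
	SizeInBytes() int64
}

type item struct {
	key string
	val Entry
}

// sizedCache keeps memory usage bounded by charging both keys and
// values against the limit, evicting the oldest entries when needed.
type sizedCache struct {
	maxSizeBytes int64
	usage        int64
	index        map[string]*list.Element
	order        *list.List // oldest entries at the front
}

func newSizedCache(maxSizeBytes int64) *sizedCache {
	return &sizedCache{
		maxSizeBytes: maxSizeBytes,
		index:        map[string]*list.Element{},
		order:        list.New(),
	}
}

func itemSize(it item) int64 {
	// Count the key as well as the value; counting only the value is
	// what allows key storage to grow past the configured limit.
	return int64(len(it.key)) + it.val.SizeInBytes()
}

// Insert stores v under key k, replacing any previous entry for k,
// and evicts old entries until usage fits under maxSizeBytes.
func (c *sizedCache) Insert(k string, v Entry) (dropped int) {
	if el, ok := c.index[k]; ok { // avoid duplicate keys in the list
		c.usage -= itemSize(el.Value.(item))
		c.order.Remove(el)
		delete(c.index, k)
	}
	newIt := item{key: k, val: v}
	for c.usage+itemSize(newIt) > c.maxSizeBytes && c.order.Len() > 0 {
		front := c.order.Front()
		old := front.Value.(item)
		c.usage -= itemSize(old)
		c.order.Remove(front)
		delete(c.index, old.key)
		dropped++
	}
	c.index[k] = c.order.PushBack(newIt)
	c.usage += itemSize(newIt)
	return dropped
}
```

The important detail is that eviction compares usage, which now includes key bytes, against the limit, so a workload with large keys and small values can no longer grow far past the configured size.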
You mentioned:
It doesn't look like the max cache size has any effect and eventually the OPA pod will crash on OOM.
We added this fix in 0.45.0 to drop items from the cache when the limit is hit. Not sure at this point if this is something related to Go not releasing memory back to the OS, or whether we should start removing items from the cache when it is, say, 80% full, or some other issue. Btw, how much memory does the OPA container have?
Is the https://github.com/open-policy-agent/opa/issues/5359 issue valid for your setup?
There are two parts to the problem:

1. Keys in c.l are only ever removed here, which only ever happens when the cache is full
2. c.l allows for duplicate elements

This means c.l will grow and grow until the cache becomes full or OPA reaches its memory limit and becomes oomkilled.

It is easier than one might think to be affected by this issue. Consider the following setup:

- http.send has a 10 second cache lifetime
- http.send uses bearer authentication; the JWT is approx 1KB in size and is renewed every 5 minutes
- http.send fetches data for individual users, so each user has a unique http.send request that is cached
- each http.send response is approx 2KB

If OPA in this scenario processes 100 active users, OPA would insert 100KB worth of keys into c.l every 10 seconds, while the actual cache would insert 200KB of data every 5 minutes. 100KB every 10 seconds means the size of c.l increases by 36 MB every hour! Meanwhile, c.usage increases by only 2.4 MB per hour. In this case the cache should peak after approx. 4 hours at a c.l size of 144 MB, where one expected the cache to use only 10 MB.

It would be great if OPA considered the size of cache keys 😄
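To make the arithmetic of this scenario explicit, here is a tiny Go calculation of the assumed growth rates. The constants mirror the numbers above and are assumptions of the example, not measurements:

```go
package main

import "fmt"

func main() {
	const (
		users       = 100
		keyKB       = 1.0   // approx. size of one http.send cache key (input incl. ~1KB JWT)
		valueKB     = 2.0   // approx. size of one cached response
		keyEverySec = 10.0  // entries expire after 10s, so a key is (re-)inserted every 10s
		valEverySec = 300.0 // the JWT rotates every 5 min, so genuinely new entries appear every 300s
		hour        = 3600.0
	)

	// Per the scenario above: every re-insert appends another key to the
	// internal list, while the value store only grows when the key changes.
	keyGrowthMB := users * keyKB * (hour / keyEverySec) / 1000
	valGrowthMB := users * valueKB * (hour / valEverySec) / 1000

	fmt.Printf("key list growth:    %.1f MB/hour\n", keyGrowthMB) // 36.0 MB/hour
	fmt.Printf("value usage growth: %.1f MB/hour\n", valGrowthMB) // 2.4 MB/hour
}
```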
The entire input object to http.send is the cache key
Byte-wise it might easily be 50% overhead in our case. The values we load over http.send are just around 500B JSONs.
if we should start removing items from the cache when it is, say, 80% full
It makes sense to me. Removing just enough items to free the cache space for this particular item seems to be a fragile solution. Anyway, ultimately this would help a lot in our case: https://github.com/open-policy-agent/opa/issues/5320 (because we use a small TTL of only around 10s)
Btw how much memory does the OPA container have?
Request 500M, limit 1000M. Having it bigger or smaller only affects how long it takes to hit the issue.
Is the https://github.com/open-policy-agent/opa/issues/5359 issue valid for your setup?
Very probably it is. I also noticed that multiple concurrent requests are sent when there is a cache miss, etc. I will check this issue in more detail later.
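On the duplicate-concurrent-requests point: the usual Go pattern for collapsing simultaneous fetches of the same key on a cache miss is request coalescing, e.g. with golang.org/x/sync/singleflight. A minimal sketch follows; it is illustrative only and not how OPA implements http.send, and the fetch function and key are made up for the example:

```go
package main

import (
	"fmt"
	"sync"
	"time"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetch stands in for the backend call behind http.send; on a cache
// miss, concurrent callers with the same key share a single request.
func fetch(key string) (string, error) {
	v, err, shared := group.Do(key, func() (interface{}, error) {
		time.Sleep(100 * time.Millisecond) // simulate the remote call
		return "response for " + key, nil
	})
	_ = shared // true for callers that piggybacked on another caller's request
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			res, _ := fetch("user-42")
			fmt.Println(res) // all five goroutines get the result of one request
		}()
	}
	wg.Wait()
}
```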
@asleire Thanks for the example. I think it is actually pretty much our situation, which causes the memory problems and unnecessarily high memory usage. I think the following things should be solved:

- take cache keys size into consideration
- proactively remove cache items that are already stale (max-age has been reached)
- when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"

What do you think?
take cache keys size into consideration
I think this alone would solve the problem entirely
proactively remove cache items that are already stale (max-age has been reached)
This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration
when cache cleanup is started (cache is full), perhaps remove more items at once, not just "as few items as possible to be able to insert this new cache item"
I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with
I think this alone would solve the problem entirely
It should solve memory size predictability and OOMs, but we would still run into the situation where OPA memory is maxed out and constant (never getting any lower because of the missing (periodic) cache cleanup).
This would be nice when OPA shares memory with other applications. Otherwise, I don't think it matters if cache keys are taken into consideration
I guess it always matters if you pay for, e.g., 1G of memory for instances holding stale http.send responses :)
I don't think this would do any good. OPA would still need as much memory as it does when the cache is near-full. In comparison, the second bullet point might prevent OPA's cache from ever getting full to begin with
Yeah, probably overengineering. The other two improvements should be enough 👍
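For illustration, here is a rough Go sketch of what the "proactively remove stale items" idea could look like: a background sweeper that drops entries whose max-age has passed, so memory is released even when the cache never reaches its size limit. The types and names are hypothetical; this is not OPA's implementation.

```go
package cache

import (
	"sync"
	"time"
)

type entry struct {
	value     []byte
	expiresAt time.Time // derived from max-age at insert time
}

type staleSweepingCache struct {
	mu    sync.Mutex
	items map[string]entry
}

// StartSweeper periodically drops entries whose max-age has passed.
func (c *staleSweepingCache) StartSweeper(interval time.Duration, stop <-chan struct{}) {
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				c.removeExpired(time.Now())
			case <-stop:
				return
			}
		}
	}()
}

func (c *staleSweepingCache) removeExpired(now time.Time) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k, e := range c.items {
		if now.After(e.expiresAt) {
			delete(c.items, k)
		}
	}
}
```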
I've created this issue for including the cache key in the cache size calculation.
@ashutosh-narkar Hi, any updates on this one? Even after the v0.47.3 bump (visible on the graph below), which fixes some cache-related problems, we are still experiencing ever-rising memory usage that does not respect any cache limits (a 100M cache limit is configured here):
Thanks!
💭 Do we have a more specific cache-size metric? The one here includes all heap allocations. Just wondering.
Not likely, judging by https://www.openpolicyagent.org/docs/latest/monitoring/#prometheus. However, it would be nice to have one.
I hope this fix will help us see memory usage stop rising at around 130MB.
Any updates on this one? Is nobody else experiencing this issue? 🤔
@lukyer I made a fix for the issue described here; it was merged a while ago and is part of the 0.48.0 release from yesterday. Have you tested it?
Ah, I missed that fix. I will bump OPA and keep it running for a few days so we will know if it helped! Thanks a lot.
@asleire so unfortunately, 0.48.0 didn't solve our issue. Cache size: 100M, growing way over that until OOM kill.
I think the key size would have to be taken into account in the cache size limit to fix it. Thoughts?
I think the key size would have to be taken into account in the cache size limit to fix it. Thoughts?
Yep, that is the fix. The workaround, however, is to simply lower your cache limit. If, for instance, your average key size is 1kB and your average response size is 200B, a cache limit of 100MB would mean 100MB of responses along with 500MB worth of keys, for a total of 600MB of data.
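Spelling that estimate out as a quick Go calculation (the 1 kB key, 200 B response, and 100 MB value-only limit are the assumed averages from the comment above):

```go
package main

import "fmt"

func main() {
	const (
		limitMB   = 100.0 // configured cache limit, counted against response bytes only
		avgKeyKB  = 1.0   // assumed average cache key size
		avgValueB = 200.0 // assumed average response size
	)

	// The limit only constrains values, so it determines how many entries fit.
	entries := limitMB * 1000 * 1000 / avgValueB        // 500,000 entries
	keysMB := entries * avgKeyKB * 1000 / (1000 * 1000) // 500 MB of keys
	fmt.Printf("entries at the limit: %.0f\n", entries)
	fmt.Printf("key overhead:         %.0f MB\n", keysMB)
	fmt.Printf("total memory:         %.0f MB\n", limitMB+keysMB) // ~600 MB
}
```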
Using https://github.com/open-policy-agent/opa/issues/5385 to keep track of the changes needed in http.send.
Short description
Our policy sends many cached http.send requests with a low max-age of 10s. The OPA server is configured with a 100M max cache size:

I noticed that memory usage reported by OPA (go_memstats_alloc_bytes) looks like this:

It is similar to what k8s reports to us. The gc reports from GODEBUG are similar as well. It doesn't look like the max cache size has any effect, and eventually the OPA pod will crash on OOM. We do not experience any errors; everything works as expected, only the memory keeps growing until the pod gets killed.
The http.send call looks like:
Version:
Expected behavior
Memory usage should stay consistent at around 100M-150M.
Any ideas what could be wrong here? Thanks!
Possibly related: https://github.com/open-policy-agent/opa/issues/5320