Open onesafe opened 1 year ago
I did some benchmarking some time back specifically along those lines. Unfortunately, the slice that holds the cloned entries is a very small percentage of the overall memory used. Most of the allocations are deep down inside of the protobuf cloning and reflect
package usage. It didn't make that much difference.
I'll see if I can't expose a benchmark for this. I have some stuff already set up in relation to #4451.
I'll also benchmark some different options and alternatives for cloning the entries. Right now we're relying on proto.Clone which like I mentioned is super expensive. Since we know the type we're "cloning" we can do something more bespoke but efficient.
Hi @onesafe, out of curiosity, what are the net effects on the server performance you're noticing because of these GC events (e.g. high CPU usage, additional system latency due to stop-the-world events)?
I also wasn't sure how to interpret the graph you attached. Could you provide some more information on what data that graph is displaying?
Hi,I have remove cloneEntries and applyMask in GetAuthorizedEntries. Found that the average time consumption has dropped from 30ms to 5ms,and the cpu load has significantly decreased. I have a question, can I operate like this if I don't use the applyMask function? What are the typical scenarios for using applyMask?
A few things to note:
I ran a few benchmarks. In the following table, clone
means cloning via proto.Clone, manual
means a manual duplication of the entry struct and all fields, and shallow
means a simply copy of the struct without cloning field contents.
goos: darwin
goarch: arm64
pkg: github.com/spiffe/spire/pkg/server/cache/entrycache
│ old.clone.txt │ old.manual.txt │ old.shallow.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
GetAuthorizedEntriesInMemory-12 329.1µ ± 0% 222.0µ ± 0% -32.56% (p=0.000 n=10) 135.7µ ± 0% -58.77% (p=0.000 n=10)
│ old.clone.txt │ old.manual.txt │ old.shallow.txt │
│ B/op │ B/op vs base │ B/op vs base │
GetAuthorizedEntriesInMemory-12 285.1Ki ± 0% 285.1Ki ± 0% ~ (p=0.129 n=10) 139.8Ki ± 0% -50.97% (p=0.000 n=10)
│ old.clone.txt │ old.manual.txt │ old.shallow.txt │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
GetAuthorizedEntriesInMemory-12 3005.0 ± 0% 3005.0 ± 0% ~ (p=1.000 n=10) ¹ 605.0 ± 0% -79.87% (p=0.000 n=10)
¹ all samples are equal
From this we can see that moving from proto.Clone to a manual copying reduced CPU by 32%, though the number of allocations does not change, so I would imagine GC pressure would be the same.
Moving from proto.Clone to a shallow copy of the struct (the least safe), shows a 60% reduction in CPU and 80% reduction in allocations. This approach would have a large impact on reducing GC pressure, thought it comes with the greatest risk of accidental cache pollution if callers were not careful with the entries returned from the cache.
It's also worth noting that we are developing a new cache. Running a similar benchmark, I see the following:
goos: darwin
goarch: arm64
pkg: github.com/spiffe/spire/pkg/server/authorizedentries
│ new.clone.txt │ new.manual.txt │ new.shallow.txt │
│ sec/op │ sec/op vs base │ sec/op vs base │
GetAuthorizedEntriesInMemory-12 191.15µ ± 0% 137.84µ ± 0% -27.89% (p=0.000 n=10) 95.35µ ± 0% -50.12% (p=0.000 n=10)
│ new.clone.txt │ new.manual.txt │ new.shallow.txt │
│ B/op │ B/op vs base │ B/op vs base │
GetAuthorizedEntriesInMemory-12 200.0Ki ± 0% 200.0Ki ± 0% ~ (p=0.271 n=10) 127.3Ki ± 0% -36.37% (p=0.000 n=10)
│ new.clone.txt │ new.manual.txt │ new.shallow.txt │
│ allocs/op │ allocs/op vs base │ allocs/op vs base │
GetAuthorizedEntriesInMemory-12 1513.0 ± 0% 1513.0 ± 0% ~ (p=1.000 n=10) ¹ 313.0 ± 0% -79.31% (p=0.000 n=10)
¹ all samples are equal
There are similar ratios of reduction across the three "cloning" methods. Its worth pointing out that the authorized entry crawl in the worst case in the new cache is still twice as fast as the old cache. The number of allocations is also about half of the old cache.
@onesafe, also, I'm not familiar with the unit w
in 1w+
on your original post. Can you clarify that unit?
oh sorry, 1w means 10000 registry entries。
if callers were not careful with the entries returned from the cache.
Can you give an example?
Yes, see 7ea53c4cac46fbc99edcf26efc369bcc273bb60b.
How much of a reduction would be good enough for your purposes and how quickly are you in need of a solution?
The reason I ask is that there are two features that will arrive in the next few months that could improve this significantly without getting rid of the safety that the cloning provides.
The first is the new cache we are developing. As you can see from the above benchmarks, it reduces allocations in half. This cache should be available experimentally in 1.9.0 (Q4 2023) and likely be the default in 1.10.0 (Q1 2024).
The second is the new sync RPC that is currently under test. This RPC severely reduces the amount of traffic in all but the first sync of a given agent by only sending what has changed to each agent. It would only need to clone the new/updated entries that needed to be returned.
I'm not in a hurry for a solution, and these two features you mentioned are great. I look forward to their launch.
This issue is stale because it has been open for 365 days with no activity.
When Server have 1w+ registryEntry per agent, GetAuthorizedEntries Method cause high memory usage, causing frequent GC.
can we use sync.Pool Optimize memory reuse in cloneEntries? (https://github.com/spiffe/spire/blob/main/pkg/server/cache/entrycache/fullcache.go#L278) Will it cause data confusion?