spiffe / spire

The SPIFFE Runtime Environment
https://spiffe.io
Apache License 2.0

[HELP] frequent GC, Can we use sync.Pool in cloneEntries? #4515

Open onesafe opened 1 year ago

onesafe commented 1 year ago

When the server has 1w+ registration entries per agent, the GetAuthorizedEntries method causes high memory usage and frequent GC.

Could we use sync.Pool to optimize memory reuse in cloneEntries (https://github.com/spiffe/spire/blob/main/pkg/server/cache/entrycache/fullcache.go#L278)? Would it cause data corruption?

[attached graph]

azdagron commented 1 year ago

I did some benchmarking some time back specifically along those lines. Unfortunately, the slice that holds the cloned entries is a very small percentage of the overall memory used. Most of the allocations are deep down inside of the protobuf cloning and reflect package usage. It didn't make that much difference.

azdagron commented 1 year ago

I'll see if I can't expose a benchmark for this. I have some stuff already set up in relation to #4451.

azdagron commented 1 year ago

I'll also benchmark some different options and alternatives for cloning the entries. Right now we're relying on proto.Clone which like I mentioned is super expensive. Since we know the type we're "cloning" we can do something more bespoke but efficient.

rturner3 commented 1 year ago

Hi @onesafe, out of curiosity, what are the net effects on the server performance you're noticing because of these GC events (e.g. high CPU usage, additional system latency due to stop-the-world events)?

I also wasn't sure how to interpret the graph you attached. Could you provide some more information on what data that graph is displaying?

onesafe commented 1 year ago

Hi, I removed cloneEntries and applyMask in GetAuthorizedEntries and found that the average latency dropped from 30ms to 5ms, and CPU load decreased significantly. I have a question: is it safe to operate like this without the applyMask function? What are the typical scenarios for using applyMask?

[attached graphs]

azdagron commented 1 year ago

A few things to note:

I ran a few benchmarks. In the following table, clone means cloning via proto.Clone, manual means a manual duplication of the entry struct and all fields, and shallow means a simple copy of the struct without cloning field contents.

goos: darwin
goarch: arm64
pkg: github.com/spiffe/spire/pkg/server/cache/entrycache
                                │ old.clone.txt │           old.manual.txt            │           old.shallow.txt           │
                                │    sec/op     │   sec/op     vs base                │   sec/op     vs base                │
GetAuthorizedEntriesInMemory-12     329.1µ ± 0%   222.0µ ± 0%  -32.56% (p=0.000 n=10)   135.7µ ± 0%  -58.77% (p=0.000 n=10)

                                │ old.clone.txt │         old.manual.txt         │           old.shallow.txt            │
                                │     B/op      │     B/op      vs base          │     B/op      vs base                │
GetAuthorizedEntriesInMemory-12    285.1Ki ± 0%   285.1Ki ± 0%  ~ (p=0.129 n=10)   139.8Ki ± 0%  -50.97% (p=0.000 n=10)

                                │ old.clone.txt │         old.manual.txt          │          old.shallow.txt           │
                                │   allocs/op   │  allocs/op   vs base            │ allocs/op   vs base                │
GetAuthorizedEntriesInMemory-12     3005.0 ± 0%   3005.0 ± 0%  ~ (p=1.000 n=10) ¹   605.0 ± 0%  -79.87% (p=0.000 n=10)
¹ all samples are equal

From this we can see that moving from proto.Clone to manual copying reduced CPU by 32%, though the number of allocations does not change, so I would imagine GC pressure would be the same.

Moving from proto.Clone to a shallow copy of the struct (the least safe) shows a 60% reduction in CPU and an 80% reduction in allocations. This approach would have a large impact on reducing GC pressure, though it comes with the greatest risk of accidental cache pollution if callers were not careful with the entries returned from the cache.

It's also worth noting that we are developing a new cache. Running a similar benchmark, I see the following:

goos: darwin
goarch: arm64
pkg: github.com/spiffe/spire/pkg/server/authorizedentries
                                │ new.clone.txt │            new.manual.txt            │           new.shallow.txt           │
                                │    sec/op     │    sec/op     vs base                │   sec/op     vs base                │
GetAuthorizedEntriesInMemory-12    191.15µ ± 0%   137.84µ ± 0%  -27.89% (p=0.000 n=10)   95.35µ ± 0%  -50.12% (p=0.000 n=10)

                                │ new.clone.txt │         new.manual.txt         │           new.shallow.txt            │
                                │     B/op      │     B/op      vs base          │     B/op      vs base                │
GetAuthorizedEntriesInMemory-12    200.0Ki ± 0%   200.0Ki ± 0%  ~ (p=0.271 n=10)   127.3Ki ± 0%  -36.37% (p=0.000 n=10)

                                │ new.clone.txt │         new.manual.txt          │          new.shallow.txt           │
                                │   allocs/op   │  allocs/op   vs base            │ allocs/op   vs base                │
GetAuthorizedEntriesInMemory-12     1513.0 ± 0%   1513.0 ± 0%  ~ (p=1.000 n=10) ¹   313.0 ± 0%  -79.31% (p=0.000 n=10)
¹ all samples are equal

There are similar ratios of reduction across the three "cloning" methods. It's worth pointing out that the authorized entry crawl in the worst case in the new cache is still twice as fast as the old cache. The number of allocations is also about half of the old cache.

@onesafe, also, I'm not familiar with the unit w in 1w+ on your original post. Can you clarify that unit?

onesafe commented 1 year ago

Oh sorry, 1w means 10,000 registration entries.

> if callers were not careful with the entries returned from the cache.

Can you give an example?

azdagron commented 1 year ago

Yes, see 7ea53c4cac46fbc99edcf26efc369bcc273bb60b.

How much of a reduction would be good enough for your purposes and how quickly are you in need of a solution?

The reason I ask is that there are two features that will arrive in the next few months that could improve this significantly without getting rid of the safety that the cloning provides.

The first is the new cache we are developing. As you can see from the above benchmarks, it roughly halves allocations. This cache should be available experimentally in 1.9.0 (Q4 2023) and will likely be the default in 1.10.0 (Q1 2024).

The second is the new sync RPC that is currently under test. This RPC drastically reduces the amount of traffic in all but the first sync of a given agent by sending each agent only what has changed. It would only need to clone the new or updated entries being returned.

onesafe commented 1 year ago

I'm not in a hurry for a solution, and these two features you mentioned are great. I look forward to their launch.

github-actions[bot] commented 3 weeks ago

This issue is stale because it has been open for 365 days with no activity.