
Heavily contended lock in TraceIDGenerator.generateID #767

Open · ankon opened this issue 1 year ago

ankon commented 1 year ago

https://github.com/newrelic/go-agent/blob/046b4fc116adac2b4dec4d691b61a25b80488fc2/v3/internal/trace_id_generator.go#L52-L58
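For context, the locked section behind that link looks roughly like this (paraphrased; the exact names are assumptions based on the identifiers quoted later in this thread):

```go
package internal

import (
	"encoding/hex"
	"math/rand"
	"sync"
)

type TraceIDGenerator struct {
	sync.Mutex
	rnd *rand.Rand // math/rand sources are not safe for concurrent use, hence the mutex
}

func (tg *TraceIDGenerator) generateID(idLen int) string {
	var bits [16]byte
	tg.Lock()
	defer tg.Unlock() // held through the hex encoding below
	tg.rnd.Read(bits[:idLen/2])
	return hex.EncodeToString(bits[:idLen/2])
}
```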

In a heavily used server a lot of segments end at the same time, and from what I see in the profiles this lock is a problem.

  1. Can it be avoided at all? For instance, using some magic "per-thread" generator (as much as that is possible in Go; perhaps just having more than one rand instance would already work?)
  2. If it cannot be avoided, can the critical section be made smaller? I see that it uses defer to unlock, which means it unlocks only after the random bits have been converted to the hex representation needed for the ID. Maybe it can be explicitly unlocked at least before that step (see the sketch below)?
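A minimal sketch of option 2, assuming the generateID shape shown above: move the unlock before the hex encoding, so the lock only guards the read from the rand source.

```go
func (tg *TraceIDGenerator) generateID(idLen int) string {
	var bits [16]byte
	tg.Lock()
	tg.rnd.Read(bits[:idLen/2])
	tg.Unlock() // release before encoding; only tg.rnd needs protection
	return hex.EncodeToString(bits[:idLen/2])
}
```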

[attached image: pprof profile showing time spent in TraceIDGenerator.generateID]

ankon commented 3 weeks ago

FWIW: I tried replacing the defer tg.Unlock() with an explicit unlock right after tg.rnd.Read, but benchmarks seem to imply that, at least on my local machine, this makes things slower.

What does help (see #977) is avoiding the allocation from EncodeToString and doing the encoding in-place.
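A sketch of the in-place idea (the actual change is in #977; this shape is an assumption): encode into a fixed-size buffer with hex.Encode instead of letting hex.EncodeToString allocate an intermediate slice.

```go
func (tg *TraceIDGenerator) generateID(idLen int) string {
	var bits [16]byte
	var out [32]byte // hex doubles the length; 32 covers a 16-byte trace ID
	tg.Lock()
	defer tg.Unlock()
	tg.rnd.Read(bits[:idLen/2])
	hex.Encode(out[:idLen], bits[:idLen/2])
	return string(out[:idLen]) // the string conversion is the only heap allocation left
}
```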

iamemilio commented 3 weeks ago

I do think that an explicit unlock may improve the async performance of the system as a whole, since other goroutines would no longer have to wait for the encoding step, even if it isn't necessarily faster for this function in isolation. According to your pprof output, it does look like a majority of the execution time is being spent on allocating and garbage-collecting memory. We will take a look at your PR; I just enabled the tests and left a comment.

ankon commented 3 weeks ago

The fun one is: I benchmarked moving the unlock ... and it makes things slower.

goos: linux
goarch: amd64
pkg: github.com/newrelic/go-agent/v3/internal
cpu: AMD Ryzen 9 7940HS w/ Radeon 780M Graphics
BenchmarkTraceIDGenerator
BenchmarkTraceIDGenerator-16                        25461540            44.09 ns/op       32 B/op          2 allocs/op
BenchmarkTraceIDGeneratorParallel
BenchmarkTraceIDGeneratorParallel-16                10721215           107.5 ns/op        32 B/op          2 allocs/op
BenchmarkTraceIDGeneratorBaseline
BenchmarkTraceIDGeneratorBaseline-16                24150187            44.59 ns/op       32 B/op          2 allocs/op
BenchmarkTraceIDGeneratorBaselineParallel
BenchmarkTraceIDGeneratorBaselineParallel-16        15261769            74.36 ns/op       32 B/op          2 allocs/op

(baseline is simply the original generator, and non-baseline is the generator with the defer replaced with the explicit unlock)
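(For anyone wanting to reproduce: the parallel variants follow the standard b.RunParallel pattern, roughly like this; the constructor call and seed here are assumptions.)

```go
func BenchmarkTraceIDGeneratorParallel(b *testing.B) {
	tg := NewTraceIDGenerator(12345) // hypothetical constructor and seed
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			tg.GenerateTraceID()
		}
	})
}
```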

iamemilio commented 3 weeks ago

The fun one is: I benchmarked moving the unlock ... and it makes things slower.

🤯 How weird..

Anyway, the proposed changes will be in the next release. Thanks for taking the time. I am going to copy the merge comment from the PR here:

I think what you have proposed is a good incremental improvement, and I'm happy to merge it. There are a few oddities about the design and implementation of this object IMO, and I think you're right about the seeding. If anything, seeding based on timestamps may actually be decreasing the randomness; luckily, the odds of a duplicate are incredibly low. I don't see the point of tests with a specific seed either; they don't really verify much of the expected/intended behavior.

If you wanted to keep playing with this, I don't mind reviewing another PR. Otherwise, we can create a risk item to address this. I think a buffered channel of pre-generated Transaction_IDs, and another for Span_IDs, may do the trick. Creating a pool of generators may increase the risk of duplication depending on how they are seeded, and I am not sure the possible performance upside would be worth the added complexity of mitigating that.
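To illustrate the buffered-channel idea (just a sketch, not a concrete proposal; the function and method names here are assumed): a single feeder goroutine owns the generator, which also keeps the mutex off the hot path.

```go
// idFeeder pre-generates IDs into a buffered channel so the hot path
// becomes a single channel receive. Shutdown handling omitted for brevity.
func idFeeder(tg *TraceIDGenerator, buf int) <-chan string {
	ch := make(chan string, buf)
	go func() {
		for {
			ch <- tg.GenerateTraceID() // only this goroutine touches tg
		}
	}()
	return ch
}

// Hot path:
//   traceIDs := idFeeder(tg, 1024)
//   id := <-traceIDs
```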

ankon commented 3 weeks ago

🤯 How weird..

Yup. In Go 1.14 defers got pretty fast (see https://go.googlesource.com/go/+/be64a19d99918c843f8555aad580221207ea35bc; this is indeed an open-coded defer), but that doesn't answer the question of why the defer version is faster.

Playing in Compiler Explorer (https://godbolt.org/z/4W4a17jdn) shows some differences, but my low-level Go-fu isn't good enough to interpret them. Benchmarks to the rescue :)

I think a buffered channel of pre-generated Transaction_IDs, and another for Span_IDs, may do the trick. Creating a pool of generators may increase the risk of duplication depending on how they are seeded, and I am not sure the possible performance upside would be worth the added complexity of mitigating that.

I might play with that eventually; for now I'm looking forward to seeing how much this improves our situation in the real world. Iterating, piece by piece.