honeycombio / beeline-go

Legacy instrumentation for golang apps with Honeycomb
https://honeycomb.io
Apache License 2.0

Cache benchmarks #405

Closed. kentquirk closed this issue 11 months ago.

kentquirk commented 11 months ago

So I wanted to compare some ideas for improving this, and ended up spending more than an hour writing benchmarks and trying out alternatives.

All that is in here, but the summary is that we get a lot more predictable performance by using sync.Map instead of Go's native map. The alternative concurrent-map doesn't actually do better in this case.

The function getPrefixedFieldNameSync in this PR is the best-performing choice, IMO.
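For reference, a minimal sketch of the sync.Map approach, assuming the cache simply maps a raw field name to its "app."-prefixed form (the actual getPrefixedFieldNameSync in the PR may differ in its details):

```go
package beeline

import "sync"

// prefixedFieldNames caches raw field name -> "app."-prefixed name so the
// hot path avoids allocating a new string for every field on every event.
var prefixedFieldNames sync.Map

func getPrefixedFieldNameSync(name string) string {
	if v, ok := prefixedFieldNames.Load(name); ok {
		return v.(string)
	}
	prefixed := "app." + name
	// LoadOrStore returns the existing value if another goroutine won the
	// race, so every caller sees a single canonical string for this name.
	v, _ := prefixedFieldNames.LoadOrStore(name, prefixed)
	return v.(string)
}
```

This is the workload sync.Map is documented to be good at: keys are written once and then read many times from many goroutines, which is consistent with the predictable numbers below.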

Below are the results of the benchmark. orig is the original code (prepending app unconditionally), new is Mike's implementation, sync is using sync.Map, conc is using concurrent-map.

fN means there are N different field names and gN means there are N concurrent goroutines accessing the cache, so f2000-g300 is the most stressful case.
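For anyone who wants to reproduce this, the harness has roughly the following shape (a sketch rather than the PR's exact benchmark; the field-name generation, the inline orig closure, and the use of SetParallelism are assumptions):

```go
package beeline

import (
	"fmt"
	"testing"
)

// Sub-benchmark names encode the implementation, the number of distinct
// field names (fN), and the concurrency level (gN).
// getPrefixedFieldNameSync is the sketch above.
func BenchmarkGetPrefixedFieldNameSketch(b *testing.B) {
	impls := []struct {
		name string
		fn   func(string) string
	}{
		{"orig", func(s string) string { return "app." + s }}, // unconditional prepend
		{"sync", getPrefixedFieldNameSync},
	}
	for _, impl := range impls {
		for _, fields := range []int{50, 500, 2000} {
			names := make([]string, fields)
			for i := range names {
				names[i] = fmt.Sprintf("field%d", i)
			}
			for _, g := range []int{1, 50, 300} {
				b.Run(fmt.Sprintf("%s-f%d-g%d", impl.name, fields, g), func(b *testing.B) {
					// SetParallelism multiplies by GOMAXPROCS, so this only
					// approximates "g concurrent goroutines".
					b.SetParallelism(g)
					b.RunParallel(func(pb *testing.PB) {
						i := 0
						for pb.Next() {
							_ = impl.fn(names[i%len(names)])
							i++
						}
					})
				})
			}
		}
	}
}
```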

I also tried a variant of the new algorithm that replaces a random value in the map instead of recreating the map when it is full. This was not a winning strategy, so I didn't include it in the benchmark (but the code is there).
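For context, the eject-a-random-entry idea looks roughly like this when applied to a plain map behind a mutex (a sketch of the general technique, not the code in the PR; the limit and names are made up):

```go
package beeline

import "sync"

const maxCachedFields = 1000 // illustrative limit, not the PR's value

var (
	ejectMu    sync.Mutex
	ejectCache = make(map[string]string, maxCachedFields)
)

// getPrefixedFieldNameEject illustrates the eject idea: once the map is
// full, delete an arbitrary key (Go randomizes map iteration order, so the
// first key seen is effectively random) and insert the new entry instead
// of rebuilding the whole map.
func getPrefixedFieldNameEject(name string) string {
	ejectMu.Lock()
	defer ejectMu.Unlock()
	if v, ok := ejectCache[name]; ok {
		return v
	}
	if len(ejectCache) >= maxCachedFields {
		for k := range ejectCache {
			delete(ejectCache, k)
			break
		}
	}
	prefixed := "app." + name
	ejectCache[name] = prefixed
	return prefixed
}
```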

My recommendation is that we replace the current design in the PR with the contents of getPrefixedFieldNameSync and call it a day.

Running tool: /opt/homebrew/bin/go test -benchmem -run=^$ -bench ^BenchmarkGetPrefixedFieldNameParallel$ github.com/honeycombio/beeline-go -count=1

goos: darwin
goarch: arm64
pkg: github.com/honeycombio/beeline-go
BenchmarkGetPrefixedFieldNameParallel/orig-f50-g1-10            32143990            36.77 ns/op       22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f500-g1-10           32188372            36.89 ns/op       22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f2000-g1-10          27603550            38.83 ns/op       22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f50-g1-10             30958353            42.13 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f500-g1-10            28620691            42.92 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f2000-g1-10            9467629           126.7 ns/op       133 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f50-g1-10            25753398            49.62 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f500-g1-10           23410959            51.71 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f2000-g1-10          19778806            58.74 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f50-g1-10            29850870            38.17 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f500-g1-10           19890284            60.95 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f2000-g1-10          18314028            66.09 ns/op        0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f50-g50-10            7052200           167.0 ns/op        21 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f500-g50-10           7049428           168.0 ns/op        22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f2000-g50-10          7073343           170.0 ns/op        22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f50-g50-10             7069659           172.4 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f500-g50-10            6797158           175.1 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f2000-g50-10           4996476           242.6 ns/op       132 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f50-g50-10            7104078           162.8 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f500-g50-10           7332326           185.1 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f2000-g50-10          6079827           192.7 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f50-g50-10            6289045           193.2 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f500-g50-10           5752580           209.4 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f2000-g50-10          5574608           214.3 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f50-g300-10           7138944           164.4 ns/op        22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f500-g300-10          7169821           168.3 ns/op        22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/orig-f2000-g300-10         6930981           171.2 ns/op        22 B/op          1 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f50-g300-10            6868586           168.4 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f500-g300-10           6978414           181.3 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/new-f2000-g300-10          4246818           284.8 ns/op       130 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f50-g300-10           7042858           169.2 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f500-g300-10          6347726           188.3 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/sync-f2000-g300-10         5882048           200.0 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f50-g300-10           6072379           189.0 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f500-g300-10          5719530           214.3 ns/op         0 B/op          0 allocs/op
BenchmarkGetPrefixedFieldNameParallel/conc-f2000-g300-10         5473842           218.5 ns/op         0 B/op          0 allocs/op
PASS
ok      github.com/honeycombio/beeline-go   49.900s
robbkidd commented 11 months ago

I also tried a variant of the new algorithm that replaces a random value in the map instead of recreating the map when it is full. This was not a winning strategy, so I didn't include it in the benchmark (but the code is there).

With new + EjectRandom disqualified, do you think it is worth trying the recommended sync algorithm + EjectRandom to constrain map size?

kentquirk commented 11 months ago

I was going to say no, because sync.Map doesn't even have a Size() or Len() method, so we'd have to implement that separately. And then I just remembered that this cache is global to the process and thus could grow very large. I'm going to implement that and try it. Thanks for the poke.
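One possible shape for that experiment, with the size tracked separately via an atomic counter (a sketch under assumptions, not the code that was ultimately tried; it stops caching at the limit rather than evicting, which is the hard part):

```go
package beeline

import (
	"sync"
	"sync/atomic"
)

const cacheSizeLimit = 1000 // illustrative limit

var (
	boundedCache     sync.Map
	boundedCacheSize atomic.Int64
)

// getPrefixedFieldNameSyncBounded tracks the cache size with an atomic
// counter, incremented only when LoadOrStore actually inserts a new key.
// Once the limit is hit it simply stops caching; safely evicting entries
// is the part that needs extra locking.
func getPrefixedFieldNameSyncBounded(name string) string {
	if v, ok := boundedCache.Load(name); ok {
		return v.(string)
	}
	prefixed := "app." + name
	if boundedCacheSize.Load() >= cacheSizeLimit {
		return prefixed // full: return uncached rather than evict here
	}
	v, loaded := boundedCache.LoadOrStore(name, prefixed)
	if !loaded {
		boundedCacheSize.Add(1)
	}
	return v.(string)
}
```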

kentquirk commented 11 months ago

Update: I spent a while trying to limit the size of the caches in various ways, and all of them introduced significant performance bottlenecks once the limiting was done safely (behind locks).

I no longer think we should do this at all. More details are in the main issue.