crockeo opened this pull request 1 year ago
I don't love that it involves getting a new dependency into the stats lib; I didn't like having envconfig there either. I was hoping we could get to zero.
Real talk, what would it take to copy some of the bits of the lru cache into one file here?
Fixed in 1fa1009.
It was providing a map which contained a unique identifier each time it was called.
Just on this point @crockeo - this is obviously not a way the library is intended to be used - was this fixed in the service?
Yeah, this was always a risk with using sync.Map, but I figured it was only possible if there was a cardinality explosion in the metrics, and observability would catch it / inform the service owners. But maybe that's not true with tags, and TBH it's been a while.
this is obviously not a way the library is intended to be used - was this fixed in the service
Agreed that this isn't the intended use case, and yes, it was fixed in the service. My thought is that we should design the library to be resilient to misuse as long as that doesn't add performance issues or UX friction for the golden path. Closing a PR is easier than merging it, though, so please let me know if you disagree.
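For context, the failure mode being discussed looks roughly like this. This is a hedged reconstruction, not the actual service code: the metric name, tag key, and handler are made up, and only the unique-identifier-per-call pattern comes from the thread above.

```go
package example

import (
	"time"

	stats "github.com/lyft/gostats"
)

// Illustrative only: passing a unique tag value on every call means each
// request creates a brand-new timer entry in the store's cache, which in the
// pre-LRU implementation never shrinks.
func handleRequest(store stats.Store, requestID string) {
	t := store.NewTimerWithTags("request_latency", map[string]string{
		"request_id": requestID, // unique per request -> unbounded cardinality
	})
	start := time.Now()
	// ... handle the request ...
	t.AddValue(float64(time.Since(start) / time.Millisecond))
}
```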
@crockeo I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.
$ go test -run XXX -bench BenchmarkStoreNewPerInstanceCounter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100f854b8]
goroutine 6 [running]:
github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru.(*Cache[...]).Get(0x10105f160, {0x1400010a000, 0x1f})
/Users/cvieth/go/src/github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru/lru.go:96 +0x28
github.com/lyft/gostats.(*statStore).newCounter(0x140000f9d88, {0x1400010a000, 0x1f})
/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:428 +0x40
github.com/lyft/gostats.(*statStore).NewCounterWithTags(0x10101a900?, {0x100f8c517?, 0x100f8c18e?}, 0x2?)
/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:443 +0x40
github.com/lyft/gostats.(*statStore).NewPerInstanceCounter(0x10101a900?, {0x100f8c517, 0x4}, 0x14000191558)
/Users/cvieth/go/src/github.com/lyft/gostats/stats.go:457 +0x74
github.com/lyft/gostats.BenchmarkStoreNewPerInstanceCounter.func1(0x140000b9180)
/Users/cvieth/go/src/github.com/lyft/gostats/stats_test.go:479 +0x1e4
testing.(*B).runN(0x140000b9180, 0x1)
/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
/opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:226 +0x90
exit status 2
FAIL github.com/lyft/gostats 0.009s
I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.
Looks like I missed some test cases initializing statStore with a zero value, e.g. the benchmark you mentioned: https://github.com/lyft/gostats/blob/e6b88e67ba70cb8f1dbc2dcc020f779b3e17ab3e/stats_test.go#L471
I'll make sure to fix those before merging 🙂
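Whichever way the tests end up getting fixed, one common pattern for keeping a zero-value struct usable is to initialize the cache lazily. A minimal sketch, using the public hashicorp/golang-lru/v2 package rather than the vendored copy in this PR, with made-up field and type names that don't match statStore's real ones:

```go
package main

import (
	"fmt"
	"sync"

	lru "github.com/hashicorp/golang-lru/v2"
)

// store mimics the zero-value problem from the panic above: if the LRU is
// never initialized, calling Get on the nil cache dereferences nil.
type store struct {
	once     sync.Once
	counters *lru.Cache[string, int]
}

// cache lazily initializes the LRU so a zero-value store stays usable.
func (s *store) cache() *lru.Cache[string, int] {
	s.once.Do(func() {
		s.counters, _ = lru.New[string, int](1024) // size is illustrative
	})
	return s.counters
}

func main() {
	var s store // zero value, like the test cases that panicked
	s.cache().Add("counter_0", 1)
	fmt.Println(s.cache().Get("counter_0")) // 1 true
}
```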
Benching this with the function below shows a 3443.27% slowdown (mutex contention + list update), but ~5ns is a low base and the new time of ~171ns shouldn't be an issue (though the mutex contention does worry me a bit). That said, it's been a while since I was at Lyft, so I have no idea what's acceptable these days (I also made this thing as stupid fast as possible partly for my own enjoyment).
goos: darwin
goarch: arm64
pkg: github.com/lyft/gostats
                             │  map10.txt  │              lru10.txt               │
                             │   sec/op    │    sec/op      vs base               │
StoreNewCounterParallel-10     4.819n ± 4%   170.750n ± 2%  +3443.27% (p=0.000 n=10)
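For reference, this table appears to be benchstat output; a comparison like it can be reproduced by running the benchmark below with -count 10 against each build (using a go test invocation like the one earlier in the thread), saving the two results to map10.txt and lru10.txt, and running benchstat over the pair of files.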
New benchmark that just stresses the store:
// New benchmark that exclusively stresses the store
func BenchmarkStoreNewCounterParallel(b *testing.B) {
	s := NewStore(nullSink{}, false)
	t := time.NewTicker(time.Hour) // don't flush
	defer t.Stop()
	go s.Start(t)

	names := new([2048]string)
	for i := 0; i < len(names); i++ {
		names[i] = "counter_" + strconv.Itoa(i)
	}

	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for i := 0; pb.Next(); i++ {
			s.NewCounter(names[i%len(names)])
		}
	})
}
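A note on the benchmark's design: pre-generating the 2048 names before ResetTimer keeps name construction and its allocations out of the measured loop, and RunParallel hammers the store from multiple goroutines at once, which is what surfaces the mutex contention discussed above.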
The range thing looks like it does a good job of bringing down contention:
this PR:    BenchmarkStore_MutexContention-10    4006423   297.9 ns/op
with range: BenchmarkStore_MutexContention-10   16804858   69.96 ns/op
original:   BenchmarkStore_MutexContention-10   20921458   56.88 ns/op
but I'm afraid it's broken. I'll do more testing.
@crockeo One idea that I had many years ago while working on this library, and for which I actually wrote the code, was to tag counters and timers as used/unused and remove ones that went unused N times between flushes. The commit is here https://github.com/lyft/gostats/pull/160/commits/3b65338e34dc99be0c82ae3660440cee475c4d45 and I made a PR so that it's easier to track: https://github.com/lyft/gostats/pull/160.
The advantage of this approach is that you eliminate the need for locking around any of the Counter/Timer methods. Also, feel free to steal all the code you want from this commit/PR or just totally ignore it.
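For anyone skimming, here is a rough, self-contained sketch of that used/unused idea as described in the comment, not the code from the linked commit/PR: Add stays lock-free via atomics, and the flush loop drops counters that went unused for N consecutive flushes.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// counter's hot path (Add) uses only atomics, so no locking is needed
// around the Counter methods themselves.
type counter struct {
	value uint64 // pending delta, swapped out at flush time
	used  uint32 // set on Add, cleared at flush
}

func (c *counter) Add(delta uint64) {
	atomic.AddUint64(&c.value, delta)
	atomic.StoreUint32(&c.used, 1)
}

type store struct {
	counters map[string]*counter
	unused   map[string]int // consecutive flushes with no Add
}

// flush reports pending deltas and drops counters that went unused for n
// consecutive flushes. Only the flusher touches these maps here; concurrent
// counter registration would still need its own synchronization.
func (s *store) flush(n int) {
	for name, c := range s.counters {
		if atomic.SwapUint32(&c.used, 0) == 0 {
			s.unused[name]++
			if s.unused[name] >= n {
				delete(s.counters, name)
				delete(s.unused, name)
			}
			continue
		}
		s.unused[name] = 0
		fmt.Printf("flush %s=%d\n", name, atomic.SwapUint64(&c.value, 0))
	}
}

func main() {
	s := &store{counters: map[string]*counter{}, unused: map[string]int{}}
	s.counters["requests"] = &counter{}
	s.counters["stale"] = &counter{}

	s.counters["requests"].Add(5)
	s.flush(2)                   // reports requests=5; "stale" unused once
	s.flush(2)                   // "stale" unused twice -> removed
	fmt.Println(len(s.counters)) // 1
}
```

Again, this is just one reading of the comment; the linked commit is the authoritative version.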
I was looking at an application which uses gostats earlier today and found that it was leaking memory because it was misusing the gostats API. At a high level:

- The application was calling NewTimerWithTags with a unique set of tags with each request.
- statStore uses an unbounded cache, so its timers map was growing with each request.

This PR attempts to fix these issues by replacing the existing, unbounded cache with a bounded LRU cache. We can only do so for counters and timers because they are resilient to cache eviction:

- Counters report the diff between the last sent value and the current value. Because we flush at the time of eviction, there is no difference between value = x -> eviction -> add y and value = x -> add y. In both cases the next report will contain y.
- Timers don't maintain local state; instead they immediately call Sink.FlushTimer when a value is added. They can be evicted from the cache because any two AddValue calls to the same timer are independent, and so they will behave the same across two instances of the same timer.

Notably, we cannot replace the gauge sync.Map with lru.Cache, because a gauge is client-side stateful. Users can still encounter memory issues if they produce gauges with unique tags.
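To make the flush-at-eviction argument concrete, here is a minimal, self-contained sketch of the idea using the public hashicorp/golang-lru/v2 package (the PR vendors its own copy, and this counter type is illustrative, not gostats' actual implementation): the eviction callback flushes a counter's pending delta, so an evicted-and-recreated counter reports the same values as one that was never evicted.

```go
package main

import (
	"fmt"
	"sync/atomic"

	lru "github.com/hashicorp/golang-lru/v2"
)

// counter accumulates a delta that is reported (and reset) at flush time.
type counter struct{ value uint64 }

func (c *counter) Add(delta uint64) { atomic.AddUint64(&c.value, delta) }
func (c *counter) latch() uint64    { return atomic.SwapUint64(&c.value, 0) }

func main() {
	// Stand-in for sending a counter's pending delta to the sink.
	flush := func(name string, c *counter) {
		fmt.Printf("flush %s=%d\n", name, c.latch())
	}

	// Evicting an entry flushes it first, so "x -> eviction -> add y" and
	// "x -> add y" both end with the next report containing y.
	cache, _ := lru.NewWithEvict[string, *counter](2, flush)

	get := func(name string) *counter {
		if c, ok := cache.Get(name); ok {
			return c
		}
		c := &counter{}
		cache.Add(name, c)
		return c
	}

	get("a").Add(3)
	get("b").Add(1)
	get("c").Add(1) // evicts "a": prints flush a=3
	get("a").Add(2) // fresh "a"; evicts "b": prints flush b=1

	fmt.Printf("pending a=%d\n", get("a").latch()) // next report contains 2
}
```

Running it prints flush a=3 and flush b=1 at eviction time, and the recreated a only carries the 2 added after eviction, matching the argument above.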