lyft / gostats

Go client for Stats

[RFC] replace `sync.Map` with `lru.Cache` #158

Open crockeo opened 1 year ago

crockeo commented 1 year ago

I was looking at an application which uses gostats earlier today and found that it was leaking memory because it was misusing the gostats API. At a high level: the application was providing a tags map which contained a unique identifier each time it was called, so every call created a new entry in the store's unbounded internal cache.

This PR attempts to fix these issues by replacing the existing, unbounded cache with a bounded LRU cache. We can only do so for counters and timers because they are resilient to cache eviction.

Notably, we cannot replace the gauge's sync.Map with lru.Cache, because a gauge is stateful on the client side. Users can still encounter memory issues if they produce gauges with unique tags.

crockeo commented 1 year ago

I don't love that it pulls a new dependency into the stat lib; I didn't like having envconfig there either, and I was hoping we could get to zero.

Real talk: what would it take to copy some of the bits of the LRU cache into one file here?

Fixed in 1fa1009.

njo commented 1 year ago

> It was providing a map which contained a unique identifier each time it was called

Just on this point @crockeo - this is obviously not a way the library is intended to be used - was this fixed in the service?

charlievieth commented 1 year ago

Yeah, this was always a risk with using sync.Map, but I figured it was only possible if there was a cardinality explosion in the metrics, and observability would catch it / inform the service owners. But maybe that's not true with tags, and TBH it's been a while.

crockeo commented 1 year ago

> this is obviously not a way the library is intended to be used - was this fixed in the service

Agreed that this isn't the intended use case, and yes, it was fixed in the service. My thought is that we should design the library to be resilient to misuse, as long as that doesn't add performance issues or UX friction for the golden path. Closing a PR is easier than merging it, though, so please let me know if you disagree.

charlievieth commented 1 year ago

@crockeo I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

```
$ go test -run XXX -bench BenchmarkStoreNewPerInstanceCounter
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x100f854b8]

goroutine 6 [running]:
github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru.(*Cache[...]).Get(0x10105f160, {0x1400010a000, 0x1f})
    /Users/cvieth/go/src/github.com/lyft/gostats/internal/vendored/hashicorp/golang-lru/lru.go:96 +0x28
github.com/lyft/gostats.(*statStore).newCounter(0x140000f9d88, {0x1400010a000, 0x1f})
    /Users/cvieth/go/src/github.com/lyft/gostats/stats.go:428 +0x40
github.com/lyft/gostats.(*statStore).NewCounterWithTags(0x10101a900?, {0x100f8c517?, 0x100f8c18e?}, 0x2?)
    /Users/cvieth/go/src/github.com/lyft/gostats/stats.go:443 +0x40
github.com/lyft/gostats.(*statStore).NewPerInstanceCounter(0x10101a900?, {0x100f8c517, 0x4}, 0x14000191558)
    /Users/cvieth/go/src/github.com/lyft/gostats/stats.go:457 +0x74
github.com/lyft/gostats.BenchmarkStoreNewPerInstanceCounter.func1(0x140000b9180)
    /Users/cvieth/go/src/github.com/lyft/gostats/stats_test.go:479 +0x1e4
testing.(*B).runN(0x140000b9180, 0x1)
    /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:193 +0x12c
testing.(*B).run1.func1()
    /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:233 +0x50
created by testing.(*B).run1
    /opt/homebrew/Cellar/go/1.20.5/libexec/src/testing/benchmark.go:226 +0x90
exit status 2
FAIL    github.com/lyft/gostats 0.009s
```
crockeo commented 1 year ago

> I tried running the benchmarks for a comparison, but BenchmarkStoreNewPerInstanceCounter panicked.

Looks like I missed some test cases initializing statStore with a zero-value. E.g. for the benchmark you mentioned: https://github.com/lyft/gostats/blob/e6b88e67ba70cb8f1dbc2dcc020f779b3e17ab3e/stats_test.go#L471

I'll make sure to fix those before merging 🙂

charlievieth commented 1 year ago

Benchmarking this with the function below shows a 3443.27% slowdown (mutex contention + list update), but ~5ns is a low baseline and the new time of ~171ns shouldn't be an issue (though the mutex contention does worry me a bit). That said, it's been a while since I was at Lyft, so I have no idea what's acceptable these days (I also made this thing as stupid fast as possible, partly for my own enjoyment).

```
goos: darwin
goarch: arm64
pkg: github.com/lyft/gostats
                           │  map10.txt  │                lru10.txt                │
                           │   sec/op    │    sec/op      vs base                  │
StoreNewCounterParallel-10   4.819n ± 4%   170.750n ± 2%  +3443.27% (p=0.000 n=10)
```
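The table above is benchstat output. For reference, a typical way to produce such a comparison (file names match the table; the benchmark name and run count here are illustrative):

```shell
# On the baseline (sync.Map) branch: run the benchmark 10 times
go test -run XXX -bench StoreNewCounterParallel -count 10 > map10.txt

# On the LRU branch: same invocation
go test -run XXX -bench StoreNewCounterParallel -count 10 > lru10.txt

# Compare the two runs with statistical significance (deltas, p-values)
benchstat map10.txt lru10.txt
```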

New benchmark that just stresses the store:

```go
// New benchmark that exclusively stresses the store
func BenchmarkStoreNewCounterParallel(b *testing.B) {
	s := NewStore(nullSink{}, false)
	t := time.NewTicker(time.Hour) // don't flush
	defer t.Stop()
	go s.Start(t)
	names := new([2048]string)
	for i := 0; i < len(names); i++ {
		names[i] = "counter_" + strconv.Itoa(i)
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		for i := 0; pb.Next(); i++ {
			s.NewCounter(names[i%len(names)])
		}
	})
}
```
crockeo commented 1 year ago

The range thing looks like it does a good job of bringing down contention:

```
this PR:    BenchmarkStore_MutexContention-10   4006423  297.9  ns/op
with range: BenchmarkStore_MutexContention-10  16804858   69.96 ns/op
original:   BenchmarkStore_MutexContention-10  20921458   56.88 ns/op
```

But I'm afraid it's broken. I'll do more testing.

charlievieth commented 1 year ago

@crockeo One idea I had many years ago while working on this library (I actually wrote the code for it) was to tag counters and timers as used/unused and remove the ones that go unused N times between flushes. The commit is here: https://github.com/lyft/gostats/pull/160/commits/3b65338e34dc99be0c82ae3660440cee475c4d45, and I made a PR so that it's easier to track: https://github.com/lyft/gostats/pull/160.

The advantage of this approach is that it eliminates the need for locking around any of the Counter/Timer methods. Also, feel free to steal all the code you want from this commit/PR, or just totally ignore it.