uber-go / tally

A Go metrics interface with fast buffered metrics and third party reporters
MIT License
850 stars 116 forks source link

Data race between `ForEachScope` and `subscope` can cause immediate process crash #211

Closed Groxx closed 1 year ago

Groxx commented 1 year ago

ForEachScope does not protect scopedbucket.s with the mutex that subscope holds while mutating, leading to crashes like this (for an older version):

Write at 0x00c0004ded20 by goroutine 132:
  runtime.mapassign_faststr()
      GOROOT/src/runtime/map_faststr.go:203 +0x0
  github.com/uber-go/tally.(*scopeRegistry).Subscope()
      external/com_github_uber_go_tally/scope_registry.go:181 +0x1270
  github.com/uber-go/tally.(*scope).subscope()
      external/com_github_uber_go_tally/scope.go:434 +0x84
  github.com/uber-go/tally.(*scope).Tagged()
      external/com_github_uber_go_tally/scope.go:425 +0x2c
...

Previous read at 0x00c0004ded20 by goroutine 75:
  runtime.mapiternext()
      GOROOT/src/runtime/map.go:866 +0x0
  github.com/uber-go/tally.(*scopeRegistry).ForEachScope()
      external/com_github_uber_go_tally/scope_registry.go:109 +0x172
  github.com/uber-go/tally.(*scope).Snapshot()
      external/com_github_uber_go_tally/scope.go:447 +0x1a4
...

The cause is pretty clear, reads of line 128 here: https://github.com/uber-go/tally/blob/007e844c400c1a16420fe6019d12f010acb11f73/scope_registry.go#L126-L134 race with writes at the end of subscope: https://github.com/uber-go/tally/blob/007e844c400c1a16420fe6019d12f010acb11f73/scope_registry.go#L199-L205

For that range to be correct, it needs to hold a lock around s, like subscope does:

func (r *scopeRegistry) ForEachScope(f func(*scope)) {
    for _, subscopeBucket := range r.subscopes {
        subscopeBucket.mu.RLock()
        for _, s := range subscopeBucket.s {
            f(s)
        }
        subscopeBucket.mu.RUnlock()
    }
}

I can probably make a PR for this, but I figured I'd document it first so I don't forget before I get around to it (and open it up for others to do so, if I do forget). Making sure there isn't a bad perf regression and setting up a regression test will take a bit of time.

Groxx commented 1 year ago

Fixed in #212