signalfx / signalfx-go

Go client library and instrumentation bindings for SignalFx
https://www.signalfx.com
Apache License 2.0
14 stars 48 forks source link

wrap Client.channelsByName in mutex #73

Closed rma-stripe closed 4 years ago

rma-stripe commented 4 years ago

When reusing a Client instance over the course of multiple (c *Client) Execute calls, the lack of a mutex in client.go causes a data race.

I wrote an example program at: https://gist.github.com/rma-stripe/efe7d7b650baae4c3a1a7a71f6eb1462

To run:

> SFX_AUTH_TOKEN=... go run -race reproduce_race.go
2020/03/11 11:39:18 data('system.uptime', filter=partition_filter(1, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:18 data('system.uptime', filter=partition_filter(1, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:18 data('system.uptime', filter=partition_filter(1, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:18 data('system.uptime', filter=partition_filter(1, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:19 data('system.uptime', filter=partition_filter(2, 50) and filter('host_type', 'omg')).count().publish()
==================
WARNING: DATA RACE
Read at 0x00c00012e000 by goroutine 18:
  runtime.mapaccess2_faststr()
      /usr/local/Cellar/go/1.13.6/libexec/src/runtime/map_faststr.go:107 +0x0
  github.com/signalfx/signalfx-go/signalflow.(*Client).handleMessage()
      /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:239 +0x207
  github.com/signalfx/signalfx-go/signalflow.(*Client).run()
      /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:184 +0x46b

Previous write at 0x00c00012e000 by goroutine 9:
  runtime.mapassign_faststr()
      /usr/local/Cellar/go/1.13.6/libexec/src/runtime/map_faststr.go:202 +0x0
  github.com/signalfx/signalfx-go/signalflow.(*Client).registerChannel()
      /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:306 +0x1cb
  github.com/signalfx/signalfx-go/signalflow.(*Client).Execute()
      /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:288 +0x1fd
  main.main.func1()
      /Users/rma/Downloads/reproduce_race.go:26 +0x336

Goroutine 18 (running) created at:
  github.com/signalfx/signalfx-go/signalflow.NewClient()
      /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:166 +0xa8d
  main.main.func1()
      /Users/rma/Downloads/reproduce_race.go:19 +0xd3

Goroutine 9 (running) created at:
  main.main()
      /Users/rma/Downloads/reproduce_race.go:18 +0xcd
==================
2020/03/11 11:39:19 data('system.uptime', filter=partition_filter(2, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:19 data('system.uptime', filter=partition_filter(2, 50) and filter('host_type', 'omg')).count().publish()
2020/03/11 11:39:19 data('system.uptime', filter=partition_filter(3, 50) and filter('host_type', 'omg')).count().publish()
^Csignal: interrupt

The specific segfault we hit was:

fatal error: concurrent map read and map write

goroutine 48 [running]:
runtime.throw(0x132ca00, 0x21)
        /usr/local/Cellar/go/1.13.6/libexec/src/runtime/panic.go:774 +0x72 fp=0xc004079d98 sp=0xc004079d68 pc=0x102ec72
runtime.mapaccess2_faststr(0x12c0e40, 0xc0000a8cf0, 0xc056663718, 0x4, 0xc0000c16e0, 0xc0000c1601)
        /usr/local/Cellar/go/1.13.6/libexec/src/runtime/map_faststr.go:116 +0x48f fp=0xc004079e08 sp=0xc004079d98 pc=0x101440f
github.com/signalfx/signalfx-go/signalflow.(*Client).handleMessage(0xc0000c6280, 0xc059f02000, 0x75, 0x600, 0x1, 0x0, 0x0)
        /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:239 +0x161 fp=0xc004079e80 sp=0xc004079e08 pc=0x127a181
github.com/signalfx/signalfx-go/signalflow.(*Client).run(0xc0000c6280)
        /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:184 +0x2c5 fp=0xc004079fd8 sp=0xc004079e80 pc=0x1279cb5
runtime.goexit()
        /usr/local/Cellar/go/1.13.6/libexec/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc004079fe0 sp=0xc004079fd8 pc=0x105b0c1
created by github.com/signalfx/signalfx-go/signalflow.NewClient
        /Users/rma/go/src/github.com/signalfx/signalfx-go/signalflow/client.go:166 +0x48b

As I understand it, channels correspond to computations. So I believe we segfaulted when we called Execute --> registerChannel --> write to channelsByName at the same time as a message came in over the websocket, resulting in handleMessage --> read from channelsByName.

With this patch, we no longer can reproduce the race condition using go run -race, and we no longer segfault.

cory-signalfx commented 4 years ago

One of those marvelous PRs that is 2 lines with a page of explanation. Thank you so much @rma-stripe!