signalfx / signalfx-go

Go client library and instrumentation bindings for SignalFx
https://www.signalfx.com
Apache License 2.0
14 stars 48 forks source link

Data races when retrieving metadata #96

Closed kerbyhughes closed 4 years ago

kerbyhughes commented 4 years ago

There are several data race conditions in computation.go. These are detectable as of v1.7.6, with the existing tests using go test -race. Here is one test:

Click to expand test output ``` ~/signalfx-go/signalflow$ go test -race -run TestBuffersDataMessages ================== WARNING: DATA RACE Write at 0x00c0000a9230 by goroutine 9: runtime.mapassign_fast64() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/map_fast64.go:92 +0x0 github.com/signalfx/signalfx-go/signalflow.(*Computation).processMessage() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:271 +0x12e7 github.com/signalfx/signalfx-go/signalflow.(*Computation).watchMessages() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:219 +0x1a1 Previous read at 0x00c0000a9230 by goroutine 6: runtime.mapaccess1_fast64() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/map_fast64.go:12 +0x0 github.com/signalfx/signalfx-go/signalflow.(*Computation).TSIDMetadata.func1() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:190 +0x87 github.com/signalfx/signalfx-go/signalflow.(*Computation).waitForMetadata() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:106 +0xa9 github.com/signalfx/signalfx-go/signalflow.(*Computation).TSIDMetadata() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:190 +0x72 github.com/signalfx/signalfx-go/signalflow.TestBuffersDataMessages() /Users/kerby.hughes/signalfx-go/signalflow/computation_test.go:52 +0x44a testing.tRunner() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:865 +0x163 Goroutine 9 (running) created at: github.com/signalfx/signalfx-go/signalflow.newComputation() /Users/kerby.hughes/signalfx-go/signalflow/computation.go:83 +0x3e5 github.com/signalfx/signalfx-go/signalflow.TestBuffersDataMessages() /Users/kerby.hughes/signalfx-go/signalflow/computation_test.go:37 +0x248 testing.tRunner() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:865 +0x163 Goroutine 6 (running) created at: testing.(*T).Run() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:916 +0x65a testing.runTests.func1() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:1157 +0xa8 testing.tRunner() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:865 +0x163 testing.runTests() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:1155 +0x523 testing.(*M).Run() /usr/local/Cellar/go@1.12/1.12.17/libexec/src/testing/testing.go:1072 +0x2eb main.main() _testmain.go:76 +0x222 ================== --- FAIL: TestBuffersDataMessages (0.00s) testing.go:809: race detected during execution of test FAIL exit status 1 FAIL github.com/signalfx/signalfx-go/signalflow 0.195s ~/signalfx-go/signalflow$ ```

When using this library as in the example, the computation data channel returns messages whose payloads include a value and a tsid, which is an internal identifier. In order to get the metadata for a value, the user of this library calls the TSIDMetadata function which looks up the tsid in an internal map. However, the computation may still be writing data to this internal map, and the map is not thread-safe, causing the runtime to panic. The calls to waitForMetadata which are used only use a lock on the first read, not the read in the return statement, and there are no write locks to these shared values in processMessage

The example code seems to avoid thse conditions because of the blocking print calls to comp.Resolution(), comp.MaxDelay(), and comp.Lag(). This ensures that by the time the example code reads from the internal metadata map, the computation is done writing. All of the calls to shared data on the computation are data races, however since they only read struct values and not map lookups, they do not cause the runtime to panic.

The panic can be seen with the example code by commenting out the informational metadata calls, and providing a signalflow program that immediately starts returning data:

~/signalfx-go/signalflow/example$ git diff
diff --git a/signalflow/example/main.go b/signalflow/example/main.go
index cdf97ea..458d516 100644
--- a/signalflow/example/main.go
+++ b/signalflow/example/main.go
@@ -23,7 +23,7 @@ func main() {

        program := os.Getenv("SIGNALFLOW_PROGRAM")
        if program == "" {
-               program = "data('cpu.utilization').publish()"
+               program = "data('cpu.utilization').mean(cycle='hour', shift_cycles=1).publish()"
        }

        comp, err := c.Execute(&signalflow.ExecuteRequest{
@@ -34,9 +34,9 @@ func main() {
                return
        }

-       fmt.Printf("Resolution: %v\n", comp.Resolution())
-       fmt.Printf("Max Delay: %v\n", comp.MaxDelay())
-       fmt.Printf("Detected Lag: %v\n", comp.Lag())
+       //fmt.Printf("Resolution: %v\n", comp.Resolution())
+       //fmt.Printf("Max Delay: %v\n", comp.MaxDelay())
+       //fmt.Printf("Detected Lag: %v\n", comp.Lag())

        go func() {
                for msg := range comp.Expirations() {
~/signalfx-go/signalflow/example$

~/signalfx-go/signalflow/example$ go run main.go
some data comes in...
cpu.utilization map[redacted]: 2.0975595494830506
cpu.utilization map[redacted]: 4.8673436697145425
cpu.utilization map[redacted]: 19.83757310720653
fatal error: concurrent map read and map write

goroutine 1 [running]:
runtime.throw(0x13114d1, 0x21)
    /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/panic.go:617 +0x72 fp=0xc000118db8 sp=0xc000118d88 pc=0x102d092
runtime.mapaccess1_fast64(0x12ad020, 0xc000148090, 0x6c9bb1ce, 0x0)
    /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/map_fast64.go:21 +0x1b3 fp=0xc000118de0 sp=0xc000118db8 pc=0x1012423
github.com/signalfx/signalfx-go/signalflow.(*Computation).TSIDMetadata(0xc00009a340, 0x6c9bb1ce, 0x1309b01)
    /Users/kerby.hughes/signalfx-go/signalflow/computation.go:193 +0xab fp=0xc000118e28 sp=0xc000118de0 pc=0x126a63b
main.main()
    /Users/kerby.hughes/signalfx-go/signalflow/example/main.go:55 +0x3d8 fp=0xc000118f98 sp=0xc000118e28 pc=0x126e7b8
runtime.main()
    /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/proc.go:200 +0x20c fp=0xc000118fe0 sp=0xc000118f98 pc=0x102e9ec
runtime.goexit()
    /usr/local/Cellar/go@1.12/1.12.17/libexec/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000118fe8 sp=0xc000118fe0 pc=0x1058b61

Fixes:

One possible solution is to use the existing locking struct updateSignal to wrap all reads and writes to these shared values on the Computation struct. I will submit a PR with these changes.

Alternatives:

Return metadata over blocking channels, so that the caller is safe to wait for these in any order.

Thanks