vapourismo / knx-go

KNX clients and protocol implementation in Go
MIT License

race condition in connection state handling #90

Closed torbenschinke closed 2 months ago

torbenschinke commented 3 months ago

During development with the Go race detector enabled, I noticed the following data race warning:

==================
WARNING: DATA RACE
Write at 0x00c0000c4550 by goroutine 26:
  runtime.recvDirect()
      /opt/homebrew/opt/go/libexec/src/runtime/chan.go:348 +0x7c
  github.com/vapourismo/knx-go/knx.(*Tunnel).process.deferwrap1()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:471 +0x34
  runtime.deferreturn()
      /opt/homebrew/opt/go/libexec/src/runtime/panic.go:602 +0x5c
  github.com/vapourismo/knx-go/knx.(*Tunnel).serve()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:554 +0x21c
  github.com/vapourismo/knx-go/knx.NewTunnel.gowrap1()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:616 +0x34

Previous read at 0x00c0000c4550 by goroutine 3401:
  runtime.chansend1()
      /opt/homebrew/opt/go/libexec/src/runtime/chan.go:146 +0x2c
  github.com/vapourismo/knx-go/knx.(*Tunnel).handleConnStateRes.func1()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:455 +0x100

Goroutine 26 (running) created at:
  github.com/vapourismo/knx-go/knx.NewTunnel()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:616 +0x330
  github.com/vapourismo/knx-go/knx.NewGroupTunnel()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:653 +0x70
  gitlab/group/wzo-bms/subnet-gateway/domain/connector/knxip/adapter/vapourismo.(*Broker).connect()
      /Users/tschinke/git/gitlab/group/wzo-bms/subnet-gateway.git/domain/connector/knxip/adapter/vapourismo/adapter.go:94 +0x208
  gitlab/group/wzo-bms/subnet-gateway/domain/connector/knxip/adapter/vapourismo.NewBroker.func1()
      /Users/tschinke/git/gitlab/group/wzo-bms/subnet-gateway.git/domain/connector/knxip/adapter/vapourismo/adapter.go:33 +0x34

Goroutine 3401 (running) created at:
  github.com/vapourismo/knx-go/knx.(*Tunnel).handleConnStateRes()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:447 +0x1a0
  github.com/vapourismo/knx-go/knx.(*Tunnel).process()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:531 +0x5bc
  github.com/vapourismo/knx-go/knx.(*Tunnel).serve()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:554 +0x21c
  github.com/vapourismo/knx-go/knx.NewTunnel.gowrap1()
      /Users/tschinke/go/pkg/mod/github.com/vapourismo/knx-go@v0.0.0-20240529174927-d4625b0a8913/knx/tunnel.go:616 +0x34
==================

I'm not sure whether I'm doing something unexpected, but I don't believe so.

vapourismo commented 3 months ago

How can I reproduce this?

torbenschinke commented 2 months ago

I was able to reproduce it using a minimal example; it seems to happen when closing the connection.

package main

import (
    "fmt"
    "time"

    "github.com/vapourismo/knx-go/knx"
)

func main() {
    client, err := knx.NewGroupTunnel("10.6.6.140:3671", knx.DefaultTunnelConfig)
    if err != nil {
        panic(err)
    }

    go func() {
        for msg := range client.Inbound() {
            fmt.Println(msg.Command.String(), msg.Destination.String())
        }
    }()

    time.Sleep(20 * time.Second)
    client.Close()
}

Run it with the race detector enabled:

go run -race testmod/knx/cmd

vapourismo commented 2 months ago

I can't reproduce this. I also struggle to understand what exactly the data race is in this case. It seems like it's a write paired with a read, which seems perfectly fine - especially since the read is allowed to fail.

I don't think I can help you without a lot more information and/or a MCVE.

torbenschinke commented 2 months ago

Did you really try to reproduce it with the race detector enabled (the -race flag)? A race condition requires at least one goroutine that writes and another that reads. Without any kind of synchronization this will always result in a data race, so by definition that is never fine.

As far as I can see, this always happens when one goroutine writes to a channel and another goroutine closes it. This is exactly what the race detector traces show: in handleConnStateRes a goroutine is spawned to write into the heartbeat channel, while that channel is closed from a different goroutine by process.

See also this SO discussion.

vapourismo commented 2 months ago

What you're describing is not a race condition to me unless closing a channel is not atomic.

Also see this comment, we don't care about writing to closed channels: https://github.com/vapourismo/knx-go/blob/3b325e3f5dcf040dbe30c3ae657e004267b96ce6/knx/tunnel.go#L448

torbenschinke commented 2 months ago

Sorry, I don't want to be rude, but it is obviously a race condition. Its existence has been proven by the fact alone that the Go race detector has found it; there are no known false-positive cases. We now have the task of understanding and explaining the situation. Your recover logic and comment are just a symptom of incorrect channel usage. The issue is not about sending to a closed channel; the issue is that you are using the channels in a way that lets them race internally. What you are essentially doing is the following:

package main

import "fmt"

func main() {
    heartbeat := make(chan bool)
    defer close(heartbeat)
    defer func() {
        if r := recover(); r != nil {
            fmt.Println("Recovered in f", r)
            go func() {
                heartbeat <- true
            }()
        }
    }()
    for {
        select {
        case <-heartbeat:
        default:
            panic("panic")
        }
    }
}

When you run this with the race detector enabled, you see the same stack trace of the racing reads and writes inside the channel code of the runtime itself:

Recovered in f panic
==================
WARNING: DATA RACE
Read at 0x00c00007e0d0 by goroutine 7:
  runtime.chansend1()
      /opt/homebrew/opt/go/libexec/src/runtime/chan.go:146 +0x2c

....

Previous write at 0x00c00007e0d0 by main goroutine:
  runtime.recvDirect()
      /opt/homebrew/opt/go/libexec/src/runtime/chan.go:348 +0x7c

....

I can only recommend that you take the time to refresh your knowledge about race conditions and how the Go race detector works (see https://go.dev/blog/race-detector). It has no false-positive findings: every data race it finds is a real race. However, the documentation also tells us that it cannot find every race, but that is not the point here. I cannot tell whether this channel behavior is intentional on the part of the Go language designers or not, but we can be sure that this behavior is technically wrong, as proven by the observations of the race detector.

Very good reading specifically about the Go memory model has been written by one of the core developers: https://research.swtch.com/mm.

torbenschinke commented 2 months ago

Regarding

unless closing a channel is not atomic.

It is documented at the end of the article https://go.dev/doc/articles/race_detector that this is indeed not "atomic" in the sense you mean:

Unsynchronized send and close operations ... As this example demonstrates, unsynchronized send and close operations on the same channel can also be a race condition ...

Note also similar discussions in https://github.com/golang/go/issues/27769 or https://github.com/golang/go/issues/30372