Open EVODelavega opened 4 years ago
That looks like a race between reading the length of the channel and writing to the channel. That looks like a correct report to me.
@randall77 Thanks for looking at this. Commenting out the assertion on the channel length doesn't change things, so I don't think that's what is causing the report. I could be wrong, obviously, but considering what statements are being referred to in the report, len(cCh) doesn't show up.
It would help if you could provide complete stand-alone code for the problem, and if the race report that you show corresponds to that code.
That said, I see that (*Broker).Send starts a goroutine. In TestRace I see that the code calls (*Broker).Send three times. That starts three goroutines. If I'm reading your mocking code correctly, it calls wg.Done as soon as the calling code calls s.C, which is to say as soon as the goroutine enters the select statement. So the goroutine calls wg.Done when it enters the select statement, but before it chooses any case from the select statement. Then TestRace reads from the channel. While I can't claim to understand this code, it seems possible that after reading from the channel, the select statement sends to the channel. And then TestRace closes the channel. There is no synchronization between that send and that close, so you get a race report.
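To illustrate the point about synchronization, here is a minimal sketch of one way to order the close after every send attempt. It is not the code from the issue: sendAll and its parameters are hypothetical names, and the assumption is that the goroutine itself (rather than a mock callback inside the channel accessor) signals the wait group, and only after its select statement has finished.

```go
package main

import (
	"fmt"
	"sync"
)

// sendAll is a hypothetical helper, not part of the code in the issue.
// It starts n goroutines that each attempt a non-blocking send on ch and
// returns how many of those sends were skipped.
func sendAll(ch chan int, n int) int {
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		skipped int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(v int) {
			// Done runs only after the select below has completed, so
			// wg.Wait cannot return while a send is still possible.
			defer wg.Done()
			select {
			case ch <- v:
			default:
				mu.Lock()
				skipped++
				mu.Unlock()
			}
		}(i)
	}
	wg.Wait() // every select statement has finished at this point
	return skipped
}

func main() {
	ch := make(chan int, 1)
	skipped := sendAll(ch, 3)
	close(ch) // ordered after all send attempts by wg.Wait
	fmt.Println("skipped:", skipped, "buffered:", len(ch))
}
```

With a buffer of one and nobody receiving, one send succeeds and the other two fall through to the default case, which matches the two "skipped" log lines described in the issue.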
The len(cCh) is the access that caused the race in the report you showed us. At least, assuming broker_test.go you gave us and stuff_test.go you ran with are identical.
If there is still a race without that line, show us that updated race report.
(It will probably be on the close(cCh) because that also races with the channel send.)
@ianlancetaylor The mock .Do() function that calls wg.Done() is invoked after the mock has returned the channel. I've created a repo with the code so it's easy to recreate the issue: https://github.com/EVODelavega/go-race
The channel is closed after the wg is done, so after the channel was returned to the 3 routines. The routines run sequentially by definition (each routine requires a mutex lock), so only the last call to Send could be the time a data race happens. Then again, the wg.Wait call makes sure that the routine has indeed called sub.C(). Between my waiting for the waitgroup, and my closing the channel, I close the done channel, read from the channel, check its length, etc... and still, the data race is reported. The only way to get around it is to make an explicit call to the broker, removing the mock, which acquires a lock, and thus waits for the goroutine to return.
@randall77 As you can see in the repo I created: the code to reproduce the issue is indeed identical to what I posted here. The len(cCh) statement is 100% not a data race, just looking at the language specifications:
A single channel may be used in send statements, receive operations, and calls to the built-in functions cap and len by any number of goroutines without further synchronization. Channels act as first-in-first-out queues. For example, if one goroutine sends values on a channel and a second goroutine receives them, the values are received in the order sent.
Checking length and cap of a channel doesn't require synchronisation, so there is no data race possible there. The closing of the channel is what the race detector takes issue with, unless I remove the mock from the broker first. The issue I have is that the behaviour of the unit tests, and indeed the broker, is 100% deterministic. I use the waitgroup for synchronisation, and close(dCh) is not flagged up. Of course, the routines only read from the done channel, but looking at the coverage report: this case is never selected anyway, and instead I see the broker outputs "Skipped broker 1" twice (which tells me the select executed the default case).
Try as I might, to me there are only 2 ways around this issue: after the wg.Wait() statement, I have to make a second call to the broker (Unsubscribe or Subscribe), which acquires a mutex lock, ensuring the routines have returned, or call Send a fourth time after closing the dCh. In both cases, it really does feel like I'm appeasing the race detector, because, as I keep saying, the behaviour of the code is deterministic to the best of my knowledge.
the len(cCh) statement is 100% not a data race, just looking at the language specifications:
You're right, this is not a race. Why then did it report the line number of the len call?
Your repro reports the close call on the next line.
I agree with Ian. The compiler converts:
select {
case ...:
case s.C() <- v:
}
To
tmp := s.C()
select {
case ...:
case tmp <- v:
}
In between the assignment and the select statement, the wait group can be decremented to 0 (inside the 3rd C call), allowing the main routine to run to the close statement and cause a race with the select case.
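The evaluation order described above can be observed directly. The sketch below is an illustration only (evalOrder and getCh are hypothetical names, with getCh standing in for s.C()): it fills a buffered channel so the send case cannot proceed, then records which steps run. Per the Go specification, the channel expression of a send case is evaluated once on entering the select statement, so the accessor runs even though the default case is the one chosen.

```go
package main

import "fmt"

// evalOrder records which steps ran, to show that the channel expression in
// a send case is evaluated when the select statement is entered, before any
// case is chosen. The function name is hypothetical.
func evalOrder() []string {
	var steps []string
	ch := make(chan int, 1)
	ch <- 0 // fill the buffer so the send case cannot proceed

	// getCh stands in for s.C() from the thread.
	getCh := func() chan int {
		steps = append(steps, "channel expression evaluated")
		return ch
	}

	select {
	case getCh() <- 1:
		steps = append(steps, "send case chosen")
	default:
		steps = append(steps, "default case chosen")
	}
	return steps
}

func main() {
	for _, s := range evalOrder() {
		fmt.Println(s)
	}
}
```

Any side effect inside the accessor, such as a mock decrementing a wait group, therefore happens before the select has decided whether to send.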
I've gone over this so many times, thinking I must've missed something, but I do believe I have found a case where the race detector returns a false positive (i.e. a data race where there really isn't a data race). It seems to be something that happens when writing to a channel in a select-case statement directly.
The unit tests trigger the race detector, even though I'm ensuring all calls accessing the channel have been made using a callback and a waitgroup.
I have the channels in a map, which I access through a mutex. The data race vanishes the moment I explicitly remove the type that holds the channel from this map. The only reason I am able to do so is that the mutex is released, so once again: I'm certain everything behaves correctly. Code below.
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
What did you do?
I'm writing a simple message event bus where the broker pushes data onto a channel of its subscribers/consumers. If the channel buffer is full, I don't want the broker to block, so I'm using routines, and a select statement to skip writes to a channel with a full buffer. To make life easier WRT testing, I'm mocking a subscriber interface, and I'm exposing the channels through functions (similar to context.Context.Done() and the like).
My tests all pass, and everything behaves as expected. However, running the same tests with the race detector, I'm getting what I believe to be a false positive. I have a test where I send data to a subscriber that isn't consuming the messages. The channel buffer is full, and I want to ensure that the broker doesn't block. To make sure I've tried to send all data, I'm using a waitgroup to check if the subscriber has indeed been accessed N number of times (where N is the number of events I'm sending). Once the waitgroup is done, I validate what data is on the channel, make sure it's empty, and then close it. The statement where I close the channel is marked as a data race.
If I do the exact same thing, but remove the subscriber from the broker, the data race magically is no more. Here's the code to reproduce the issue:
broker.go
broker_test.go
See the data race by running:
go test -v -race ./broker/... -run TestRace
What did you expect to see?
I expect to see log output showing that the subscriber was skipped twice (output I do indeed see), and no data race
What did you see instead?
I still saw the code behaved as expected, but I do see a data race reported:
Though I'm not certain, my guess is that the expression s.C() <- v, because it's a case expression, is what trips the race detector up here. The channel buffer is full, so any writes would be blocking if I'd put the channel write in the default case. As it stands, the write cannot possibly be executed, so instead my code logs the fact that a subscriber is being skipped, the routine ends (defer func unlocks the mutex), and the mock callback decrements the waitgroup. Once the waitgroup is empty, all calls to my mock subscriber have been made, and the channel can be safely closed.
It seems, however, that I need to add the additional call, removing the mock from the broker to "reset" the race detector state. I'll try and have a look at the source, maybe something jumps out.