myitcv opened this issue 3 years ago
The fmt package has been tested and race-tested innumerable times. It could indeed be a bug in the library but I doubt it. It's much likelier to be a problem in your code or environment that reveals itself as this race.
That's not helpful, I know, but we would need a reproducible case to do any serious debugging of such a well-known and well-tested package.
CC @dvyukov
I wonder if this could be due to the sync.Pool. The sync.Pool code has special handling to avoid triggering the race detector on values put into the pool and then retrieved from the pool, but it's not clear to me that that code handles a slice referenced by the object in the pool. I wonder if it is possible that the race detector is not detecting the synchronization implied by using the pool.
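For context, the pattern under discussion looks roughly like the sketch below. This is illustrative only, not the actual fmt source: a pooled object keeps a reference to a byte slice that is reused across Get/Put cycles, and the caller only ever receives a copy of its contents.

package main

import "sync"

// printer mimics the shape of fmt's internal printer state: a pooled
// object that owns a byte slice reused across Get/Put cycles.
type printer struct {
    buf []byte
}

var printerPool = sync.Pool{
    New: func() interface{} { return &printer{buf: make([]byte, 0, 64)} },
}

// format appends into the pooled buffer, copies the result out, and
// returns the object to the pool. The string conversion copies, so the
// internal buffer itself never escapes to the caller.
func format(s string) string {
    p := printerPool.Get().(*printer)
    p.buf = append(p.buf[:0], s...)
    out := string(p.buf)
    printerPool.Put(p)
    return out
}

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 1000; j++ {
                _ = format("hello")
            }
        }()
    }
    wg.Wait()
}

Whether the race detector's view of the pooled object also covers the memory behind p.buf is exactly the question raised above; running a sketch like this under go run -race is one way to probe it.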
The fmt package has been tested and race-tested innumerable times. It could indeed be a bug in the library but I doubt it. It's much likelier to be a problem in your code or environment that reveals itself as this race.
Thanks, @robpike. I used the word "apparent" in the title to try and convey the fact that we're not definitely looking at a race, but a situation the race detector thinks is a race (I didn't do that in the body of the description - now fixed). My analysis and assertion that the problem is not in our code might be wrong of course.
The first stack seems to be on the fmt internal buffer (which is either newly allocated memory or came from sync.Pool, right?). But what's strange is that the second stack seems to be on some user slice here: https://github.com/govim/govim/blob/main/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go#L121 which was received from the f.input channel: https://github.com/govim/govim/blob/main/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go#L116

Do internal fmt buffers ever leave the fmt package? I don't see how this can happen. Say, fmt.Sprintf would return a copy of the internal buffer, right?
Is this race reliably reproducible? Does it always look the same way with the same stacks?
It's the first time we've seen it unfortunately. Will obviously add more details here in case we find them.
But what's strange is that the second stack seems to be on some user slice here
I'm not sure I can definitively say it is a user slice.
Do internal fmt buffers ever leave fmt package? I don't see how this can happen. Say, fmt.Sprintf would return a copy of the internal buffer, right?
Slightly stabbing in the dark (very much out of my depth), but is it possible that a compiler optimisation determined it was safe for an internal fmt buffer to leave the package at some point?
On a related topic, how does the race detector handle/work with compiler optimisations?
I also encountered a similar error; I don't know if it's the same problem.
WARNING: DATA RACE
Read at 0x00c0000814c8 by goroutine 27:
internal/race.ReadRange()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/race/race.go:45 +0xb0
syscall.Write()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/syscall/syscall_unix.go:215 +0x94
internal/poll.(*FD).Write.func1()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:267 +0xf9
internal/poll.ignoringEINTR()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:567 +0x38
internal/poll.(*FD).Write()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:267 +0x20b
net.(*netFD).Write()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/net/fd_posix.go:73 +0x65
net.(*conn).Write()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/net/net.go:194 +0xeb
net.(*TCPConn).Write()
<autogenerated>:1 +0x69
.......... (more frames omitted)
Previous write at 0x00c0000814c8 by goroutine 79:
time.NewTicker()
/home/travis/.gimme/versions/go1.15.8.linux.amd64/src/time/tick.go:35 +0x12a
github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper()
/home/travis/gopath/src/github.com/go-redis/redis/internal/pool/pool.go:449 +0x55
(end of race report)
// pool.go
func (p *ConnPool) reaper(frequency time.Duration) { // line 448
    ticker := time.NewTicker(frequency) // line 449, the line in the report above
    defer ticker.Stop()                 // line 450
    // ...
}
Why do conn.Write([]byte) and time.NewTicker produce a data race?
I don't know the conditions under which they appear, but I often encounter such errors.
@monkey92t if you provide full report and git revisions used, it will be more useful. Currently it's unclear what calls TCPConn.Write.
@dvyukov travis job log: https://travis-ci.org/github/go-redis/redis/jobs/761788172
At first glance it does not look related. So my first guess would be it's just a bug in that code.
I have investigated many times and found no relationship between the two pieces of data. I don't know much about how the race detector works. Why does it report two unrelated accesses as a race?
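As a general illustration of why two seemingly unrelated stacks can appear in one report (this is not an analysis of the go-redis code): the race detector pairs accesses purely by memory address, so code that looks unrelated at the source level can still race if it ends up referencing the same backing memory. For example, under go run -race:

package main

// writeHeader and writeBody look unrelated at the call sites, but both
// slices share the same backing array, so the two stacks get paired in
// a single race report.
func writeHeader(b []byte) {
    b[0] = 'H'
}

func writeBody(b []byte) {
    b[0] = 'B'
}

func main() {
    buf := make([]byte, 8)
    header := buf[:4] // shares buf's backing array
    body := buf[:4]   // overlaps header entirely

    done := make(chan struct{})
    go func() {
        writeHeader(header)
        close(done)
    }()
    writeBody(body) // races with writeHeader on buf[0]
    <-done
}

A report naming two "unrelated" functions therefore always means both touched the same address; the open question in a case like this is how they came to share it.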
@monkey92t - I'd agree with @dvyukov, your situation does not look related to this apparent race condition.
If you think there is an underlying Go issue that's causing your race condition, I'd suggest opening another issue.
@monkey92t The first stack is reading cmd.Args here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/command.go#L59
I could only track creation of cmd.Args up to here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/pipeline.go#L112-L115
The second stack is writing to w.lenBuf here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/internal/proto/writer.go#L55
w.lenBuf is allocated in this package and used in a very restricted way: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/internal/proto/writer.go#L31
Is my reading correct?
Subsequent races look even more insane, with some timer internals. There are also lots of mentions of github.com/go-redis/redis/v8/internal/pool, which uses sync.Pool. Maybe a coincidence, or maybe not... If it's not reproducible (is it?), it will be very hard to debug. If there is suspicion on sync.Pool, I would probably start with some super stress test that would try to expose potential sync.Pool problems as much as possible.
I was skeptical of lenBuf at first, but then I discovered that they are two different new(Struct) allocations. I can't make them appear 100% of the time, but I often encounter them. I can try to delete internal/pool.go:sync.Pool{} to see if it alleviates the problem.
I can try to delete internal/pool.go:sync.Pool{} to see if it alleviates the problem.
Yes, it would be useful. If we could rule out sync.Pool, that would already be something.
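Along the lines suggested above, a stress test to probe sync.Pool might look something like this sketch (the test name, sizes, and iteration counts are made up for illustration): many goroutines churn objects through a pool, each writing its own byte pattern into the slice referenced by the pooled object and reading it back, so any missed synchronization on that memory should surface under go test -race.

package pool_test

import (
    "sync"
    "testing"
)

type item struct {
    buf []byte
}

func TestPoolStress(t *testing.T) {
    p := sync.Pool{New: func() interface{} { return &item{buf: make([]byte, 64)} }}

    var wg sync.WaitGroup
    for g := 0; g < 16; g++ {
        pattern := byte(g) // per-goroutine pattern, captured by value
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 10000; i++ {
                it := p.Get().(*item)
                for j := range it.buf {
                    it.buf[j] = pattern
                }
                for j := range it.buf {
                    if it.buf[j] != pattern {
                        t.Errorf("buffer corrupted: got %d, want %d", it.buf[j], pattern)
                        return
                    }
                }
                p.Put(it)
            }
        }()
    }
    wg.Wait()
}

If the pool's synchronization (including its race-detector annotations) is sound, this should be silent under -race; a spurious report here would be a much smaller reproducer than the full govim or go-redis test suites.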
We've just seen another instance of this (govim commit 2c8a2ba49ed0170bfad80633bd87414b6ec9cffd and gopls/x/tools 8e4f4c86593a for reference):
Write at 0x00c0003e2780 by goroutine 98:
runtime.slicestringcopy()
/go/src/runtime/slice.go:281 +0x0
fmt.(*buffer).writeString()
/go/src/fmt/print.go:82 +0x155
fmt.(*pp).doPrintf()
/go/src/fmt/print.go:991 +0xac
fmt.Sprintf()
/go/src/fmt/print.go:219 +0x73
github.com/govim/govim.(*govimImpl).Logf()
/home/runner/govim/govim.go:944 +0x89
github.com/govim/govim.(*govimImpl).logVimEventf()
/home/runner/govim/govim.go:940 +0x3da
github.com/govim/govim.(*govimImpl).run()
/home/runner/govim/govim.go:431 +0x2b9
github.com/govim/govim.(*govimImpl).Run.func1()
/home/runner/govim/govim.go:403 +0x44
github.com/govim/govim.(*govimImpl).DoProto()
/home/runner/govim/channel_cmds.go:131 +0x83
github.com/govim/govim.(*govimImpl).Run()
/home/runner/govim/govim.go:402 +0x64
github.com/govim/govim/testdriver.(*TestDriver).runGovim()
/home/runner/govim/testdriver/testdriver.go:468 +0xe1
github.com/govim/govim/testdriver.(*TestDriver).runGovim-fm()
/home/runner/govim/testdriver/testdriver.go:466 +0x44
github.com/govim/govim/testdriver.(*TestDriver).tombgo.func1()
/home/runner/govim/testdriver/testdriver.go:410 +0x61
gopkg.in/tomb%2ev2.(*Tomb).run()
/home/runner/gopath/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:163 +0x51
Previous read at 0x00c0003e2780 by goroutine 242:
internal/race.ReadRange()
/go/src/internal/race/race.go:45 +0xb0
syscall.Write()
/go/src/syscall/syscall_unix.go:215 +0x94
internal/poll.(*FD).Write.func1()
/go/src/internal/poll/fd_unix.go:267 +0xf9
internal/poll.ignoringEINTR()
/go/src/internal/poll/fd_unix.go:567 +0x38
internal/poll.(*FD).Write()
/go/src/internal/poll/fd_unix.go:267 +0x20b
os.(*File).write()
/go/src/os/file_posix.go:48 +0xb1
os.(*File).Write()
/go/src/os/file.go:173 +0x6b
os/exec.(*closeOnce).Write()
<autogenerated>:1 +0x84
io.WriteCloser.Write-fm()
/go/src/io/io.go:133 +0x68
github.com/govim/govim/cmd/govim/internal/golang_org_x_tools/fakenet.(*connFeeder).run()
/home/runner/govim/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go:121 +0x1d7
It seems unlikely, but is the race detector getting confused about the fact the []byte was passed over a channel? The call to Write, the ultimate source of the []byte in this case, is still blocked at this point, but the file write (to os.Stdout) is happening on another goroutine.
It seems unlikely, but is the race detector getting confused about the fact the []byte was passed over a channel?
Yes, I agree it seems unlikely. We have unit tests for channels, and I think we would see far more races, and much more reproducibly, if channel synchronization were broken.
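For what it's worth, the flow being questioned boils down to something like the sketch below (a simplification, not the real fakenet code): a []byte is handed over a channel and written to a file on the receiving goroutine. The channel send/receive is a synchronization point the race detector models directly, so this reports nothing under go run -race as long as the sender stops touching the slice after the send.

package main

import "os"

func main() {
    input := make(chan []byte)
    done := make(chan struct{})

    // Consumer: roughly the role of connFeeder.run in fakenet.
    go func() {
        for buf := range input {
            os.Stdout.Write(buf)
        }
        close(done)
    }()

    // Producer: roughly the role of Conn.Write.
    for i := 0; i < 100; i++ {
        buf := []byte("hello\n")
        input <- buf
        // buf is not touched again after the send, so there is no race.
    }
    close(input)
    <-done
}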
@dvyukov I once again encountered a very puzzling data race. The two stacks seem to refer to the same memory address (I don't understand it very well), but they should never produce a data race: one side is an atomic.StoreUint32(), and the racing variable on the other side is a fresh := &struct{} allocation. I don't know if it helps you: https://github.com/go-redis/redis/issues/1657#issuecomment-801783814
Running
func Test44672(t *testing.T) {
    x := []byte("test")
    go func() {
        for {
            Sprintf("%s", x)
        }
    }()
    for {
        x[0] = 0
    }
}
I get a very similar stack trace, so it looks to me as if the formatted value is being modified concurrently in another goroutine.
@AlexanderYastrebov That program has a race condition. One goroutine reads x while another writes to x[0]. The race detector should report an error for that program regardless of this issue.
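For comparison, a variant of that snippet with the accesses synchronized (a mutex here, purely for illustration) is silent under -race, which is the point above: the snippet races by construction, independent of whatever is going on in this issue.

package main

import (
    "fmt"
    "sync"
)

func main() {
    var mu sync.Mutex
    x := []byte("test")

    done := make(chan struct{})
    go func() {
        for i := 0; i < 10000; i++ {
            mu.Lock()
            _ = fmt.Sprintf("%s", x) // the read of x is now ordered against the writes
            mu.Unlock()
        }
        close(done)
    }()

    for i := 0; i < 10000; i++ {
        mu.Lock()
        x[0] = 0
        mu.Unlock()
    }
    <-done
}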
What version of Go are you using (go version)?

Does this issue reproduce with the latest release?
Yes, although I can't reproduce this.
What operating system and processor architecture are you using (go env)?

go env Output

What did you do?
This problem surfaced in a -race build for govim. The raw log is now stored here: https://gist.github.com/myitcv/483887344dd4e48596ed6bb364d77651
The relevant part of the test failure is:
At the time of this data race, govim was running tests against the then tip of x/tools and gopls. The module state at the time of the test is captured in this branch: https://github.com/govim/govim/tree/race_fail
What did you expect to see?
Passing tests.
What did you see instead?
Above.
From the quick analysis I've done thus far, I don't think the problem can be in our code, because the apparent race is happening on the buf field of a fmt.pp.