golang / go


fmt: apparent race on fmt.pp buf #44672

Open myitcv opened 3 years ago

myitcv commented 3 years ago

What version of Go are you using (go version)?

$ go version
go version go1.16 linux/amd64

Does this issue reproduce with the latest release?

Yes, although I can't reproduce it on demand.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/myitcv/.cache/go-build"
GOENV="/home/myitcv/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/myitcv/gostuff/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/myitcv/gostuff"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/myitcv/gos"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/myitcv/gos/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.16"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1396033664=/tmp/go-build -gno-record-gcc-switches"

What did you do?

This problem surfaced in a -race build for govim. The raw log is now stored here:

https://gist.github.com/myitcv/483887344dd4e48596ed6bb364d77651

The relevant part of the test failure is:

WARNING: DATA RACE
Write at 0x00c001f64300 by goroutine 208:
  runtime.slicecopy()
      /go/src/runtime/slice.go:247 +0x0
  fmt.(*buffer).write()
      /go/src/fmt/print.go:78 +0xb1
  fmt.(*pp).Write()
      /go/src/fmt/print.go:187 +0x2f
  text/tabwriter.(*Writer).write0()
      /go/src/text/tabwriter/tabwriter.go:254 +0x72
  text/tabwriter.(*Writer).writeLines()
      /go/src/text/tabwriter/tabwriter.go:321 +0x4a4
  text/tabwriter.(*Writer).format()
      /go/src/text/tabwriter/tabwriter.go:409 +0x570
  text/tabwriter.(*Writer).flushNoDefers()
      /go/src/text/tabwriter/tabwriter.go:520 +0x9c
  text/tabwriter.(*Writer).Write()
      /go/src/text/tabwriter/tabwriter.go:554 +0xb44
  github.com/kr/pretty.writeByte()
      /home/runner/gopath/pkg/mod/github.com/kr/pretty@v0.1.0/formatter.go:319 +0x2684
  github.com/kr/pretty.(*printer).printValue()
      /home/runner/gopath/pkg/mod/github.com/kr/pretty@v0.1.0/formatter.go:169 +0x260a
  github.com/kr/pretty.(*printer).printValue()
      /home/runner/gopath/pkg/mod/github.com/kr/pretty@v0.1.0/formatter.go:249 +0x1ae6
  github.com/kr/pretty.formatter.Format()
      /home/runner/gopath/pkg/mod/github.com/kr/pretty@v0.1.0/formatter.go:57 +0x224
  github.com/kr/pretty.(*formatter).Format()
      <autogenerated>:1 +0xac
  fmt.(*pp).handleMethods()
      /go/src/fmt/print.go:593 +0x1c7
  fmt.(*pp).printArg()
      /go/src/fmt/print.go:709 +0x1b8
  fmt.(*pp).doPrint()
      /go/src/fmt/print.go:1157 +0xe6
  fmt.Sprint()
      /go/src/fmt/print.go:249 +0x64
  github.com/kr/pretty.Sprint()
      /home/runner/gopath/pkg/mod/github.com/kr/pretty@v0.1.0/pretty.go:91 +0x77
  github.com/govim/govim/cmd/govim.loggingGoplsServer.Initialize()
      /home/runner/govim/cmd/govim/gopls_server.go:27 +0x239
  github.com/govim/govim/cmd/govim.(*loggingGoplsServer).Initialize()
      <autogenerated>:1 +0xae
  github.com/govim/govim/cmd/govim.(*govimplugin).startGopls()
      /home/runner/govim/cmd/govim/gopls.go:169 +0x18bc
  github.com/govim/govim/cmd/govim.(*govimplugin).Init()
      /home/runner/govim/cmd/govim/main.go:337 +0x486d
  github.com/govim/govim.(*govimImpl).load.func2()
      /home/runner/govim/govim.go:340 +0x33b
  github.com/govim/govim.(*govimImpl).DoProto()
      /home/runner/govim/channel_cmds.go:131 +0x83
  github.com/govim/govim.(*govimImpl).load()
      /home/runner/govim/govim.go:320 +0x97b
  github.com/govim/govim.(*govimImpl).load-fm()
      /home/runner/govim/govim.go:291 +0x4a
  github.com/govim/govim.(*govimImpl).goHandleShutdown.func1()
      /home/runner/govim/govim.go:283 +0xa9
  gopkg.in/tomb%2ev2.(*Tomb).run()
      /home/runner/gopath/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:163 +0x51

Previous read at 0x00c001f64300 by goroutine 58:
  internal/race.ReadRange()
      /go/src/internal/race/race.go:45 +0xb0
  syscall.Write()
      /go/src/syscall/syscall_unix.go:215 +0x94
  internal/poll.ignoringEINTRIO()
      /go/src/internal/poll/fd_unix.go:581 +0x16e
  internal/poll.(*FD).Write()
      /go/src/internal/poll/fd_unix.go:274 +0x294
  os.(*File).write()
      /go/src/os/file_posix.go:48 +0xb1
  os.(*File).Write()
      /go/src/os/file.go:174 +0x6b
  os/exec.(*closeOnce).Write()
      <autogenerated>:1 +0x84
  io.WriteCloser.Write-fm()
      /go/src/io/io.go:139 +0x79
  github.com/govim/govim/cmd/govim/internal/golang_org_x_tools/fakenet.(*connFeeder).run()
      /home/runner/govim/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go:121 +0x1d4

At the time of this data race, govim was running tests against the then tip of x/tools and gopls. The module state at the time of the test is captured in this branch:

https://github.com/govim/govim/tree/race_fail

What did you expect to see?

Passing tests.

What did you see instead?

Above.

From the quick analysis I've done thus far I don't think the problem can be in our code because the apparent race is happening on the buf field of a fmt.pp.
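
For reference, the buf in question is fmt's internal output buffer. A simplified sketch of the relevant internals (not the verbatim go1.16 source, just the shape that matters here):

type buffer []byte // fmt's internal append-only output buffer

type pp struct {
    buf buffer // the field the race detector flags the write to
    // ... other formatting state elided
}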

robpike commented 3 years ago

The fmt package has been tested and race-tested innumerable times. It could indeed be a bug in the library but I doubt it. It's much likelier to be a problem in your code or environment that reveals itself as this race.

That's not helpful, I know, but we would need a reproducible case to do any serious debugging of such a well-known and well-tested package.

ianlancetaylor commented 3 years ago

CC @dvyukov

I wonder if this could be due to the sync.Pool. The sync.Pool code has special handling to avoid triggering the race detector on values put into the pool and then retrieved from the pool, but it's not clear to me that that code handles a slice referenced by the object in the pool. I wonder if it is possible that the race detector is not detecting the synchronization implied by using the pool.
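
For context, the recycling pattern in question looks roughly like this (a simplified sketch of fmt's pool usage, not the verbatim source):

var ppFree = sync.Pool{
    New: func() interface{} { return new(pp) },
}

func newPrinter() *pp {
    p := ppFree.Get().(*pp)
    // ... reset formatting state ...
    return p
}

func (p *pp) free() {
    // The buffer is truncated rather than reallocated, so its backing array
    // is handed to whichever goroutine next gets this pp from the pool.
    p.buf = p.buf[:0]
    ppFree.Put(p)
}

So the question is whether the race detector sees a happens-before edge for the bytes behind p.buf, and not just for the *pp value itself.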

myitcv commented 3 years ago

The fmt package has been tested and race-tested innumerable times. It could indeed be a bug in the library but I doubt it. It's much likelier to be a problem in your code or environment that reveals itself as this race.

Thanks, @robpike. I used the word "apparent" in the title to try to convey that we're not definitely looking at a race, but rather a situation the race detector thinks is a race (I didn't do that in the body of the description - now fixed). My analysis and assertion that the problem is not in our code might, of course, be wrong.

dvyukov commented 3 years ago

The first stack seems to be on the fmt internal buffer (which is either newly allocated memory or comes from the sync.Pool, right?). But what's strange is that the second stack seems to be on some user slice here: https://github.com/govim/govim/blob/main/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go#L121 which was received from the f.input channel: https://github.com/govim/govim/blob/main/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go#L116

Do internal fmt buffers ever leave the fmt package? I don't see how this could happen. Say, fmt.Sprintf would return a copy of the internal buffer, right?
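
For what it's worth, Sprintf's shape is roughly the following (a simplified sketch, not the verbatim source), so the returned string should indeed be a copy:

func Sprintf(format string, a ...interface{}) string {
    p := newPrinter()
    p.doPrintf(format, a)
    s := string(p.buf) // the string conversion copies the bytes out of buf
    p.free()           // buf's backing array goes back into the pool
    return s
}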

Is this race reliably reproducible? Does it always look the same way with the same stacks?

myitcv commented 3 years ago

Is this race reliably reproducible? Does it always look the same way with the same stacks?

It's the first time we've seen it, unfortunately. We will of course add more details here if we find any.

myitcv commented 3 years ago

But what's strange is that the second stack seems to be on some user slice here

I'm not sure I can definitively say it is a user slice.

Do internal fmt buffers ever leave fmt package? I don't see how this can happen. Say, fmt.Sprintf would return a copy of the internal buffer, right?

Slightly stabbing in the dark here (I'm very much out of my depth), but is it possible that a compiler optimisation determined at some point that it was safe for an internal fmt buffer to leave the package?

On a related topic, how does the race detector interact with compiler optimisations?

monkey92t commented 3 years ago

I also encountered a similar error; I don't know if it's the same problem.

WARNING: DATA RACE
Read at 0x00c0000814c8 by goroutine 27:
  internal/race.ReadRange()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/race/race.go:45 +0xb0
  syscall.Write()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/syscall/syscall_unix.go:215 +0x94
  internal/poll.(*FD).Write.func1()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:267 +0xf9
  internal/poll.ignoringEINTR()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:567 +0x38
  internal/poll.(*FD).Write()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/internal/poll/fd_unix.go:267 +0x20b
  net.(*netFD).Write()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/net/fd_posix.go:73 +0x65
  net.(*conn).Write()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/net/net.go:194 +0xeb
  net.(*TCPConn).Write()
      <autogenerated>:1 +0x69
[... more frames omitted]

Previous write at 0x00c0000814c8 by goroutine 79:
  time.NewTicker()
      /home/travis/.gimme/versions/go1.15.8.linux.amd64/src/time/tick.go:35 +0x12a
  github.com/go-redis/redis/v8/internal/pool.(*ConnPool).reaper()
      /home/travis/gopath/src/github.com/go-redis/redis/internal/pool/pool.go:449 +0x55
[no more frames]

// pool.go, lines 448-450:
func (p *ConnPool) reaper(frequency time.Duration) {
    ticker := time.NewTicker(frequency)
    defer ticker.Stop()
    // ...
}

Why do conn.Write([]byte) and time.NewTicker produce a data race? I don't know the conditions under which these errors appear, but I encounter them often.

dvyukov commented 3 years ago

@monkey92t if you provide full report and git revisions used, it will be more useful. Currently it's unclear what calls TCPConn.Write.

monkey92t commented 3 years ago

@monkey92t if you provide full report and git revisions used, it will be more useful. Currently it's unclear what calls TCPConn.Write.

@dvyukov travis job log: https://travis-ci.org/github/go-redis/redis/jobs/761788172

dvyukov commented 3 years ago

@monkey92t if you provide full report and git revisions used, it will be more useful. Currently it's unclear what calls TCPConn.Write.

@dvyukov travis job log: https://travis-ci.org/github/go-redis/redis/jobs/761788172

At first glance it does not look related. So my first guess would be it's just a bug in that code.

monkey92t commented 3 years ago

@monkey92t if you provide full report and git revisions used, it will be more useful. Currently it's unclear what calls TCPConn.Write.

@dvyukov travis job log: https://travis-ci.org/github/go-redis/redis/jobs/761788172

At first glance it does not look related. So my first guess would be it's just a bug in that code.

I have investigated many times and found no relationship between the two pieces of data. I don't know much about how the race detector works. Why does it report two unrelated accesses as a race?

myitcv commented 3 years ago

@monkey92t - I'd agree with @dvyukov: your situation does not look related to this apparent race condition.

If you think there is an underlying Go issue that's causing your race condition, I'd suggest opening another issue.

dvyukov commented 3 years ago

@monkey92t The first stack is reading cmd.Args here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/command.go#L59

I could only track creation of cmd.Args up to here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/pipeline.go#L112-L115

The second stack is writing to w.lenBuf here: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/internal/proto/writer.go#L55

w.lenBuf is allocated in this package and used in a very restricted way: https://github.com/go-redis/redis/blob/f3a31a3e2c9bb40351bef1d02ee56069eff21d79/internal/proto/writer.go#L31

Is my reading correct?

dvyukov commented 3 years ago

Subsequent races look even more insane, involving some timer internals. There are also lots of mentions of github.com/go-redis/redis/v8/internal/pool, which uses sync.Pool. Maybe a coincidence, or maybe not... If it's not reproducible (is it?), it will be very hard to debug. If sync.Pool is under suspicion, I would probably start with a stress test that tries to expose potential sync.Pool problems as aggressively as possible.
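
Something along these lines, for example (a minimal sketch; the package, type, and test names are made up for illustration, and it only mimics the suspect pattern of a slice hanging off an object recycled through a sync.Pool):

package poolstress

import (
    "sync"
    "testing"
)

// bufHolder mimics fmt's pp: a pooled object that owns a byte slice.
type bufHolder struct{ buf []byte }

var holderPool = sync.Pool{
    New: func() interface{} { return &bufHolder{buf: make([]byte, 1024)} },
}

// Hypothetical stress test (run with: go test -race -run TestPoolSliceStress).
// Each goroutine takes a holder from the pool, rewrites its slice, and puts it
// back, so backing arrays constantly migrate between goroutines. If the race
// detector failed to see the synchronization implied by the pool, this should
// produce spurious race reports on the slice contents.
func TestPoolSliceStress(t *testing.T) {
    var wg sync.WaitGroup
    for g := 0; g < 8; g++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 100000; i++ {
                h := holderPool.Get().(*bufHolder)
                for j := range h.buf {
                    h.buf[j] = byte(i)
                }
                holderPool.Put(h)
            }
        }()
    }
    wg.Wait()
}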

monkey92t commented 3 years ago

I was suspicious of lenBuf at first, but then I discovered that the two accesses involve different new(Struct) allocations. I can't make the races appear 100% of the time, but I encounter them often. I can try removing the sync.Pool{} in internal/pool.go to see if that alleviates the problem.

dvyukov commented 3 years ago

I can try removing the sync.Pool{} in internal/pool.go to see if that alleviates the problem.

Yes, that would be useful. If we could rule out sync.Pool, that would already be something.

myitcv commented 3 years ago

We've just seen another instance of this (govim commit 2c8a2ba49ed0170bfad80633bd87414b6ec9cffd and gopls/x/tools 8e4f4c86593a for reference):

Write at 0x00c0003e2780 by goroutine 98:
  runtime.slicestringcopy()
      /go/src/runtime/slice.go:281 +0x0
  fmt.(*buffer).writeString()
      /go/src/fmt/print.go:82 +0x155
  fmt.(*pp).doPrintf()
      /go/src/fmt/print.go:991 +0xac
  fmt.Sprintf()
      /go/src/fmt/print.go:219 +0x73
  github.com/govim/govim.(*govimImpl).Logf()
      /home/runner/govim/govim.go:944 +0x89
  github.com/govim/govim.(*govimImpl).logVimEventf()
      /home/runner/govim/govim.go:940 +0x3da
  github.com/govim/govim.(*govimImpl).run()
      /home/runner/govim/govim.go:431 +0x2b9
  github.com/govim/govim.(*govimImpl).Run.func1()
      /home/runner/govim/govim.go:403 +0x44
  github.com/govim/govim.(*govimImpl).DoProto()
      /home/runner/govim/channel_cmds.go:131 +0x83
  github.com/govim/govim.(*govimImpl).Run()
      /home/runner/govim/govim.go:402 +0x64
  github.com/govim/govim/testdriver.(*TestDriver).runGovim()
      /home/runner/govim/testdriver/testdriver.go:468 +0xe1
  github.com/govim/govim/testdriver.(*TestDriver).runGovim-fm()
      /home/runner/govim/testdriver/testdriver.go:466 +0x44
  github.com/govim/govim/testdriver.(*TestDriver).tombgo.func1()
      /home/runner/govim/testdriver/testdriver.go:410 +0x61
  gopkg.in/tomb%2ev2.(*Tomb).run()
      /home/runner/gopath/pkg/mod/gopkg.in/tomb.v2@v2.0.0-20161208151619-d5d1b5820637/tomb.go:163 +0x51

Previous read at 0x00c0003e2780 by goroutine 242:
  internal/race.ReadRange()
      /go/src/internal/race/race.go:45 +0xb0
  syscall.Write()
      /go/src/syscall/syscall_unix.go:215 +0x94
  internal/poll.(*FD).Write.func1()
      /go/src/internal/poll/fd_unix.go:267 +0xf9
  internal/poll.ignoringEINTR()
      /go/src/internal/poll/fd_unix.go:567 +0x38
  internal/poll.(*FD).Write()
      /go/src/internal/poll/fd_unix.go:267 +0x20b
  os.(*File).write()
      /go/src/os/file_posix.go:48 +0xb1
  os.(*File).Write()
      /go/src/os/file.go:173 +0x6b
  os/exec.(*closeOnce).Write()
      <autogenerated>:1 +0x84
  io.WriteCloser.Write-fm()
      /go/src/io/io.go:133 +0x68
  github.com/govim/govim/cmd/govim/internal/golang_org_x_tools/fakenet.(*connFeeder).run()
      /home/runner/govim/cmd/govim/internal/golang_org_x_tools/fakenet/conn.go:121 +0x1d7

It seems unlikely, but is the race detector getting confused by the fact that the []byte was passed over a channel? The call to Write, the ultimate source of the []byte in this case, is still blocked at this point, but the file write (to os.Stdout) is happening on another goroutine.
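
The handoff in question boils down to something like this (a minimal sketch with made-up names, not the actual fakenet code; assumes "fmt" and "os" are imported):

// Producer side: format a message and hand the resulting slice to the
// goroutine that owns the file descriptor.
func send(input chan<- []byte, msg string) {
    buf := []byte(fmt.Sprintf("log: %s", msg)) // fresh allocation per call
    input <- buf
}

// Consumer side: a single goroutine drains the channel and performs the
// blocking file write (the syscall.Write in the report above).
func feed(input <-chan []byte, out *os.File) {
    for buf := range input {
        out.Write(buf)
    }
}

The channel send should happen-before the corresponding receive, so the consumer's read of buf ought to be ordered after the producer's writes; on its own this pattern should be race-free.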

dvyukov commented 3 years ago

It seems unlikely, but is the race detector getting confused about the fact the []byte was passed over a channel?

Yes, I agree it seems unlikely. We have unit tests for channels, and I think we would see far more races, and much more reproducible ones, if channel synchronization were broken.

monkey92t commented 3 years ago

@dvyukov I once again encountered a very puzzling data race. It seems that the two accesses are to the same memory address (I don't understand this very well), but they should never produce a data race.

The race is reported between atomic.StoreUint32() and variable := &struct{}.

I don’t know if it helps you: https://github.com/go-redis/redis/issues/1657#issuecomment-801783814

AlexanderYastrebov commented 3 years ago

Running

func Test44672(t *testing.T) {
    x := []byte("test")

    // One goroutine repeatedly formats x...
    go func() {
        for {
            Sprintf("%s", x)
        }
    }()

    // ...while the main goroutine repeatedly writes to it.
    for {
        x[0] = 0
    }
}

I get a very similar stack trace, so it looks to me as if the formatted value is being modified by another goroutine.

ianlancetaylor commented 3 years ago

@AlexanderYastrebov That program has a race condition. One goroutine reads x while another writes to x[0]. The race detector should report an error for that program regardless of this issue.