golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

runtime: ThreadSanitizer CHECK failed #27660

Closed: benesch closed this issue 5 years ago

benesch commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version go1.11 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/benesch/.cache/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/benesch/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build445193843=/tmp/go-build -gno-record-gcc-switches"

What did you do?

go get -u github.com/cockroachdb/cockroach
cd $(go env GOPATH)/src/github.com/cockroachdb/cockroach
make testrace PKG=./pkg/sql/logictest TESTS=TestLogic

This builds some C and C++ dependencies and ultimately runs:

go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 

What did you expect to see?

Either a successful test run or an actual race.

What did you see instead?

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:380 "((v)) > ((0))" (0x0, 0x0)

I tried to dig into this, but gotsan.cc is apparently generated deep inside the build process for the race detector and it wasn't clear to me how to generate a gotsan.cc that matched my Go version.

Totally possible that this is our (CockroachDB's) bug, but there's not a lot here for me to go on, so I figured I'd file here and see if someone with more expertise in the race detector had any insight.

benesch commented 6 years ago

And here's a different tsan failure I just observed with the same command:

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:828 "((get_block(bi))) == ((0))" (0x603f, 0x0)
benesch commented 6 years ago

And another slightly different presentation:

0x00000000026cb4e6 in __tsan::SyncClock::Unshare(__tsan::DenseSlabAllocCache*) ()
(gdb) bt
#0  0x00000000026cb4e6 in __tsan::SyncClock::Unshare(__tsan::DenseSlabAllocCache*) ()
#1  0x00000000026cb773 in __tsan::ThreadClock::ReleaseStore(__tsan::DenseSlabAllocCache*, __tsan::SyncClock*) ()
#2  0x00000000026dd43a in __tsan_go_atomic32_compare_exchange ()
#3  0x000000000062cfc3 in racecall () at /usr/local/go/src/runtime/race_amd64.s:381
#4  0x00007ffff6a906d0 in ?? ()
ianlancetaylor commented 6 years ago

CC @dvyukov

dvyukov commented 6 years ago

These instructions fail for me:

cockroach$ go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 
# github.com/cockroachdb/cockroach/pkg/sql/lex
pkg/sql/lex/predicates.go:60:14: undefined: reservedKeywords
FAIL    github.com/cockroachdb/cockroach/pkg/sql/logictest [build failed]

This is on commit:

cockroach$ git show HEAD
commit 418b9169b547069cd43252055196204af8f18040 (HEAD -> master, origin/staging, origin/master, origin/HEAD)
Merge: a9c7295dd6 5504e3f9ae
Author: craig[bot] <bors@cockroachlabs.com>
Date:   Fri Sep 14 01:24:32 2018 +0000

    Merge #29163
benesch commented 6 years ago

Yes, I’m afraid you’ll need to run the make testrace command I posted at least once to generate some Go code and build some C/C++ dependencies. After that, you can run just the go test command.

I apologize that the surface area of this reproduction is so large, but I really don’t know how to whittle it down. We run the race detector on every other CockroachDB package with no problems. Let me know if there’s anything I can do to assist in tracking this down!

dvyukov commented 6 years ago

I did that. The error I posted happens during go test.

benesch commented 6 years ago

Are you sure the make testrace command succeeded? It spews a lot of output from parallel commands so the error can really get buried. I just created a fresh VM and verified that the instructions I posted work with commit 418b9169, so I'm not sure what else it could be. I can give you access to this VM if that would be helpful.

dvyukov commented 6 years ago

Right, it failed somewhere in the beginning:

cockroach$ IGNORE_GOVERS=1 make testrace PKG=./pkg/sql/logictest TESTS=TestLogic IGNORE_GOVERS=1
Running make with -j4
GOPATH set to /go
Detected change in build system. Rebooting Make.
Running make with -j4
GOPATH set to /go
cd /go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc && autoconf
rm -rf /go/native/x86_64-linux-gnu/protobuf
bash: autoconf: command not found
Makefile:511: recipe for target '/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure' failed
make: *** [/go/src/github.com/cockroachdb/cockroach/c-deps/jemalloc/configure] Error 127
make: *** Waiting for unfinished jobs....
go install -v uptodate
bin/prereqs ./pkg/cmd/uptodate > bin/uptodate.d.tmp
mkdir -p /go/native/x86_64-linux-gnu/protobuf
cd /go/native/x86_64-linux-gnu/protobuf && cmake -DCMAKE_TARGET_MESSAGES=OFF   -Dprotobuf_BUILD_TESTS=OFF /go/src/github.com/cockroachdb/cockroach/c-deps/protobuf/cmake \
  -DCMAKE_BUILD_TYPE=Release
mv -f bin/uptodate.d.tmp bin/uptodate.d
github.com/cockroachdb/cockroach/vendor/github.com/MichaelTJones/walk
-- The C compiler identification is GNU 7.3.0
github.com/cockroachdb/cockroach/vendor/github.com/spf13/pflag
github.com/cockroachdb/cockroach/pkg/cmd/uptodate
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Looking for pthread_create
-- Looking for pthread_create - not found
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found ZLIB: /usr/lib/x86_64-linux-gnu/libz.so (found version "1.2.8") 
-- Configuring done
-- Generating done
-- Build files have been written to: /go/native/x86_64-linux-gnu/protobuf
benesch commented 6 years ago

Oh, yeah, you’ll need autoconf and CMake installed. It looks like you have CMake but not autoconf.

dvyukov commented 6 years ago

How long does it generally take to crash? I've run it twice and both times got:

cockroach$ go test -race -tags ' make x86_64_linux_gnu' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v2.2.0-alpha.00000000-793-g33c7d27d82-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=33c7d27d8216b543ac77f0fe39d440ebebfa9e70" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-linux-gnu"  ' -run "TestLogic"  -timeout 25m ./pkg/sql/logictest 
I180914 14:16:05.303611 1 rand.go:75  Random seed: -5069250136795247745
E180914 14:19:23.006360 77 sql/logictest/logic.go:2228  

testdata/logic_test/crdb_internal:295: SELECT node_id, network, regexp_replace(address, '\d+$', '<port>') as address, attrs, locality, regexp_replace(server_version, '^\d+\.\d+(-\d+)?$', '<server_version>') as server_version, regexp_replace(go_version, '^go.+$', '<go_version>') as go_version
FROM crdb_internal.kv_node_status WHERE node_id = 1
expected:
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  <go_version>

but found (query options: "colnames") :
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  devel +9f59918cae Thu Sep 13 21:34:56 2018 +0000

E180914 14:19:23.006776 77 sql/logictest/logic.go:2259  testdata/logic_test/crdb_internal:302: too many errors encountered, skipping the rest of the input
E180914 14:19:32.896862 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:19:34.520305 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:19:35.740206 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:19:36.830273 50624 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:49210,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
E180914 14:24:01.186527 68088 storage/consistency_queue.go:128  [n1,consistencyChecker,s1,r1/1:/M{in-ax}] computing own checksum: could not dial node ID 1: initial connection heartbeat failed: rpc error: code = Canceled desc = context canceled
E180914 14:24:01.186855 68088 storage/queue.go:788  [n1,consistencyChecker,s1,r1/1:/M{in-ax}] computing own checksum: could not dial node ID 1: initial connection heartbeat failed: rpc error: code = Canceled desc = context canceled
E180914 14:26:08.285725 96089 storage/queue.go:788  [n1,raftlog,s1,r1/1:/M{in-ax}] result is ambiguous (server shutdown)
E180914 14:26:15.409641 64257 sql/logictest/logic.go:2228  

testdata/logic_test/crdb_internal:295: SELECT node_id, network, regexp_replace(address, '\d+$', '<port>') as address, attrs, locality, regexp_replace(server_version, '^\d+\.\d+(-\d+)?$', '<server_version>') as server_version, regexp_replace(go_version, '^go.+$', '<go_version>') as go_version
FROM crdb_internal.kv_node_status WHERE node_id = 1
expected:
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  <go_version>

but found (query options: "colnames") :
    node_id  network  address           attrs  locality            server_version    go_version
    1        tcp      127.0.0.1:<port>  []     {"region": "test"}  <server_version>  devel +9f59918cae Thu Sep 13 21:34:56 2018 +0000

E180914 14:26:15.410048 64257 sql/logictest/logic.go:2259  testdata/logic_test/crdb_internal:302: too many errors encountered, skipping the rest of the input
E180914 14:26:27.787026 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:26:29.171953 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:26:30.247536 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:26:31.426094 99569 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:36516,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
E180914 14:37:20.722139 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: column "d" does not exist
E180914 14:37:22.977964 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: error type checking constant value: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:37:26.350066 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: could not parse "a" as type int: strconv.ParseInt: parsing "a": invalid syntax
E180914 14:37:28.390702 157227 sql/distsqlrun/server.go:440  [n1,client=127.0.0.1:35082,user=root,scExec] error setting up flow: unsupported binary operator: <int> / <int> (desired <int>)
panic: test timed out after 25m0s

goroutine 175652 [running]:
testing.(*M).startAlarm.func1()
    /src/go/src/testing/testing.go:1303 +0x13a
created by time.goFunc
    /src/go/src/time/sleep.go:172 +0x52

goroutine 1 [chan receive, 25 minutes]:
testing.(*T).Run(0xc00034e300, 0x2ee939f, 0x9, 0x3010de0, 0xc0006b7bd0)
    /src/go/src/testing/testing.go:886 +0x689
testing.runTests.func1(0xc00034e300)
    /src/go/src/testing/testing.go:1126 +0xa9
testing.tRunner(0xc00034e300, 0xc0006b7d40)
    /src/go/src/testing/testing.go:834 +0x163
testing.runTests(0xc000299ba0, 0x4f9b7a0, 0x5, 0x5, 0x0)
    /src/go/src/testing/testing.go:1124 +0x4ef
testing.(*M).Run(0xc0004c8580, 0x0)
    /src/go/src/testing/testing.go:1041 +0x2ef
github.com/cockroachdb/cockroach/pkg/sql/logictest.TestMain(0xc0004c8580)
    /go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/main_test.go:37 +0x15b
main.main()
    _testmain.go:48 +0x222
benesch commented 6 years ago

It's been reproducing for me in less than a minute. Oddly enough, it's failing to reproduce for me on the latest master. Could you try commit 33c7d27d8216b543ac77f0fe39d440ebebfa9e70? And just to double check, you are running go1.11, right? I'm also having trouble reproducing on go1.10.3.

In the meantime I'm going to bisect to see if there's something obviously wrong that we fixed.

benesch commented 6 years ago

Bisecting proved challenging thanks to some actual races that were introduced and fixed within the range of commits I was bisecting. Regardless, I can semi-reliably repro on 33c7d27d8216b543ac77f0fe39d440ebebfa9e70 on a fresh Ubuntu machine with Go 1.11.

The problem seems to occur early or not at all. In five runs, this is what I observed:

  1. Test timed out after 2m.
  2. ThreadSanitizer CHECK failed: ./gotsan.cc:746 "((v)) > ((0))" (0x0, 0x0) after 54.42s.
  3. Test timed out after 2m.
  4. ThreadSanitizer CHECK failed: ./gotsan.cc:380 "((v)) > ((0))" (0x0, 0x0) after 113.751s.
  5. Test timed out after 2m.

I could have sworn this reproduced more reliably when I originally filed the bug. Still, I'm hopeful that you might be able to reproduce if you run a few more trials with a short timeout. The magic incantation to thread in a timeout via the Makefile, if that's easier, is:

$ make testrace PKG=./pkg/sql/logictest TESTS=TestLogic TESTTIMEOUT=2m
dvyukov commented 6 years ago

I have no luck reproducing this so far:

logictest$ go version
go version go1.11 linux/amd64
logictest$ stress -failure="ThreadSanitizer" -p=10 -timeout=2m ./logictest.test -test.run=TestLogic -test.timeout=2m
...
70 runs so far, 0 failures

p.s. stress is: https://github.com/golang/tools/blob/master/cmd/stress/stress.go
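
For reference, the logictest.test binary above is just the compiled test package (built with go test -c); a sketch of the setup, assuming the -tags/-ldflags from the earlier go test invocation carry over:

$ go get golang.org/x/tools/cmd/stress                                   # installs the stress binary
$ go test -race -c -tags ' make x86_64_linux_gnu' ./pkg/sql/logictest    # emits logictest.test
$ stress -failure="ThreadSanitizer" -p=10 -timeout=2m ./logictest.test -test.run=TestLogic -test.timeout=2m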

dvyukov commented 6 years ago

My bets would be on:

  1. Memory corruption in native code (there is some, right?). It can make sense to test with ASan/TSan/MSan.
  2. Kernel bug.
benesch commented 6 years ago

Oh, cool, I didn't know about the -failure flag to stress.

Anyway, thanks again for looking into this. We upgraded our CI environment to go1.11 today and various other tests started producing assertion failures like this under the race detector. We also started seeing outright segfaults:

unexpected fault address 0x27400
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x27400 pc=0x2bf0df6]

goroutine 1487 [running]:
runtime: unexpected return pc for runtime.throw called from 0x361d16b
stack: frame={sp:0x7fc94e9fe990, fp:0x7fc94e9fe9c0} stack=[0xc000ee3000,0xc000ee3800)

I bisected this down to a commit in Go itself: 8994444491d984491d8070ed30aeffc1abe344a2, titled "runtime/race: update most syso files to compiler-rt". That sure looks like our culprit, though I'm willing to believe we're still doing something wrong (e.g., corrupting memory, like you suggest) and the update to the race detector just happened to tickle this preexisting condition.

I'm going to try to get somewhere by bisecting compiler-rt itself. I may need your help in producing those syso files.

benesch commented 6 years ago

Whew, that was an exercise in patience. Bisecting was complicated by the fact that a number of revisions of compiler-rt between 68e15324 and fe2c72c5 did not compile with my GCC toolchain. Regardless, I managed to pinpoint the problem to this commit: https://github.com/llvm-mirror/compiler-rt/commit/12d16901ebba5bc7185298e9efa3c4d265fa52ce. It's still unclear what the actual problem is; I'm not in a position to understand that tsan code, I'm afraid.

By the way, I'm now using this to repro:

make stressrace PKG=./pkg/ccl/backupccl TESTS=TestBackupRestoreResume STRESSFLAGS="-p 8" IGNORE_GOVERS=1 GO=~/dev/go/bin/go

The -p 8 avoids overloading my 24-CPU VM; each test execution seems to require about 3 CPUs. The IGNORE_GOVERS=1 GO=~/dev/go/bin/go is only necessary because I'm pointing at a custom version of Go. I've found that this reliably repros in under 5m.

I am once again feeling that this is more likely to be a bug in tsan than in CockroachDB. @dvyukov let me know if you disagree. My offer of providing you a VM that can reproduce this issue out of the box still stands.

dvyukov commented 6 years ago

I am once again feeling that this is more likely to be a bug in tsan than in CockroachDB.

Maybe. Hard to say. What's strange is that nobody reported any corruptions before and that I can't reproduce this on my machine. Can anybody else reproduce this?

benesch commented 6 years ago

Can anybody else reproduce this?

I'm going to check if any of my fellow Linux-using engineers can reproduce this issue on their laptop. Everyone else uses the same VM configuration (Ubuntu 18.04 on Google Compute Engine), so it's not all that interesting that they can reproduce. Here's the output of uname -r:

Linux gceworker-benesch 4.15.0-1018-gcp #19-Ubuntu SMP Thu Aug 16 13:38:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Are you willing to share the output of uname -r on your machine?

benesch commented 6 years ago

Ok, @mberhault was able to reproduce on his Linux desktop running Ubuntu 16.04 (kernel 4.4.0-135-generic) using the make stressrace PKG=./pkg/ccl/backupccl... command with go1.11 and the latest CockroachDB master.

dvyukov commented 6 years ago

Running make stressrace PKG=./pkg/ccl/backupccl... 3 times I got:

unexpected fault address 0x2600
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2600 pc=0x2bff9a6]

goroutine 5587 [running]:
runtime: unexpected return pc for runtime.throw called from 0x360c2ed
stack: frame={sp:0x7f7e059ec9b0, fp:0x7f7e059ec9e0} stack=[0xc0048c4000,0xc0048cc000)

runtime.throw(0x3, 0x3)
    src/runtime/panic.go:608 +0x72 fp=0x7f7e059ec9e0 sp=0x7f7e059ec9b0 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    gopath/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xc4

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000244200, 0x36521a6, 0x23, 0x374daf0, 0xc0007c1b01)
    src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000244200)
    src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000244200, 0xc0007c1d40)
    src/testing/testing.go:827 +0x163
testing.runTests(0xc0004de4e0, 0x5bacac0, 0x2c, 0x2c, 0x0)
    src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc0007de000, 0x0)
    src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc0007de000)
    gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
    _testmain.go:138 +0x222

unexpected fault address 0xe400
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0xe400 pc=0x2bff519]

goroutine 7540 [running]:
runtime: unexpected return pc for runtime.throw called from 0x4
stack: frame={sp:0x7f46ad9fe930, fp:0x7f46ad9fe960} stack=[0xc0036da000,0xc0036fa000)

runtime.throw(0x4, 0x7f46ad9fe9a0)
    src/runtime/panic.go:608 +0x72 fp=0x7f46ad9fe960 sp=0x7f46ad9fe930 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).serveImpl
    gopath/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:267 +0x13f4

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000232100, 0x36521a6, 0x23, 0x374daf0, 0xc0007f1b01)
    src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000232100)
    src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000232100, 0xc0007f1d40)
    src/testing/testing.go:827 +0x163
testing.runTests(0xc0002af640, 0x5bacac0, 0x2c, 0x2c, 0x0)
    src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc00058e000, 0x0)
    src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc00058e000)
    gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
    _testmain.go:138 +0x222

unexpected fault address 0x26a00
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x26a00 pc=0x2bff519]

goroutine 6520 [running]:
runtime: unexpected return pc for runtime.throw called from 0x4
stack: frame={sp:0x7fec833fe930, fp:0x7fec833fe960} stack=[0xc002736000,0xc002738000)

runtime.throw(0x4, 0x7fec833fe9a0)
    src/runtime/panic.go:608 +0x72 fp=0x7fec833fe960 sp=0x7fec833fe930 pc=0x697d72
created by github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*Server).handleRawConn
    gopath/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/server.go:638 +0x7c2

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000754000, 0x36521a6, 0x23, 0x374daf0, 0xc0007cdb01)
    src/testing/testing.go:879 +0x689
testing.runTests.func1(0xc000754000)
    src/testing/testing.go:1119 +0xa9
testing.tRunner(0xc000754000, 0xc0007cdd40)
    src/testing/testing.go:827 +0x163
testing.runTests(0xc00031b840, 0x5bacac0, 0x2c, 0x2c, 0x0)
    src/testing/testing.go:1117 +0x4ef
testing.(*M).Run(0xc00074a100, 0x0)
    src/testing/testing.go:1034 +0x2ef
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestMain(0xc00074a100)
    gopath/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/main_test.go:31 +0x17d
main.main()
    _testmain.go:138 +0x222
dvyukov commented 6 years ago

@benesch do you know if the native part of cockroachdb was tested with asan/tsan/msan? If not, that would be the first thing I would do.

benesch commented 6 years ago

Ok, great, you can reproduce!

Do you know how to read that stack trace you posted?

unexpected fault address 0x2600
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x2600 pc=0x2bff9a6]

goroutine 5587 [running]:
runtime: unexpected return pc for runtime.throw called from 0x360c2ed
stack: frame={sp:0x7f7e059ec9b0, fp:0x7f7e059ec9e0} stack=[0xc0048c4000,0xc0048cc000)

runtime.throw(0x3, 0x3)
    src/runtime/panic.go:608 +0x72 fp=0x7f7e059ec9e0 sp=0x7f7e059ec9b0 pc=0x697d72
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
    gopath/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:192 +0xc4

I don't understand how we go from stopper.go:192 (which launches a goroutine) to runtime.throw. (Other segfaults I've triggered in Go are understandable, by contrast.) I suppose the answer is probably memory corruption?

@benesch do you know if the native part of cockroachdb was tested with asan/tsan/msan? If not, that would be the first thing I would do.

It's not tested with any of the sanitizers, unfortunately. Most of our serious testing is done in Go, so we exercise the C/C++ code via cgo. That makes it tricky to use any of the sanitizers. I've tried to get msan to work but it triggers a number of false positives due to structs with padding: https://github.com/golang/go/issues/26167. I suppose it should be possible to use asan and tsan on our native code by plumbing in -fsanitize during compilation. I'll see if I can find some time to do that today.

That said, there are only four C/C++ libraries that we link: snappy, protobuf, RocksDB, and jemalloc. I've verified that this reproduces without jemalloc, so I think we can rule that one out. Snappy and protobuf are both used extensively at Google, so I'd be surprised if they were to blame. RocksDB is extremely complicated, but they test upstream with asan/tsan/ubsan. So that leaves our first-party C/C++ that glues everything together. It's certainly possible that we've had a lurking memory bug there, but a) the code's not that complicated, and b) I feel like we'd have seen it crop up somewhere before. This is all speculation, of course.
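
For the cgo-compiled glue code, the plumbing could look something like this (a sketch, untested; the cmake-built libraries, i.e. snappy, protobuf, and RocksDB, would need the same flags threaded through their own build systems):

$ CGO_CFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer" \
  CGO_CXXFLAGS="-g -O1 -fsanitize=address -fno-omit-frame-pointer" \
  CGO_LDFLAGS="-fsanitize=address" \
  go test -run TestLogic ./pkg/sql/logictest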

leventeliu commented 5 years ago

Hi, @dvyukov. Any update on this issue? We got a similar problem in our project after we updated Go to version 1.11, and I just confirmed that this issue exists in 1.11.2 too.

dvyukov commented 5 years ago

@leventeliu an update on what, exactly? It looks like the app has memory corruption. I don't plan to do anything with that.

I noticed that I missed a question:

Do you know how to read that stack trace you posted?

It says that there was a paging fault when some code tried to access address 0x2600. The bad access happened in non-Go code, so the Go unwinder could not show a trace (maybe the code is not traceable at all). It also says that the goroutine where the crash happened was created by stop.(*Stopper).RunWorker. That's pretty much it.
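
For illustration, a tiny cgo program of this shape (hypothetical, not from CockroachDB) should produce the same style of report, since the fault happens in C code where the unwinder has nothing to walk:

package main

/*
void boom(void) {
    *(volatile int *)0x2600 = 1; // wild write in non-Go code
}
*/
import "C"

func main() {
    done := make(chan struct{})
    go func() { // this is the site the report names: "created by main.main"
        C.boom() // SIGSEGV fires here, outside any Go frames
        close(done)
    }()
    <-done
}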

benesch commented 5 years ago

It says that there was a paging fault when some code tried to access address 0x2600. The bad access happened in non-Go code, so the Go unwinder could not show a trace (maybe the code is not traceable at all). It also says that the goroutine where the crash happened was created by stop.(*Stopper).RunWorker. That's pretty much it.

Oh, I see. I was misreading the line under "created by" as part of the fault backtrace, not as the backtrace of the creation of the goroutine. Thanks.

Any update on this issue? We got a similar problem in our project after we updated Go to version 1.11, and I just confirmed that this issue exists in 1.11.2 too.

@leventeliu do you see exactly the same sorts of internal tsan assertion failures? And is your project public? If so I'd love to take a look. CockroachDB is large enough that narrowing this down has proven extremely difficult.

Specifically, if the failures in your project disappear if you build a version of the race detector with https://github.com/llvm-mirror/compiler-rt/commit/12d16901ebba5bc7185298e9efa3c4d265fa52ce reverted, that's additional support for the theory that the problem is in tsan. If not, then it's likely that your project and CockroachDB both have rare memory corruption issues.

Building Go with a custom race detector is non-trivial, so I'm happy to do the experiment for you if your sources are public.

benesch commented 5 years ago

I was confused about why that stack trace from a few comments ago was so indecipherable, given that the PC implicated in the segfault was clearly in ThreadSanitizer code. After some digging I realized that Go butchers the stack on a segfault: https://github.com/golang/go/blob/9f78aa77443389d61a3ef910a1c3d4628999d82c/src/runtime/signal_unix.go#L358-L364

I guess this is nice when it works, but it's really getting in the way here. I built a custom version of Go with this feature disabled:

diff --git a/src/runtime/sigtab_linux_generic.go b/src/runtime/sigtab_linux_generic.go
index b26040b803..d02225201b 100644
--- a/src/runtime/sigtab_linux_generic.go
+++ b/src/runtime/sigtab_linux_generic.go
@@ -22,7 +22,7 @@ var sigtable = [...]sigTabT{
        /* 8 */ {_SigPanic + _SigUnblock, "SIGFPE: floating-point exception"},
        /* 9 */ {0, "SIGKILL: kill"},
        /* 10 */ {_SigNotify, "SIGUSR1: user-defined signal 1"},
-       /* 11 */ {_SigPanic + _SigUnblock, "SIGSEGV: segmentation violation"},
+       /* 11 */ {0, "SIGSEGV: segmentation violation"},
        /* 12 */ {_SigNotify, "SIGUSR2: user-defined signal 2"},
        /* 13 */ {_SigNotify, "SIGPIPE: write to broken pipe"},
        /* 14 */ {_SigNotify, "SIGALRM: alarm clock"},

And voila! Now GDB has a better stack trace:

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000002db2719 in __tsan::SyncClock::Resize(__tsan::DenseSlabAllocCache*, unsigned long) ()
(gdb) bt
#0  0x0000000002db2719 in __tsan::SyncClock::Resize(__tsan::DenseSlabAllocCache*, unsigned long) ()
#1  0x0000000002db2b23 in __tsan::SyncClock::Unshare(__tsan::DenseSlabAllocCache*) ()
#2  0x0000000002db2e33 in __tsan::ThreadClock::ReleaseStore(__tsan::DenseSlabAllocCache*, __tsan::SyncClock*) ()
#3  0x0000000002db3e7c in __tsan::ReleaseStore(__tsan::ThreadState*, unsigned long, unsigned long) ()
#4  0x00000000006ce983 in racecall () at /home/benesch/dev/go/src/runtime/race_amd64.s:381
#5  0x00007fb7abcf73d0 in ?? ()
#6  0x0000000000800000 in regexp.(*machine).alloc (i=0x7fb7877fe700, m=0x0) at /home/benesch/dev/go/src/regexp/exec.go:100
#7  regexp.(*machine).add (cap=..., cond=0 '\000', m=0x0, pc=3496985344, pos=0, q=0x0, t=0x0, ~r6=<optimized out>) at /home/benesch/dev/go/src/regexp/exec.go:296
#8  0x0000000000000000 in ?? ()

I suspect frames 5 - 8 are garbage; GDB seems to have trouble unwinding through racecall. Next stop is trying to get line numbers on the ThreadSanitizer frames.

benesch commented 5 years ago

Ok, plumbing a -g into the build process for the .syso seems to have worked. Here's a more complete stack trace:

#0  __tsan::SyncClock::Unshare (this=0x7f94eea294f8, c=0x7f95088b2920) at ./gotsan.cc:773
#1  0x0000000002db3fff in __tsan::ThreadClock::ReleaseStore (this=0x7f94e2aa8c60, c=<optimized out>, dst=0x7f94eea294f8) at ./gotsan.cc:565
#2  0x0000000002dc615f in AtomicCAS<unsigned int> (mo=__tsan::mo_acq_rel, fmo=__tsan::mo_acquire, v=<optimized out>, c=<synthetic pointer>, a=0xc00425c840, pc=7135484, 
    thr=0x7f94e2aa8c08) at ./gotsan.cc:1550
#3  AtomicCAS<unsigned int> (fmo=__tsan::mo_acquire, mo=__tsan::mo_acq_rel, v=<optimized out>, c=0, a=0xc00425c840, pc=7135484, thr=0x7f94e2aa8c08) at ./gotsan.cc:1573
#4  __tsan_go_atomic32_compare_exchange (thr=0x7f94e2aa8c08, cpc=<optimized out>, pc=7135484, a=0xc002ec1bd8 "@\310%\004\300") at ./gotsan.cc:2077
#5  0x00000000006ce223 in racecall () at /home/benesch/dev/go/src/runtime/race_amd64.s:381

I put the corresponding gotsan.cc here: https://gist.github.com/benesch/b62181af5aa6d3d7efa0761e34e7df68.
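
For the record, the .syso build in question is compiler-rt's lib/tsan/go/buildgo.sh, which concatenates the tsan sources into gotsan.cc and compiles them in one shot. Roughly, assuming a compiler-rt checkout and my Go tree at ~/dev/go:

$ cd compiler-rt/lib/tsan/go
$ # add -g to the compile flags inside buildgo.sh, then:
$ ./buildgo.sh                                            # writes gotsan.cc and race_linux_amd64.syso
$ cp race_linux_amd64.syso ~/dev/go/src/runtime/race/
$ cd ~/dev/go/src && ./make.bash                          # rebuild Go against the debug-enabled syso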

benesch commented 5 years ago

Another crash, this time at gotsan.cc:570. The common theme is iterating over the elements of a SyncClock.

leventeliu commented 5 years ago

@benesch Yes, we are seeing the exact same FATAL while testing go-sqlite3, which is a SQLite driver implemented via cgo, and it can easily be reproduced with the stress tool on any Go 1.11.x version. Here is our public project. Go to xenomint/sqlite, run go test -v -race -o sqlite_test.o and then stress ./sqlite_test.o, and see what happens. I have done some digging on this but so far have no clue.

benesch commented 5 years ago

Excellent, I can reproduce. It's definitely the same problem. Thanks so much for reporting, @leventeliu! This is strong evidence that the bug is in either the Go runtime or ThreadSanitizer itself.

@dvyukov FYI this new reproduction also disappears if I build the Go race detector with https://github.com/llvm-mirror/compiler-rt/commit/12d16901ebba5bc7185298e9efa3c4d265fa52ce reverted.

dvyukov commented 5 years ago

Thanks for all the debugging. I will take a closer look. But I am travelling for the next few weeks, so I can't promise when exactly that will happen.

benesch commented 5 years ago

Gotcha, no problem. I'll keep digging in the meantime. On a hunch I disabled clock sharing in ThreadSanitizer with the following diff:

diff --git a/lib/tsan/rtl/tsan_clock.cc b/lib/tsan/rtl/tsan_clock.cc
index ef984a45c..0e7f21ba0 100644
--- a/lib/tsan/rtl/tsan_clock.cc
+++ b/lib/tsan/rtl/tsan_clock.cc
@@ -238,7 +238,7 @@ void ThreadClock::ReleaseStore(ClockCache *c, SyncClock *dst) {
   DCHECK_LE(dst->size_, kMaxTid);
   CPP_STAT_INC(StatClockStore);

-  if (dst->size_ == 0 && cached_idx_ != 0) {
+  if (dst->size_ == 0 && cached_idx_ != 0 && false) {
     // Reuse the cached clock.
     // Note: we could reuse/cache the cached clock in more cases:
     // we could update the existing clock and cache it, or replace it with the
@@ -296,7 +296,7 @@ void ThreadClock::ReleaseStore(ClockCache *c, SyncClock *dst) {

   // If the resulting clock is cachable, cache it for future release operations.
   // The clock is always cachable if we released to an empty sync object.
-  if (cached_idx_ == 0 && dst->Cachable()) {
+  if (cached_idx_ == 0 && dst->Cachable() && false) {
     // Grab a reference to the ClockBlock.
     atomic_uint32_t *ref = ref_ptr(dst->tab_);
     if (atomic_load(ref, memory_order_acquire) == 1)

and the problem disappeared! So my best guess is that ThreadSanitizer is somehow corrupting its own internal state when it shares clocks.

benesch commented 5 years ago

Indeed, this seems to be a straightforward double free. I'm going to try to write up a targeted test case and figure out how to submit a patch to LLVM.

dvyukov commented 5 years ago

You debugged the root cause? That was brave!

You already know how to build it, right? Here are some docs on contributing: https://github.com/google/sanitizers/wiki/AddressSanitizerHowToContribute https://llvm.org/docs/Contributing.html https://llvm.org/docs/Phabricator.html

There is probably lots of irrelevant stuff. The main part is to send a Phabricator review.

dvyukov commented 5 years ago

And a test would be good.

benesch commented 5 years ago

I think I spoke too soon. I thought the bug was a simple reference counting error with shared SyncClocks, but the problem appears to be deeper than that. In the course of my investigation I added a bit of logging to the DenseSlabAllocator. Here's a snapshot of the period before an assertion failure:

clock allocator: alloc 2522 (curproc 0x00000279c798, tid e7fc700, pos 60)
clock allocator: free  4442 (curproc 0x00000279c798, tid a1ffb700, pos 60)
clock allocator: free  4441 (curproc 0x00000279c798, tid a1ffb700, pos 61)
clock allocator: free  4440 (curproc 0x00000279c798, tid a1ffb700, pos 62)
clock allocator: free  4439 (curproc 0x00000279c798, tid a1ffb700, pos 63)
clock allocator: alloc 4441 (curproc 0x00000279c798, tid e7fc700, pos 61)
clock allocator: free  4443 (curproc 0x00000279c798, tid a1ffb700, pos 64)
clock allocator: alloc 4443 (curproc 0x00000279c798, tid e7fc700, pos 64)
clock allocator: alloc 4439 (curproc 0x00000279c798, tid e7fc700, pos 63)
clock allocator: alloc 4440 (curproc 0x00000279c798, tid e7fc700, pos 62)
clock allocator: alloc 4441 (curproc 0x00000279c798, tid e7fc700, pos 61)
clock allocator: alloc 5257 (curproc 0x00000279da58, tid 42ffd700, pos 9)
FATAL: ThreadSanitizer CHECK failed: ../rtl/tsan_dense_alloc.h:86 "((idx)) != ((0))" (0x0, 0x0)

Notice how slot 4441 is allocated twice without an intervening free! This is a guaranteed disaster. curproc is the address of the proc struct returned by get_cur_proc(), while tid is the opaque ID reported by pthread_self().

The problem seems to be that the proc struct 0x0279c798, which contains a non-threadsafe allocator cache, is simultaneously in use by goroutines e7fc700 and a1ffb700. (It's possible that those two goroutines were quickly scheduled and rescheduled on the same pthread, but I doubt it.)

So is it possible that the Go runtime is sometimes failing to update the proc struct when it moves a goroutine between pthreads? That's my best guess as to what's happening. It explains why this badness has only been observed in Go programs, and not any C++ programs.

Disclaimer that I'm not 100% confident in my instrumentation, so I might be a little bit off here.

benesch commented 5 years ago

I gathered some more evidence for the theory that the Go runtime is sharing processor data across multiple threads. I started by annotating several of the calls into tsan. Here's a snippet.

diff --git a/lib/tsan/rtl/tsan_rtl.h b/lib/tsan/rtl/tsan_rtl.h
index 5e2a745c9..130b33609 100644
--- a/lib/tsan/rtl/tsan_rtl.h
+++ b/lib/tsan/rtl/tsan_rtl.h
@@ -356,6 +356,7 @@ struct Processor {
   DenseSlabAllocCache sync_cache;
   DenseSlabAllocCache clock_cache;
   DDPhysicalThread *dd_pt;
+  bool inuse;
 };

 #if !SANITIZER_GO

diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 71a660683..ba0dd4a08 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
@@ -231,48 +231,84 @@ void __tsan_proc_destroy(Processor *proc) {
 }

 void __tsan_acquire(ThreadState *thr, void *addr) {
+  CHECK(!thr->proc()->inuse);
+  thr->proc()->inuse = true;
   Acquire(thr, 0, (uptr)addr);
+  CHECK(thr->proc()->inuse);
+  thr->proc()->inuse = false;
 }

The full diff is here: https://gist.github.com/benesch/cb3258a3eb4b573b5a3b7891db000e66

The reproductions of the original bug now fail with

FATAL: ThreadSanitizer CHECK failed: ./gotsan.cc:234 "((!thr->proc()->inuse)) != (0)" (0x0, 0x0)

instead of the various assertion errors they would previously fail with. That strongly implies that the root cause of all the assertion failures is the incorrect sharing of thr->proc(). And programs which did not trigger the previous assertions do not trigger this new assertion.

I've started poking around the Go scheduler but I'm way out of my depth here. I'll note that this diff to disable stealing of Ps in syscalls

diff --git a/src/runtime/proc.go b/src/runtime/proc.go
index f82014eb92..5993e2a6da 100644
--- a/src/runtime/proc.go
+++ b/src/runtime/proc.go
@@ -1081,7 +1081,7 @@ func stopTheWorldWithSema() {
    // try to retake all P's in Psyscall status
    for _, p := range allp {
        s := p.status
-       if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
+       if s == _Psyscall && false && atomic.Cas(&p.status, s, _Pgcstop) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
@@ -1455,7 +1455,7 @@ func forEachP(fn func(*p)) {
    // off to induce safe point function execution.
    for _, p := range allp {
        s := p.status
-       if s == _Psyscall && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
+       if s == _Psyscall && false && p.runSafePointFn == 1 && atomic.Cas(&p.status, s, _Pidle) {
            if trace.enabled {
                traceGoSysBlock(p)
                traceProcStop(p)
@@ -4457,7 +4457,7 @@ func retake(now int64) uint32 {
        }
        pd := &_p_.sysmontick
        s := _p_.status
-       if s == _Psyscall {
+       if s == _Psyscall && false {
            // Retake P from syscall if it's there for more than 1 sysmon tick (at least 20us).
            t := int64(_p_.syscalltick)
            if int64(pd.syscalltick) != t {

vastly decreases the probability that xenomint/sqlite will hit the tsan assertion. (I initially thought it had fixed the problem, but alas.) With the tsan patch but not the Go scheduler patch, running xenomint/sqlite's tests under race reliably hits the tsan assertion on every run. With both patches, I managed to get 187 runs before it failed. I'm not sure what to make of that.

dvyukov commented 5 years ago

If 2 threads use the same Processor, it would explain all kinds of weird memory corruptions in tsan runtime.

thr->proc() obtains the current Processor from the Go runtime as g.m.p.racectx. The firing CHECK means that either two goroutines run with the same g.m, or two m's run with the same m.p. Both of which look badly wrong.

I am thinking of some way to check this condition reliably in the Go runtime and catch it as early as possible. But so far I have not come up with anything.

dvyukov commented 5 years ago

This does not happen with Go1.10, right? Then one option is to bisect Go from Go1.10 to Go1.11 to find the root cause.

benesch commented 5 years ago

This does not happen with Go1.10, right? Then one option is to bisect Go from Go1.10 to Go1.11 to find the root cause.

Actually, I think this does still happen with Go 1.10, just far, far less frequently. I checked my recollection with @petermattis today, and we remembered seeing spurious failures from race-enabled CockroachDB builds at least as far back as early 2017. The difference was that those failures would only show up once every few weeks, back when we used to run a several-node CockroachDB cluster with race binaries. As far as I can remember, the symptoms were the same: various assertion failures inside tsan that were rather inexplicable.

I'll double-check now.

benesch commented 5 years ago

Yep, xenomint/sqlite blows up on go1.10 if you add the additional assertions to the race detector. Looks like this has been a latent bug in the Go runtime.

benesch commented 5 years ago

Perhaps this isn't so bad. On a hunch, I added a bit more information to the new assertions:

diff --git a/lib/tsan/go/tsan_go.cc b/lib/tsan/go/tsan_go.cc
index 5f2507b7d..d3345c6b7 100644
--- a/lib/tsan/go/tsan_go.cc
+++ b/lib/tsan/go/tsan_go.cc
 void __tsan_acquire(ThreadState *thr, void *addr) {
+  auto inuse_goid = thr->proc()->inuse_goid;
+  if (inuse_goid != 0) {
+    __sanitizer::Printf("tsan_acquire(%p): proc in use by %d (this goid: %d)\n", addr, inuse_goid, thr->goid);
+    __sanitizer::Die();
+  }
+  thr->proc()->inuse_goid = thr->goid;
   Acquire(thr, 0, (uptr)addr);
+  CHECK_EQ(thr->proc()->inuse_goid, thr->goid);
+  thr->proc()->inuse_goid = 0;
 }

Then I noticed that all of the failures seemed to be triggered by a call to __tsan_acquire:

tsan_acquire(0x000005035be8): proc in use by 48 (this goid: 49)

Even more suspiciously, every failure was an acquisition of exactly the same memory address, 0x5035be8. Kind of seems like a global variable...

$ nm sqlite.test | grep 000005035be8
0000000005035be8 s runtime.racecgosync

Looks like cgocall is just calling raceacquire(&racecgosync) before it allocates a P (via exitsyscall), and that's not valid.

I'm hopeful that I've actually found the root cause this time.
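
Concretely, the sequence in go1.11's runtime/cgocall.go is roughly this (a paraphrased sketch with details elided, not the verbatim source):

func cgocall(fn, arg unsafe.Pointer) int32 {
    mp := getg().m
    mp.ncgo++
    entersyscall()  // releases this M's P; g.m.p is left as a dangling hint
    mp.incgo = true
    errno := asmcgocall(fn, arg)
    endcgo(mp)      // raceacquire(&racecgosync) happens in here, using a P
                    // that sysmon may already have retaken
    exitsyscall()   // only now does the M reacquire a P
    return errno
}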

dvyukov commented 5 years ago

Just a minute ago I started looking at:

func endcgo(mp *m) {
    mp.incgo = false
    mp.ncgo--

    if raceenabled {
        raceacquire(unsafe.Pointer(&racecgosync))
    }
}

:)

dvyukov commented 5 years ago

It needs to do the raceacquire after exitsyscall. I think we are bitten by the dangling g.m.p pointer left by reentersyscall.

dvyukov commented 5 years ago

cgocallbackg is also badly broken.

dvyukov commented 5 years ago

@benesch do you want to fix it yourself? Since you have already done so much work on this, I don't want to steal it from you.

I think we also need to change reentersyscall to leave the dangling p pointer in some shadow variable instead of in g.m.p. Having g.m.p set gives the impression that the m and p are wired together when they actually are not. Then only exitsyscallfast and retake would use the shadow p variable. That would have caught this bug.

benesch commented 5 years ago

Whoops, sorry, missed your messages! About to send a CL.

I like your idea of a shadow variable but I didn't want to embark on that refactor myself, so I took a less intrusive approach. Now that I have your blessing for the refactor I'm happy to do that instead since it will prevent future mistakes of this kind—I agree that it is extremely confusing that an M can point to a P that it doesn't currently own. But I'll send the CL for what I have in the meantime to get the discussion started.

gopherbot commented 5 years ago

Change https://golang.org/cl/148717 mentions this issue: runtime: never call into race detector without runnable P

benesch commented 5 years ago

I think we also need to change reentersyscall to leave the dangling p pointer in some shadow variable instead of in g.m.p. Having g.m.p set gives the impression that the m and p are wired together when they actually are not. Then only exitsyscallfast and retake would use the shadow p variable. That would have caught this bug.

Ok, updated the CL to take this approach.

petermattis commented 5 years ago

Nice sleuthing! Gentle plea for this to be back-ported to go1.11 as this bug is preventing CockroachDB from upgrading to go1.11 (yes, it only affects race builds, but we run go test -race during CI).