golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.85k stars 17.65k forks source link

runtime, x/net/internal/socket: thread sanitizer failing on ppc64le #35547

Closed randall77 closed 4 years ago

randall77 commented 4 years ago

Thread sanitizer is failing on linux-ppc64le-buildlet and linux-ppc64le-power9osu

The test is from the x/net repository, but I'm not sure that matters much.

From https://build.golang.org/log/7ba95758a5a8b138b85ba4a9f87a543c3177e884 :

--- FAIL: TestRace (16.94s)
    --- FAIL: TestRace/test_0 (16.04s)
        socket_test.go:363: race not detected for test 0: err:<nil> out:FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:14108 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
    --- FAIL: TestRace/test_1 (0.91s)
        socket_test.go:363: race not detected for test 1: err:<nil> out:FATAL: ThreadSanitizer CHECK failed: ./gotsan.cpp:14108 "((personality(old_personality | ADDR_NO_RANDOMIZE))) != ((-1))" (0xffffffffffffffff, 0xffffffffffffffff)
FAIL
FAIL    golang.org/x/net/internal/socket    16.953s

Some internal check in tsan is failing, causing this test to fail. We claim to support the race detector in this config, so this test should pass.

@laboger

laboger commented 4 years ago

Once again, I am not able to reproduce this failure. Maybe I'm not building and running this like is done with the buildlet.

I checked out the latest golang.org/x/net and rebuilt it with upstream Go. The tests all passed on my debian buster power9 machine.

I see the test says it is testing in module mode. Not exactly sure what that means. Is there documentation on how to run all the golang.org/x repos? I did copy the old go.mod file it was using and set up GOMOD to point to it and it still passed.

randall77 commented 4 years ago

I hate unreproducible bugs :(

Try checking out Go and x/net from the two hashes listed in that error log. Then maybe try using a gomote instead of your machine.

I don't think this failure would have anything to do with modules.

Maybe just try writing a helloworld race detector example and see if it works?

laboger commented 4 years ago

I ran all the race tests I know about in golang. They all passed on my power8 and power9.

in runtime/race: go test -race


Passed 348 of 348 tests (100.00%, 0+, 0-)
0 expected failures (0 has not fail)
--- PASS: TestRace (6.42s)
=== RUN   TestIssue8102
--- PASS: TestIssue8102 (0.00s)
=== RUN   TestIssue9137
--- PASS: TestIssue9137 (0.00s)
=== RUN   TestNonGoMemory
--- PASS: TestNonGoMemory (0.00s)
=== RUN   TestRandomScheduling
--- PASS: TestRandomScheduling (0.01s)
PASS
ok      runtime/race    15.214s

In golang.org/x/net at the commit from the failure:

:~/gocode/src/golang.org/x/net/internal/socket$ go test -test.v
=== RUN   TestSocket
=== RUN   TestSocket/Option
--- PASS: TestSocket (0.01s)
    --- PASS: TestSocket/Option (0.01s)
=== RUN   TestControlMessage
--- PASS: TestControlMessage (0.00s)
=== RUN   TestUDP
=== RUN   TestUDP/Message
=== RUN   TestUDP/Messages
--- PASS: TestUDP (0.00s)
    --- PASS: TestUDP/Message (0.00s)
    --- PASS: TestUDP/Messages (0.00s)
=== RUN   TestRace
=== RUN   TestRace/test_0
=== RUN   TestRace/test_1
--- PASS: TestRace (1.98s)
    --- PASS: TestRace/test_0 (0.98s)
    --- PASS: TestRace/test_1 (0.99s)
PASS
ok      golang.org/x/net/internal/socket    1.997s
boger@yanny6:~/gocode/src/golang.org/x/net/internal/socket$ uname -a
Linux yanny6 4.19.0-5-powerpc64le #1 SMP Debian 4.19.37-5+deb10u1 (2019-07-19) ppc64le GNU/Linux
go tool dist test race

##### Testing race detector
ok      runtime/race    7.897s
ok      flag    0.025s
ok      net 0.104s
ok      os  0.115s
ok      os/exec 0.083s
ok      encoding/gob    0.036s
ok      flag    0.028s
ok      os/exec 0.077s

ALL TESTS PASSED (some were excluded)

I do not have gomote access. Isn't the error message about personality related to some kind of system or kernel setting?

laboger commented 4 years ago

If I look at the code from LLVM from the error output I see this:

#elif SANITIZER_PPC64V2
  // Disable ASLR for Linux PPC64LE.
  int old_personality = personality(0xffffffff);
  if (old_personality != -1 && (old_personality & ADDR_NO_RANDOMIZE) == 0) {
    VReport(1, "WARNING: Program is being run with address space layout "
               "randomization (ASLR) enabled which prevents the thread and "
               "memory sanitizers from working on powerpc64le.\n"
               "ASLR will be disabled and the program re-executed.\n");
    CHECK_NE(personality(old_personality | ADDR_NO_RANDOMIZE), -1);
    ReExec();
  }

So I believe that error message is saying the test is supposed to be able to set the personality to ADDR_NO_RANDOMIZE but that failed.

ceseo commented 4 years ago

Can someone disable ASLR in the ppc64le builders (i.e. echo 0 | sudo tee /proc/sys/kernel/randomize_va_space) and see if it works? I can't reproduce on VMs or bare metal systems and I don't have a Docker instance available at the moment.

randall77 commented 4 years ago

Interesting. So that check is failing, so it fails to re-exec itself (which presumably would work?). I don't understand what that check is trying to achieve - is it just that -1 is a special case meaning "failure"? Anyone have the docs to the personality function?

Instead of fixing the reexec in the race detector, we could set ADDR_NO_RANDOMIZE on the binary that go build -race (and go run -race) generates.

laboger commented 4 years ago

Look up the documentation for the Linux personality syscall (you can google it). In this case the value of ADDR_NO_RANDOMIZE is not a setting on the binary but a setting for the personality of the process. If the syscall returns -1 (EINVAL) that means the kernel could not change the personality.

To me it looks like even though the syscall fails it continues to try to ReExec regardless of the syscall result but the comments say the race detector doesn't work in ASLR mode (due to unexpected address ranges) and in this test it is supposed to detect a race condition but does not so fails.

I don't know why it would fail to change the personality in this case -- it must work in some cases because this code is executed whenever using race on ppc64le and distros have had ASLR enabled for a while. Maybe if we had the errno that would help.

laboger commented 4 years ago

OK I think I found out the issue. When a Docker container is set up, there is a default seccomp security profile with a list of syscalls that are not allowed. The personality syscall is on that list. There is a way to define the container so it allows this syscall.

laboger commented 4 years ago

Can I get gomote access so I can try to understand this problem a bit more. It is curious as to why some race tests work fine (I think they must all have to do the personality syscall) but this one does not. Also I am not able to reproduce it.

dmitshur commented 4 years ago

Sorry for the delay in getting back to you @laboger. I wanted to update you that we're not able to give out gomote access at this time, unfortunately. We're still working on fixing that, but it will take more time.

I realize it's hard to make progress on this issue because it doesn't reproduce easily outside the ppc64le builder, and I'll try to look into what else we can do in the meantime to help make investigating this easier. I just wanted to share this update for now.

laboger commented 4 years ago

Since my last post I was able to reproduce this in a container. I have confirmed the theory that was mentioned in my post on Nov. 18.

When running LLVM's thread sanitizer code (used by Go's race detector) on a system with ASLR enabled, the address mappings needed for the thread sanitizer are not always usable on ppc64le so it calls the personality syscall to disable ASLR. Outside of a container this works fine, but within a container that is running with the default seccomp profile, it fails as in this testcase. This is because the default seccomp profile restricts the use of the personality syscall.

One solution would be to have an alternate security profile which allows the personality syscall and then use that profile for the container where the testing is done.

If you don't care about the seccomp profile for the container then the container can be run using --security-opt seccomp=unconfined. The test works if that option is used.

This does not fail on amd64 because they don't have the same problem with address ranges apparently. It is needed on arm64 but I don't think they have the race detector working for Go yet.

aclements commented 4 years ago

Thanks for figuring that out @laboger !

It sounds to me like there's nothing Go can do to fix or work around this (is that true?). If so, maybe we can add something to the documentation and resolve this bug?

bradfitz commented 4 years ago

@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined", to x/build/cmd/rundockerbuildlet.

/cc @toothrot @bcmills

laboger commented 4 years ago

@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined", to x/build/cmd/rundockerbuildlet.

Yes, I think that would be the easiest for now. There could be something documented to state that the use of the race detector on ppc64le within a container requires this option when starting the container.

gopherbot commented 4 years ago

Change https://golang.org/cl/214919 mentions this issue: cmd/rundockerbuildlet: set --security-opt=seccomp=unconfined in containers

toothrot commented 4 years ago

I've redeployed rundockerbuildlet for these builders. We ought to see a difference the next test run for x/net.

gopherbot commented 4 years ago

Change https://golang.org/cl/214979 mentions this issue: test: trybot test

laboger commented 4 years ago

The thread sanitizer failure was fixed on ppc64le with CL 214919.

ianlancetaylor commented 4 years ago

The discussion above suggests that this issue is fixed, so closing. Please comment if you disagree.