Closed randall77 closed 4 years ago
Once again, I am not able to reproduce this failure. Maybe I'm not building and running this like is done with the buildlet.
I checked out the latest golang.org/x/net and rebuilt it with upstream Go. The tests all passed on my debian buster power9 machine.
I see the test says it is testing in module mode. Not exactly sure what that means. Is there documentation on how to run all the golang.org/x repos? I did copy the old go.mod file it was using and set up GOMOD to point to it and it still passed.
I hate unreproducible bugs :(
Try checking out Go and x/net from the two hashes listed in that error log. Then maybe try using a gomote instead of your machine.
I don't think this failure would have anything to do with modules.
Maybe just try writing a helloworld race detector example and see if it works?
I ran all the race tests I know about in golang. They all passed on my power8 and power9.
in runtime/race: go test -race
Passed 348 of 348 tests (100.00%, 0+, 0-)
0 expected failures (0 has not fail)
--- PASS: TestRace (6.42s)
=== RUN TestIssue8102
--- PASS: TestIssue8102 (0.00s)
=== RUN TestIssue9137
--- PASS: TestIssue9137 (0.00s)
=== RUN TestNonGoMemory
--- PASS: TestNonGoMemory (0.00s)
=== RUN TestRandomScheduling
--- PASS: TestRandomScheduling (0.01s)
PASS
ok runtime/race 15.214s
In golang.org/x/net at the commit from the failure:
:~/gocode/src/golang.org/x/net/internal/socket$ go test -test.v
=== RUN TestSocket
=== RUN TestSocket/Option
--- PASS: TestSocket (0.01s)
--- PASS: TestSocket/Option (0.01s)
=== RUN TestControlMessage
--- PASS: TestControlMessage (0.00s)
=== RUN TestUDP
=== RUN TestUDP/Message
=== RUN TestUDP/Messages
--- PASS: TestUDP (0.00s)
--- PASS: TestUDP/Message (0.00s)
--- PASS: TestUDP/Messages (0.00s)
=== RUN TestRace
=== RUN TestRace/test_0
=== RUN TestRace/test_1
--- PASS: TestRace (1.98s)
--- PASS: TestRace/test_0 (0.98s)
--- PASS: TestRace/test_1 (0.99s)
PASS
ok golang.org/x/net/internal/socket 1.997s
boger@yanny6:~/gocode/src/golang.org/x/net/internal/socket$ uname -a
Linux yanny6 4.19.0-5-powerpc64le #1 SMP Debian 4.19.37-5+deb10u1 (2019-07-19) ppc64le GNU/Linux
go tool dist test race
##### Testing race detector
ok runtime/race 7.897s
ok flag 0.025s
ok net 0.104s
ok os 0.115s
ok os/exec 0.083s
ok encoding/gob 0.036s
ok flag 0.028s
ok os/exec 0.077s
ALL TESTS PASSED (some were excluded)
I do not have gomote access. Isn't the error message about personality related to some kind of system or kernel setting?
If I look at the code from LLVM from the error output I see this:
#elif SANITIZER_PPC64V2
// Disable ASLR for Linux PPC64LE.
int old_personality = personality(0xffffffff);
if (old_personality != -1 && (old_personality & ADDR_NO_RANDOMIZE) == 0) {
VReport(1, "WARNING: Program is being run with address space layout "
"randomization (ASLR) enabled which prevents the thread and "
"memory sanitizers from working on powerpc64le.\n"
"ASLR will be disabled and the program re-executed.\n");
CHECK_NE(personality(old_personality | ADDR_NO_RANDOMIZE), -1);
ReExec();
}
So I believe that error message is saying the test is supposed to be able to set the personality to ADDR_NO_RANDOMIZE but that failed.
Can someone disable ASLR in the ppc64le builders (i.e. echo 0 | sudo tee /proc/sys/kernel/randomize_va_space
) and see if it works? I can't reproduce on VMs or bare metal systems and I don't have a Docker instance available at the moment.
Interesting. So that check is failing, so it fails to re-exec itself (which presumably would work?). I don't understand what that check is trying to achieve - is it just that -1 is a special case meaning "failure"? Anyone have the docs to the personality
function?
Instead of fixing the reexec in the race detector, we could set ADDR_NO_RANDOMIZE on the binary that go build -race
(and go run -race
) generates.
Look up the documentation for the Linux personality syscall (you can google it). In this case the value of ADDR_NO_RANDOMIZE is not a setting on the binary but a setting for the personality of the process. If the syscall returns -1 (EINVAL) that means the kernel could not change the personality.
To me it looks like even though the syscall fails it continues to try to ReExec regardless of the syscall result but the comments say the race detector doesn't work in ASLR mode (due to unexpected address ranges) and in this test it is supposed to detect a race condition but does not so fails.
I don't know why it would fail to change the personality in this case -- it must work in some cases because this code is executed whenever using race on ppc64le and distros have had ASLR enabled for a while. Maybe if we had the errno that would help.
OK I think I found out the issue. When a Docker container is set up, there is a default seccomp security profile with a list of syscalls that are not allowed. The personality syscall is on that list. There is a way to define the container so it allows this syscall.
Can I get gomote access so I can try to understand this problem a bit more. It is curious as to why some race tests work fine (I think they must all have to do the personality syscall) but this one does not. Also I am not able to reproduce it.
Sorry for the delay in getting back to you @laboger. I wanted to update you that we're not able to give out gomote access at this time, unfortunately. We're still working on fixing that, but it will take more time.
I realize it's hard to make progress on this issue because it doesn't reproduce easily outside the ppc64le builder, and I'll try to look into what else we can do in the meantime to help make investigating this easier. I just wanted to share this update for now.
Since my last post I was able to reproduce this in a container. I have confirmed the theory that was mentioned in my post on Nov. 18.
When running LLVM's thread sanitizer code (used by Go's race detector) on a system with ASLR enabled, the address mappings needed for the thread sanitizer are not always usable on ppc64le so it calls the personality syscall to disable ASLR. Outside of a container this works fine, but within a container that is running with the default seccomp profile, it fails as in this testcase. This is because the default seccomp profile restricts the use of the personality syscall.
One solution would be to have an alternate security profile which allows the personality syscall and then use that profile for the container where the testing is done.
If you don't care about the seccomp profile for the container then the container can be run using --security-opt seccomp=unconfined. The test works if that option is used.
This does not fail on amd64 because they don't have the same problem with address ranges apparently. It is needed on arm64 but I don't think they have the race detector working for Go yet.
Thanks for figuring that out @laboger !
It sounds to me like there's nothing Go can do to fix or work around this (is that true?). If so, maybe we can add something to the documentation and resolve this bug?
@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined",
to x/build/cmd/rundockerbuildlet.
/cc @toothrot @bcmills
@aclements, it sounds to me like we need to add "--security-opt", "seccomp=unconfined", to x/build/cmd/rundockerbuildlet.
Yes, I think that would be the easiest for now. There could be something documented to state that the use of the race detector on ppc64le within a container requires this option when starting the container.
Change https://golang.org/cl/214919 mentions this issue: cmd/rundockerbuildlet: set --security-opt=seccomp=unconfined in containers
I've redeployed rundockerbuildlet
for these builders. We ought to see a difference the next test run for x/net
.
Change https://golang.org/cl/214979 mentions this issue: test: trybot test
The discussion above suggests that this issue is fixed, so closing. Please comment if you disagree.
Thread sanitizer is failing on linux-ppc64le-buildlet and linux-ppc64le-power9osu
The test is from the
x/net
repository, but I'm not sure that matters much.From https://build.golang.org/log/7ba95758a5a8b138b85ba4a9f87a543c3177e884 :
Some internal check in tsan is failing, causing this test to fail. We claim to support the race detector in this config, so this test should pass.
@laboger