Open vsivsi opened 2 years ago
CC @ianlancetaylor, @prattmic, @FiloSottile.
Hmm, this may be the VZEROUPPER
we introduced in CL 219131 to fix #37174. Or it may be worse than that, I'm not sure.
@cherrymui
@vsivsi Can you try patching out #ifdef GOOS_darwin
clause in runtime/preempt_amd64.s
and see if that helps?
The runtime doesn't async preempt assembly functions. asyncPreempt (the function in runtime/preempt_amd64.s) should not be executed if it is running in an assembly function. The runtime sends a signal, but if the signal lands in an assembly function the signal handler will return immediately without actually preempt.
I suspect that the darwin kernel doesn't preserve AVX512 state when handling signals. Then the problem would not be preemption, but any asynchronous signal could cause it to fail, including profiling signals or user signals.
What version of macOS are you running on?
I have a vague memory that Apple suggests that before using AVX512 one must check its availability with some sysctl (but I'm not really sure). Do we do that?
@randall77 I can try that, but the initial issue I encountered (as reflected in the repro case) was clobbering of the K0-K7 opmask registers, which are distinct from the vector registers affected by VZEROUPPER
, etc.
@cherrymui Yes, Darwin requires special checking to identify AVX512 support. I submitted a fix for this last year. See here for details on that saga: https://github.com/golang/go/issues/43089
It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?
To maintain compatibility with existing assembly functions, the compiler generates adapter functions that convert between the new register-based calling convention and the previous stack-based calling convention. These adapters are typically invisible to users, except that taking the address of a Go function in assembly code or taking the address of an assembly function in Go code using reflect.ValueOf(fn).Pointer() or unsafe.Pointer will now return the address of the adapter. Code that depends on the value of these code pointers may no longer behave as expected. Adapters also may cause a very small performance overhead in two cases: calling an assembly function indirectly from Go via a func value, and calling Go functions from assembly.
How we detect asm routines during preempt changed in Go 1.18. (https://github.com/golang/go/commit/c2483a5c034152fcdfbb2e6dbcf48b0103d8db6a#diff-47fd68949147d260b91998c0d6eabffcdb74c58991fc386b40e14a6ed710b17d)
Does this reproduce at tip? (I'm AFK.)
It does, with:
% go version
go version devel go1.18-6113dacf32 Sat Oct 30 18:30:34 2021 +0000 darwin/amd64
% sysctl machdep.cpu.brand_string
machdep.cpu.brand_string: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
% sw_vers
ProductName: macOS
ProductVersion: 11.6
BuildVersion: 20G165
It is quite possible that this issue I'm seeing is new with golang 1.17. Is it possible that the new assembly adapter funcs required for inter-operation with the new function parameter passing spec could be interfering with the assembly code detection during preemption?
It seems unlikely. You can try building with Go 1.17 with GOEXPERIMENT=noregabi
environment variable set (at go run/build/test
time).
Also, you can try running the program with preemption disabled but CPU profiling enabled and see if profiling signal changes anything. You can also try running the program with preemption disabled while having a separate process sending SIGURG signals to it. Thanks.
@cherrymui Good suggestions. Running the repro test case as:
% GODEBUG=asyncpreemptoff=1 go test -cpuprofile cpu.out -count 1 -timeout 15m -run '^TestMask$' gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
--- FAIL: TestMask (0.26s)
maskcheck_test.go:21: Failed for iteration: 0 with return value: 47429074
maskcheck_test.go:21: Failed for iteration: 1 with return value: 45358536
maskcheck_test.go:21: Failed for iteration: 2 with return value: 52675366
maskcheck_test.go:21: Failed for iteration: 3 with return value: 44456167
maskcheck_test.go:21: Failed for iteration: 4 with return value: 72420817
maskcheck_test.go:21: Failed for iteration: 5 with return value: 72654067
maskcheck_test.go:21: Failed for iteration: 6 with return value: 72091434
maskcheck_test.go:21: Failed for iteration: 7 with return value: 73369163
maskcheck_test.go:21: Failed for iteration: 8 with return value: 72828760
maskcheck_test.go:21: Failed for iteration: 9 with return value: 73065114
maskcheck_test.go:21: Failed for iteration: 10 with return value: 73038149
maskcheck_test.go:21: Failed for iteration: 11 with return value: 72684990
maskcheck_test.go:21: Failed for iteration: 12 with return value: 72653599
maskcheck_test.go:21: Failed for iteration: 13 with return value: 72440260
maskcheck_test.go:21: Failed for iteration: 14 with return value: 72201890
maskcheck_test.go:21: Failed for iteration: 15 with return value: 45021003
maskcheck_test.go:21: Failed for iteration: 16 with return value: 72012922
maskcheck_test.go:21: Failed for iteration: 17 with return value: 44692569
maskcheck_test.go:21: Failed for iteration: 18 with return value: 72992403
maskcheck_test.go:21: Failed for iteration: 19 with return value: 73178568
FAIL
FAIL gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32 0.601s
FAIL
This leads to the test failing, and with different looking stats (each iteration fails earlier in the loop on average, probably due to different signaling rates between profiling and preemption).
Two other data points:
@vsivsi thanks for the repro! So it seems it is not related preemption but signals in general. The observation about K0-K7 also suggests VZEROUPPER is not the cause.
Would it be possible to write a similar test in C and assembly (GNU syntax) and see if it is failing?
Maybe we need a different CPU feature probe to tell the kernel we're using K0-K7? (I haven't looked at the code carefully. Just a guess.)
@cherrymui The mechanism for using AVX-512 in Darwin doesn't involve telling the kernel which features to enable. The kernel simply advertises the available AVX-512 features in the process commpage data area. Once you've verified that a given required set of AVX-512 features is available, you simply use them. Darwin catches the UD interrupt for any AVX-512 instruction on first use, and promotes that thread to use the full AVX512 thread state, enabling every available feature.
Specifically, the three XCR0[5:7] bits (from XGETBV) indicate whether XSAVE will preserve:
Upon thread promotion following the UD interrupt from a supported AVX512 instruction, Darwin sets XCR0[5:7] = 111b, for the thread, enabling the full "AVX512 state" in XSAVE.
The kernel definitions of these XCR0 bits are here: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/proc_reg.h#L168-L188
The full AVX512 enabled mask is composed here: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.h#L71
And this is the function that runs when a UD for an AVX512 instruction is trapped, triggering the XSAVE state promotion: https://github.com/apple/darwin-xnu/blob/8f02f2a044b9bb1ad951987ef5bab20ec9486310/osfmk/i386/fpu.c#L1472
@cherrymui Here is an attempted reproduction of this issue in GNU C and Asm. In my testing it does not fail.
https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5
% gcc testmask.c testmask.s && ./a.out &
[1] 27524
% for ((x=0;x<1000000;x++)); do kill -s URG 27524; done
Test Passed!
[1] + done ./a.out
Thanks for trying that. In your C code, could you install a signal handler for SIGURG (which could simply return immediately)? Thanks.
@cherrymui Gist updated with empty SIGURG handler. Now it fails as with go runtime:
% gcc testmask.c testmask.s && ./a.out &
[1] 28786
% for ((x=0;x<20;x++)); do kill -s URG 28786; sleep 0.01; done
Failed for iteration: 133 with return value: 90069936
Failed for iteration: 134 with return value: 79705026
Failed for iteration: 135 with return value: 79757341
Failed for iteration: 136 with return value: 79875856
Failed for iteration: 137 with return value: 82261267
Failed for iteration: 138 with return value: 83127231
Failed for iteration: 139 with return value: 83641375
Failed for iteration: 140 with return value: 83352568
Failed for iteration: 141 with return value: 82555528
Failed for iteration: 142 with return value: 79494222
Failed for iteration: 143 with return value: 78981164
Failed for iteration: 144 with return value: 81261765
Failed for iteration: 145 with return value: 80885813
Failed for iteration: 146 with return value: 79706999
Failed for iteration: 147 with return value: 83441689
Failed for iteration: 148 with return value: 80171870
Failed for iteration: 149 with return value: 79769830
Failed for iteration: 150 with return value: 81361308
Failed for iteration: 151 with return value: 81237987
Failed for iteration: 152 with return value: 79568372
Test failed 20 times!
[1] + done ./a.out
This reminds me of https://github.com/golang/go/issues/37174#issuecomment-584914106 Sounds like Apple needs some more regression tests for their sigaction code path.
I'm sure you all are happy that this doesn't appear to be a problem in golang, but it still seems like a problem for golang, no?
Thanks @vsivsi ! That indicates the kernel's signal handling code doesn't seem to do the right thing.
So what's the next step here? Throwing a process into an indeterminate state via external signal feels like a CVE-level issue.
If there is a simple workaround that we could do it in the runtime, we'll do it, like the VZEROUPPER for #37174 . I don't know how to work around this one, though. One probably has to either not use K0-K7 registers or block all signals.
I guess the next step would be reporting to Apple and have them fix the kernel.
While they're at it, they should fix this one too: https://github.com/golang/go/issues/42649
Also, not using the K0-K7 opmasks == not using AVX-512 for all practical purposes.
We need to report a bug to Apple. I can do that - I've done it before, and supposedly a report from Google has a bit more visibility than one from random internet user. Any idea what range of OS versions this causes a problem on? I don't see a OS version in the OP.
@randall77 I've seen it on Catalina (10.15.x) and Big Sur (11.x), but I bet it goes all the way back to the first AVX-512 support, which I believe was High Sierra (10.13.x) when the first iMac Pro released with a Xeon supporting AVX-512.
As it is not safe to use, maybe we could consider hardwire AVX512 to false on Darwin until the kernel fix?
We could hardwire internal/cpu (and x/sys/cpu) flags to false for avx512. But if the asm uses cpuid itself to figure out what is available, I don't think we'd be able to prevent avx512 use in that case.
The only influence golang has over this is the value set for x/sys/cpu.X86.HasAVX512
which anyone is free to ignore (at their potential peril). But temporarily disabling that won't solve the problem of existing go binaries, or running on unpatched Darwin after Apple releases a fix. The robust solution would be to actually test for the live vuln when setting x/sys/cpu.X86.HasAVX512
at runtime.
Here's a simplified reproduction, just run it and see if it prints anything:
testmask.c
#include <stdio.h>
#include <inttypes.h>
#include <signal.h>
#include <unistd.h>
#include <pthread.h>
volatile int8_t done = 0;
// in testmask.s
uint32_t masktest(uint32_t x);
void sig_handler(int signum) {
// Tell the assembly loop to return.
done = 1;
}
void *tester(void* unused) {
uint32_t x = 0xabcd1234;
uint32_t y = masktest(x);
if (x != y) {
printf("bad: %x -> %x\n", x, y);
}
return NULL;
}
int main(int argc, char *argv[]) {
// Set up simple signal handler.
signal(SIGURG, sig_handler);
// Start worker.
pthread_t worker;
pthread_create(&worker, NULL, tester, NULL);
// Wait until the worker is in its assembly loop.
usleep(1000);
// Send a signal to the worker.
pthread_kill(worker, SIGURG);
// Wait until worker is done.
pthread_join(worker, NULL);
}
testmask.s
.globl _done
.globl _masktest
_masktest:
// Put a value in the k1 avx512 register.
kmovd %edi, %k1
// Wait until we're told to return.
loop:
cmpb $0, _done(%rip)
je loop
// Return the value in the k1 register.
kmovd %k1, %eax
ret
Build with gcc testmask.c testmask.s
and run with ./a.out
.
Mac OSX bug report here: https://feedbackassistant.apple.com/feedback/9736652 (If you can see it, not sure how they deal with permissions.)
There doesn't seem to be an easy way to let anyone else see my bug report. So here's a screenshot of it:
Change https://golang.org/cl/361255 mentions this issue: cpu: pretend AVX-512 is disabled on Darwin
Posted to Intel software dev forum:
As far as I can tell, the opmask register state is saved on entry to the signal handler, as the value from the opmask register appears in a dump of the signal context from within the signal handler. The problem must be on the return side.
Seemed like a long-shot, but I just re-tested after updating MacOS with this morning's security updates. This issue persists in the latest builds:
System Version: macOS 11.6.1 (20G224)
Kernel Version: Darwin 20.6.0
System Version: macOS 10.15.7 (19H1519)
Kernel Version: Darwin 19.6.0
@randall77 @cherrymui
I'm nearly certain I've found the kernel bug responsible for this issue by inspection.
It looks like it was introduced in Catalina 10.15.6, and has always existed in Big Sur 11.x. I'm presently installing Monterey on one of my MacBook Pros to see what the status there is, since Apple hasn't yet released the Darwin source for MacOS 12.x. Edit to confirm: The issue also exists in Monterey 12.0.1 / Darwin 21.1.0 (21A559).
It seems not to exist in Mojave 10.14.x or before, although that would only impact the iMac Pro, since that was the only pre-10.15 machine with AVX-512.
The issue is a set of changes introduced in 10.15.6 that are aimed at improving performance when AVX-512 is on, but the thread isn't using the ZMM extended state:
This is analogous to your VZEROUPPER optimization for eliminating phantom SSE4 dependencies. But, when the code detects that ZMM_Hi256 state and Hi16_ZMM state are all zeros, it disables XSAVE restore of the "opmask state" as well. This is not a safe assumption, as it is perfectly valid to have non-zero values in K0-K7 while having all zeroed vector registers (as our various reproductions likely illustrate). With AVX-512VL, it is also valid to use EVEX encoded AVX-512 instructions, with opmasks, while only touching the AVX YMM vector state.
My hypothesis above is confirmed by adding a single instruction to the reproduction assembly code sent to Apple:
testmask.s
.globl _done
.globl _masktest
_masktest:
// Storing a non-zero value in ZMM31 suppresses the kernel bug.
vpbroadcastd %edi, %zmm31
// Put a value in the k1 avx512 register.
kmovd %edi, %k1
// Wait until we're told to return.
loop:
cmpb $0, _done(%rip)
je loop
// Return the value in the k1 register.
kmovd %k1, %eax
ret
So, here's the offending code in 10.15.6:
if (fpu_allzeroes((uint64_t *)(void *)iavx->x_Hi16_ZMM, 16 * sizeof(_STRUCT_ZMM_REG)) == TRUE &&
fpu_allzeroes((uint64_t *)(void *)iavx->x_ZMM_Hi256, 16 * sizeof(_STRUCT_YMM_REG)) == TRUE) {
iavx->_xh.xstate_bv &= ~XFEM_ZMM;
}
From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L954
Where:
#define XFEM_ZMM (XFEM_ZMM_HI256 | XFEM_HI16_ZMM | XFEM_OPMASK)
From: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/proc_reg.h#L187
This change was introduced in Darwin xnu-6153.141.1, which corresponds with MacOS10.15.6.
So, the fix is to probably replace a single line of code in two locations (for the 32 and 64 bit cases):
// iavx->_xh.xstate_bv &= ~XFEM_ZMM;
iavx->_xh.xstate_bv &= ~(XFEM_ZMM_HI256 | XFEM_HI16_ZMM);
Here: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L928 and Here: https://github.com/apple/darwin-xnu/blob/xnu-6153.141.1/osfmk/i386/fpu.c#L954
Now we just need to get Apple to do it. In the meantime there's a weak-ish workaround... Ensure that there's always non-zero values in some part of the AVX-512 extended ZMM state.
Weird. When debugging I thought a use of a zmm register might help, so I added:
vpxor %zmm1, %zmm1, %zmm1
and it didn't help. Apparently I needed to set zmm1 to a nonzero value to really "use" that register :(
I'll update my Apple feedback submission.
@randall77 Yeah, I went back and checked the original rarely failing unit test case that kicked this all off... It was Avo generated code that only used the YMM
registers with opmasks in a 4x64 bit special case. I didn't pick up on the significance of that while performing the debugging that ultimately led to the repro I brought to this issue.
In thinking about the kernel bug presented above overnight, it clarified three things:
On reflection, I'd also like to add that the "one line fix" I presented above may not simultaneously resolve this issue while still satisfying the aims of the Darwin kernel engineers in making the revisions that appeared in Catalina 10.15.6. Another possible fix would be to leave the masking code the same, but add an additional check to ensure that the opmask state also has all zeroed bits. Or perhaps the opmask save/restore should be broken out as a separate case (combining the two approaches).
In any case, I hope Apple carefully audits all uses of XFEM_ZMM
(perhaps renaming it in the process to make clear it encompasses the opmask bit), and adds regression testing to ensure this type of confusion doesn't reoccur.
Just a quick update on point 3) in the post above. I've run the reproduction C/Asm code here: https://gist.github.com/vsivsi/8511aca1bac528f49fbb45a636afa4b5
As:
for ((x=0;x<250;x++)); do ./a.out & done
When run on a Mac Pro with 48 hyperthreads, each of the 250 processes created above is preemptively multitasked onto ~20% of a hyperthread with a large number of kernel scheduler preemptions per second.
In over an hour of CPU time, I've seen zero opmask corruptions. So I think it's fair to conclude that Darwin's preemptive multitasking does not use the faulty thread state resetting code spotted above.
😅
Change https://golang.org/cl/368994 mentions this issue: x/sys/cpu: remove the declaration of function darwinSupportsAVX512
Yesterday's security update to MacOS Catalina appears not to have updated the kernel, and my testing indicates this bug remains.
System Version: macOS 10.15.7 (19H1615)
Kernel Version: Darwin 19.6.0
Likewise, the upgrade to MacOS Monterey 12.1 also does not fix this issue.
System Version: macOS 12.1 (21C52)
Kernel Version: Darwin 21.2.0
I haven't tested the security update for Big Sur, but it would be very surprising if this had been fixed there and only there.
@randall77 @cherrymui
Apple appears to have fixed this kernel bug in Monterey 12.2, but frustratingly did not backport the fix into the corresponding security fixes for Catalina and Big Sur.
So per my testing this morning:
Fixed:
System Version: macOS 12.2 (21D49)
Kernel Version: Darwin 21.3.0
Still Broken:
System Version: macOS 11.6.3 (20G415)
Kernel Version: Darwin 20.6.0
System Version: macOS 10.15.7 (19H1713)
Kernel Version: Darwin 19.6.0
Can someone please verify the above results?
Pending confirmation of the Monterey 12.2 fix, we should put together a patch for x/sys/cpu.X86
to re-enable AVX-512 detection for Darwin 21.3 or later.
In preparation for re-enabling AVX-512 for fully patched MacOS >= Monterey 12.2, I've done a little research on the best way to detect the kernel version in Golang on Darwin.
The Darwin kernel version is not present in the process commpage where the AVX-512 capabilities are advertised. So it appears that the path of least resistance is going to be a call to x/sys/unix.Uname()
which I have already verified works as expected on Darwin. The Release
field of the unix.Utsname
struct contains the dotted kernel version number in a \x00
terminated [256]byte
. E.g. 21.3.0
followed by 250 \x00
s.
This function will return true
if the runtime kernel version is >= to the passed major/minor/patch values.
func darwinKernelVersionCheck(major, minor, patch int) bool {
var un unix.Utsname
err := unix.Uname(&un)
if err != nil {
return false
}
ts := bytes.Split(bytes.TrimRight(un.Release[:], "\x00"), []byte{'.'})
if len(ts) != 3 {
return false
}
var mmp [3]int
for i := 0; i < 3; i++ {
tmp, err := strconv.Atoi(string(ts[i]))
if err != nil {
return false
}
mmp[i] = tmp
}
return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
}
Re-enabling AVX-512 support for fully patched MacOS Monterey machines should simply involve:
Revert all changes in these two revisions:
Add the above function darwinKernelVersionCheck()
to x/sys/cpu/cpu_X86.go
, and modify the Darwin initialization of osSupportsAVX512
to:
if runtime.GOOS == "darwin" {
// Check for minimum darwin version and commpage for AVX512 support. Necessary because:
// https://github.com/apple/darwin-xnu/blob/0a798f6738bc1db01281fc08ae024145e84df927/osfmk/i386/fpu.c#L175-L201
// See also issue: 49233
osSupportsAVX512 = osSupportsAVX && darwinSupportsAVX512() && darwinKernelVersionCheck(21, 3, 0)
} else {
// ... non-darwin case
Edited to move the call to darwinKernelVersionCheck(21, 3, 0)
to the last position, since it is much more expensive than darwinSupportsAVX512()
which will be false on most Macs and short circuit the test, avoiding the final call.
BTW, if we prefer a version of darwinKernelVersionCheck()
without dependencies on bytes
and strconv
, the following function is equivalent:
func darwinKernelVersionCheck(major, minor, patch int) bool {
var un unix.Utsname
err := unix.Uname(&un)
if err != nil {
return false
}
var mmp [3]int
c := 0
Loop:
for _, b := range un.Release {
switch {
case b >= '0' && b <= '9':
mmp[c] = 10*mmp[c] + int(b-'0')
case b == '.':
c++
if c > 2 {
return false
}
case b == 0:
break Loop
default:
return false
}
}
if c != 2 {
return false
}
return mmp[0] > major || mmp[0] == major && (mmp[1] > minor || mmp[1] == minor && mmp[2] >= patch)
}
I've submitted a fix for this issue here: https://go-review.googlesource.com/c/sys/+/382314
Change https://go.dev/cl/560955 mentions this issue: runtime: remove VZEROUPPER in asyncPreempt on darwin/amd64
What version of Go are you using (
go version
)?Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?go env
OutputWhat did you do?
In writing AVX-512 assembly functions using the Go assembler (via Avo), I began to notice unexplained intermittent test failures. A lot of painful investigation revealed that golang async preemption does not seem to be properly save/restoring AVX512 state when assembly functions are interrupted. From visual inspection of the
runtime/preempt_amd64.s
source, it also seems likely that AVX/AVX2 state (YMM upper 128 bits) may not be properly save/restored either, but I have not encountered/tested that case in my own code.A complete minimal reproduction illustrating the clobbering of the AVX512 K1 opmask register is here: https://gist.github.com/vsivsi/fff8618ace4b02eb410dd8792779bf32
Note, in my testing running with
GODEBUG=asyncpreemptoff=1
rescues every failure case I have identified.What did you expect to see?
I expected all relevant processor state to be properly restored in an assembly function following an async preemption.
What did you see instead?
Critical state is being clobbered, leading to intermittent undefined behavior.