Open bcmills opened 2 years ago
One more this week. Looking like a regression of some sort.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException' --since=2021-11-19
Hmmm, this might have been flaky for a while. I didn't realize it then, as I thought it was related to the crashdump change and it "went away" after a sync, but it was probably a flakiness I saw. My guess is that the recent regression is caused by changed timing due to the scavenger changes, but that's just a guess. The flakiness should be fixed anyway, I'll put it on the list.
crash_test.go:106: C:\Users\gopher\AppData\Local\Temp\1\go-build1560626842\testprog.exe RaiseException exit status: exit status 2989
syscall_windows_test.go:636: No stack trace:
I spent some time staring at this. 2989 is 0xbad, the value the test passes to RaiseException.
I also ran runtime.TestRaiseException and it would not fail for me. I also added t.Logf in the test to print the stack trace, and I can see that when the test succeeds, the child process always returns 2. The Go runtime calls runtime.exit(2) from our Windows exception handler. I suspect that in this failure the runtime.exit(2) call was not reached. Either the runtime crashed in the exception handler, or Windows did not call our exception handler for some reason.
Alex.
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too. When I tested, an exception I caused "directly", e.g. by reading a protected memory page, gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too.
I don't understand what you are saying. runtime.isgoexception should always return false in our scenario, because the function only returns true if the exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO, or one of the other constants in that case statement. But we pass the value 0xbad to RaiseException, and 0xbad does not match any of the constants in the case.
When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I guess your testing confirms that we do have a new problem here. How do I reproduce your "flaky" testing?
Thank you.
Alex
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too.
I don't understand what you are saying. runtime.isgoexception should always return false in our scenario, because the function only returns true if the exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO, or one of the other constants in that case statement. But we pass the value 0xbad to RaiseException, and 0xbad does not match any of the constants in the case.
You're totally right and I'm clearly wrong. I think you didn't understand what I was saying because what I was saying made no sense :) I was specifically throwing EXCEPTION_ACCESS_VIOLATION in my test program, which is why my theory was a little less out there back then...
When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I guess your testing confirms that we do have a new problem here. How do I reproduce your "flaky" testing?
I just used a small program calling RaiseException through cgo:
package main

// #include <windows.h>
// #include <stdio.h>
// #include <stdlib.h>
// #include <errhandlingapi.h>
//
// static void raise() {
//     RaiseException(EXCEPTION_ACCESS_VIOLATION, 0, 0, 0);
// }
import "C"

import (
	"time"
)

func main() {
	time.Sleep(time.Second)
	C.raise()
}
It became much less likely to succeed in dumping a stack trace during 1.18 development, but as you can see, it already needed a sleep to work to begin with, so it wasn't watertight from the start.
However, after trying to reproduce now, I think it's a red herring: after @randall77's https://go-review.googlesource.com/c/go/+/364174, the program above consistently works, even without the sleep. In fact, after that change, none of my sometimes-flaky test programs trying to create exceptions in different ways ever fail to dump a stack. That makes no sense to me, looking at the change itself. And it's also the complete opposite of what shows up on the builders: there it rather started happening after that change, or a few days after.
So now I have no idea, not even a bad one.
Thank you.
Alex
Thanks!
Patrik
I just used a small program calling RaiseException through cgo:
runtime.TestRaiseException does not use cgo. I expect the cgo and non-cgo runtimes are very different from each other. If your small cgo program works fine (dumps a stack trace every time), that does not prove anything.
Alex
Yea, probably not. I'll see if I can make it happen on a gomote with the test case itself, to see if there's any way to reproduce.
I managed to get some repros with some logging. Not conclusive in any way, but, as you mostly had concluded already, it does reach exceptionhandler (in signal_windows.go), which does return _EXCEPTION_CONTINUE_SEARCH, but then neither firstcontinuehandler nor lastcontinuehandler is called. As far as I can see they are not crashing; they are not called at all.
I don't know if it's really relevant, but it appears the change I mentioned above, 364174, also makes this error much more likely. I was never able to reproduce it before that change, but by spamming the gomotes, I can reproduce within a couple of minutes after that change. I still have no clue why that change would affect exception handling, of course. It might just change the timing or something.
It seems that in a successful run, exceptionhandler is called first, returning as isgoexception is false, then firstcontinuehandler is called, then lastcontinuehandler is called, which prints the stack trace. In a failed run, though, exceptionhandler is called and returns as usual, but firstcontinuehandler and lastcontinuehandler are not called. Instead, the process exits with error code 2989 (0xbad, instead of the usual 2). I don't know why the continue handlers are sometimes not called.
Yes. I don't know if the trampoline functions in asm are called, though. There is some code there beyond my comprehension in sys_windows_amd64:sigtramp (no surprise there :)), also using TLS, which was very vaguely related to change 364174. It does do some stuff to make sure it's running on the g0 stack; I don't know if that could be where the flakiness comes from?
I think sigtramp is also not called (for firstcontinuehandler and lastcontinuehandler). I added some code to print in sigtramp and I didn't see it print.
Yeah, I also cannot reproduce before CL 364174, or at tip with just CL 364174 reverted. Weird. That code runs very early at startup. I wonder how it could cause a nondeterministic failure.
It also seems to fail more often if I set GOGC=1. I tried disabling async preemption but it doesn't seem to have any effect.
Ah, interesting. I tried (somewhat arbitrarily) to re-register the exception handlers before returning _EXCEPTION_CONTINUE_SEARCH from exceptionhandler; that did not affect what happened at all (not sure what I expected, though). It almost seems like the Windows exception handling mechanism is not seeing the _EXCEPTION_CONTINUE_SEARCH return value, but something else?
It seems that if I increase the 5*8 to 5*8+7 at https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=152 (it was 4*8+7 before that CL), I can no longer reproduce the failure. Reading the code, I don't see where it clobbers anything on the stack, though. Maybe there was something right below SP at the entry of rt0_go that should not be clobbered?
Heh, there's definitely something going on there. If I make that specific change, the cgo program that started emitting stack dumps after 364174 stops emitting them again. The change on that line clearly affects the exception handlers, but I have no idea in which way or why. Could it just be some arbitrary memory corruption going on? The stack being slightly different just makes one outcome more probable than another, but in fact something else is broken?
I'm completely confused by the switching between ABIs in sigtramp, and I don't really get what ABI is used to call the Go code. Is it called using ABI0? What makes it not be called using ABIInternal? Wrappers? I might need an ABI-for-dummies lesson :)
firstcontinuetramp is called using the system ABI. firstcontinuetramp passes the PC of the ABI0 symbol of firstcontinuehandler (as there is no <ABIInternal> marker) to sigtramp, so the ABI0 function is called. That function is actually a wrapper, which calls the actual ABIInternal Go function firstcontinuehandler and returns its value back in ABI0.
I added some print statements and it doesn't seem we clobbered anything. For both a successful run and a failed run, the word right below SP at entry of rt0_go is 0, and in both cases we stored the same value there (this store: https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=155), and the first few words above that are never changed and are the same in both cases. So it is not that we clobbered something on that stack. Still weird.
Thanks for checking and explaining the ABI and the wrapper! At least now I understand sigtramp, which seems to do what's expected. This is indeed a weird one...
I also tried changing sigtramp to always return 0 (which is not always correct, but is all we need for this test) and it can still fail. So it is not the return value.
Yea, I printed out AX at the end of sigtramp and it showed the same thing. So I can't see any way we're returning the wrong value. Unfortunately...
I don't know what's up here. One thing is that the handlers that are not run are continue handlers, and the one that is run is an exception handler. They are just run in sequence according to the docs I could find, but there might be some criterion for not running the continue handlers that we hit. I couldn't find anything, and the docs are sparse. The MS docs I found basically say exception handlers run, then if the last one returns EXCEPTION_CONTINUE_SEARCH, continue handlers run. But maybe there's some other criterion, like the stack having to look a certain way (which would be kind of strange, as they allegedly get their own stack frame). Maybe someone with access to the source could shed some light? @jstarks - do you by any chance have time to look at whether there's some criterion that will skip the continue handlers?
Also, I've had problems reproducing with the latest dev version lately; as far as I can see, there's absolutely nothing checked in recently that should affect this. I will keep digging.
@cherrymui @bufflig
The root cause of the problem may be this scenario:
Ring 3 on Windows x64 has three kinds of exception handling mechanisms: VEH (RtlCallVectoredExceptionHandlers), SEH (RtlpUnwindInternal), and VCH (RtlCallVectoredContinueHandlers). Since isgoexception always returns false, the first chance to run VCH never happens, and SEH is called between VEH and the second VCH (the second call to RtlCallVectoredContinueHandlers in RtlDispatchException). When processing SEH, the function stack needs to be unwound according to the current ContextRecord RIP and RSP, but the goroutine's RSP may not be able to satisfy the Windows unwind operation, and thus nested exceptions occur (RtlpUnwindInternal raises another uncatchable exception), causing the process to crash prematurely in SEH, so VCH is never executed.
Reference: https://doxygen.reactos.org/d6/dea/sdk_2lib_2rtl_2amd64_2except_8c_source.html#l00087
BOOLEAN
NTAPI
RtlDispatchException(
    _In_ PEXCEPTION_RECORD ExceptionRecord,
    _In_ PCONTEXT ContextRecord)
{
    BOOLEAN Handled;

    /* Perform vectored exception handling for user mode */
    if (RtlCallVectoredExceptionHandlers(ExceptionRecord, ContextRecord))
    {
        /* Exception handled, now call vectored continue handlers */
        RtlCallVectoredContinueHandlers(ExceptionRecord, ContextRecord);

        /* Continue execution */
        return TRUE;
    }

    /* Call the internal unwind routine */
    Handled = RtlpUnwindInternal(NULL, // TargetFrame
                                 NULL, // TargetIp
                                 ExceptionRecord,
                                 0, // ReturnValue
                                 ContextRecord,
                                 NULL, // HistoryTable
                                 UNW_FLAG_EHANDLER);

    /* In user mode, call any registered vectored continue handlers */
    RtlCallVectoredContinueHandlers(ExceptionRecord, ContextRecord);

    return Handled;
}
I did the following test: in the exceptionhandler function, after the isgoexception call returns, I modified the RIP, RBP & RSP of the exception's context:
// ADD FOR TEST
// Read StackBase from the TIB (GS points at the TEB on windows/amd64)
TEXT runtime·getStackBase(SB),NOSPLIT,$0
	MOVQ 0x8(GS), AX
	MOVQ AX, ret+0(FP)
	RET

// ADD FOR TEST
// Read StackLimit from the TIB
TEXT runtime·getStackLimit(SB),NOSPLIT,$0
	MOVQ 0x10(GS), AX
	MOVQ AX, ret+0(FP)
	RET

...

func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
	if !isgoexception(info, r) {
		// ADD FOR TEST
		r.rip = 0
		r.rbp = getStackBase()
		r.rsp = getStackLimit()
		return _EXCEPTION_CONTINUE_SEARCH
	}
	...
}
With this change the behavior meets expectations, but this modification is obviously not the right fix for this issue, and I am still working on a repair plan...
I dumped the TIB & PEB and also the VEH & VCH lists before the call to RaiseException(EXCEPTION_ACCESS_VIOLATION, 0, 0, 0); see how stackBase/stackLimit differ greatly from the exception RSP (e.g. RSP=0xc000047f30):
C PEB: 0xbd31221000
C CrossProcessFlags: 0xc
C CrossProcessFlags: 0xc
C exceptionList: 0x0
C stackBase: 0xbd31600000
C stackLimit: 0xbd315fd000
ntdll: 0x00007FF9539B0000
RtlpAddVectoredHandler: 0x00007FF953A31680
RtlpAddVectoredHandler off1: 0x00007FF953A31690
RtlpAddVectoredHandler off2: 0x00007FF953A317A4
VCH&VEH list is: 0x00007FF953B2F3E8
ProcessCookie : 0x00000000F7F2A674
VEH First exception entry: 0x0000022F60D66800
VEH Last exception entry: 0x0000022F60D66800
VCH First continue entry: 0x0000022F60D66830
VCH Last continue entry: 0x0000022F60D66880
Dumping the entries:
VEH : 0x00007FF953B2F3F0
VEH handler = 0x0000022F60D60840 (exceptiontramp)
VCH : 0x00007FF953B2F408
VEH handler = 0x0000022F60D66860 (firstcontinuetramp)
VEH handler = 0x0000022F60D66AC0 (lastcontinuetramp)
Perhaps we can add an additional VEH handler to the current test case that always returns EXCEPTION_CONTINUE_EXECUTION (-1, as in the handler below), which triggers the first call to VCH (before RtlpUnwindInternal is called, preventing the interruption by RtlpUnwindInternal)?
testprog/syscall_windows.go:
//go:linkname stdcall2 runtime.stdcall2
func stdcall2(fn unsafe.Pointer, a0, a1 uintptr) uintptr

//go:linkname _AddVectoredExceptionHandler runtime._AddVectoredExceptionHandler
var _AddVectoredExceptionHandler unsafe.Pointer

func lastexceptiontramp()

func lastexceptiontramphandler(info uintptr, r uintptr, gp uintptr) int32 {
	return -0x1 // EXCEPTION_CONTINUE_EXECUTION
}

func RaiseException() {
	stdcall2(_AddVectoredExceptionHandler, 0, abi.FuncPCABI0(lastexceptiontramp))
	const EXCEPTION_NONCONTINUABLE = 1
	mod := syscall.MustLoadDLL("kernel32.dll")
	proc := mod.MustFindProc("RaiseException")
	proc.Call(0xbad, EXCEPTION_NONCONTINUABLE, 0, 0)
	println("RaiseException should not return")
}
testprog/syscall_windows_amd64.s:
TEXT main·lastexceptiontramp(SB),NOSPLIT|NOFRAME,$0
	PUSH_REGS_HOST_TO_ABI0()
	ADJSP $48
	MOVQ 0(CX), BX // ExceptionRecord*
	MOVQ 8(CX), CX // Context*
	MOVQ BX, 0(SP)
	MOVQ CX, 8(SP)
	MOVQ $0, 16(SP)
	CALL main·lastexceptiontramphandler(SB)
	MOVL 24(SP), AX
done:
	ADJSP $-48
	POP_REGS_HOST_TO_ABI0()
	RET
Ping: this is marked as a release blocker. Any news?
@hanchaoqun, you had some suggestions on a possible fix — do you have the bandwidth to mail a CL for review?
This has still been reproducing at a fairly high rate on the windows-amd64-2012 builder specifically; curiously, it has not reproduced on other Windows builders.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException' --since=2021-11-24
2021-12-02T15:48:58-28ec0fd/windows-amd64-2012
2021-12-02T05:25:23-00dbcb3/windows-amd64-2012
2021-12-01T17:13:34-08ecdf7/windows-amd64-2012
2021-11-29T22:57:42-3ca57c7/windows-amd64-2012
2021-11-29T22:02:50-bc32dd1/windows-amd64-2012
2021-11-24T20:53:54-a3b8f62/windows-amd64-2012
As far as I understand it, this is not really a new problem. It would just have "happened" to work before and now "happens" not to work, so even though I think it's something that should be fixed, I'm not sure it's a release blocker.
@hanchaoqun - do you agree? Is it a correct assumption that this is not really a new problem?
even though I think it's something that should be fixed, I'm not sure it's a release blocker.
I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky
on the windows-amd64-2012
builder to reduce noise on the dashboard in the meantime?
Better still, if the test only happened to work before, is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)
even though I think it's something that should be fixed, I'm not sure it's a release blocker.
I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky on the windows-amd64-2012 builder to reduce noise on the dashboard in the meantime?
Sounds reasonable.
Better still, if the test only happened to work before, is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)
I can't come up with a way to increase the probability either, unfortunately. Maybe someone else has some idea?
Change https://golang.org/cl/378254 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012
@bufflig, @bcmills, yes, I agree this is not a new problem. I think it may be caused by the fact that this specific version of Windows is not friendly enough to the Go stack. So I think it is not stable to report the C exception in lastcontinuehandler; maybe we should handle the exception and print the stack immediately when we find the C exception in exceptionhandler, no longer relying on VCH. Do you agree? This problem can be stably reproduced by manually adjusting the stack (position of RIP and RSP):
func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
	if !isgoexception(info, r) {
		sp := unsafe.Pointer(uintptr(r.rsp))
		for {
			vsp := uint64(*((*uintptr)(sp)))
			if vsp < getStackBase() && vsp >= getStackLimit() {
				r.rip = *((*uint64)(unsafe.Pointer(uintptr(sp) - 8)))
				r.rsp = *((*uint64)(sp))
			}
			delta := uintptr(sys.StackAlign)
			sp = add(sp, delta)
			if uint64(uintptr(sp)) > uint64(uintptr(add(unsafe.Pointer(r.sp()), 8*50))) ||
				uint64(uintptr(sp)) >= getStackBase() {
				sp = add(sp, -delta)
				break
			}
		}
		return _EXCEPTION_CONTINUE_SEARCH
	}
	...
}
D:\gobuild\golang\go1.18\src\runtime>go test -run=^TestRaiseException$ -v
=== RUN TestRaiseException
crash_test.go:124: running go build -o C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe
testenv.go:361: [C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe RaiseException] exit status: exit status 2989
syscall_windows_test.go:636: No stack trace:
--- FAIL: TestRaiseException (0.94s)
FAIL
exit status 1
FAIL runtime 0.991s
@hanchaoqun Yea, I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exception handlers and continue handlers, but I assume there's a reason. Does anyone have any knowledge of the history behind that?
@bcmills - I did add skipping the test case as flaky, should we remove the release blocker tag?
@bufflig
... I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exceptionhandlers and continuehandlers, but I assume there's a reason. Anyone has any knowledge of the history behind that?
Do you mean printing the stack trace and exiting in the exceptionhandler function? I don't see how that is possible.
exceptionhandler receives ALL process exceptions. Some of these exceptions are raised and handled in code external to Go.
For example, a DLL raises some exception, and then the DLL's exception handling code handles this exception and stops passing it further up the exception chain. Currently exceptions like that are ignored by the Go runtime. Do you propose we crash the Go process instead?
I hope that explains it well enough.
Alex
Ah, right - I see how my idea/thought would not work. Thanks!
Change https://go.dev/cl/444616 mentions this issue: runtime: skip TestVectoredHandlerExceptionInNonGoThread
Change https://go.dev/cl/451294 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*
Change https://go.dev/cl/452436 mentions this issue: testing: skip TestVectoredHandlerExceptionInNonGoThread on windows-amd64-2012-*
FYI, this looks like it is happening on the 1.19 release branch as well. Should we back-port 452436?
Yes, absolutely, we won't be able to cut a release otherwise.
@gopherbot please consider for 1.19 backport
Backport issue(s) opened: #56983 (for 1.19).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/454035 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*
OK, backport CL is ready to go.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException'
2021-11-18T02:19:50-f1cc529/windows-amd64-2012
The test dates from 2014. I can find only one such failure in the logs, and it is very recent; marking as release-blocker until we can determine whether this is a regression (CC @bufflig).