Open bcmills opened 2 years ago
One more this week. Looking like a regression of some sort.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException' --since=2021-11-19
Hmmm, this might have been flaky for a while. I didn't realize it then, as I thought it was related to the crashdump change and it "went away" after a sync, but it was probably a flakiness I saw. My guess is that the recent regression is caused by changed timing due to the scavenger changes, but that's just a guess. The flakiness should be fixed anyway, I'll put it on the list.
crash_test.go:106: C:\Users\gopher\AppData\Local\Temp\1\go-build1560626842\testprog.exe RaiseException exit status: exit status 2989
syscall_windows_test.go:636: No stack trace:
I spent some time staring at this. 2989 is 0xbad, the value the test passes to RaiseException.
I also ran runtime.TestRaiseException and it would not fail for me. I also added t.Logf in the test to print the stack trace, and I can see that when the test succeeds, the child process always returns 2. The Go runtime calls runtime.exit(2) from our Windows exception handler. I suspect that in this failure the runtime.exit(2) call was not reached. Either the runtime crashed in the exception handler, or Windows did not call our exception handler for some reason.
Alex.
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too. When I tested, an exception I caused "directly", e.g. by reading a protected memory page, gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too.
I don't understand what you are saying. runtime.isgoexception should always return false in our scenario, because the function only returns true if the exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO, or one of the other constants in that case statement. But we pass the value 0xbad to RaiseException, and 0xbad does not match any of the constants in the case.
When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I guess your testing confirms that we do have a new problem here. How do I reproduce your "flaky" testing?
Thank you.
Alex
I was suspecting that something was spooky here. The exception technically comes from non-Go code, so I was wondering if that's simply the reason we don't get a stack trace and, for some reason I don't understand, the test for non-Go code did not come to the same conclusion before? But I could completely misunderstand it too.
I don't understand what you are saying. runtime.isgoexception should always return false in our scenario, because the function only returns true if the exception code is _EXCEPTION_ACCESS_VIOLATION, _EXCEPTION_INT_DIVIDE_BY_ZERO, or one of the other constants in that case statement. But we pass the value 0xbad to RaiseException, and 0xbad does not match any of the constants in the case.
You're totally right and I'm clearly wrong. I think you didn't understand what I was saying because what I was saying made no sense :) I was specifically throwing EXCEPTION_ACCESS_VIOLATION in my test program, which is why my theory was a little less out there back then...
When I tested, an exception I caused "directly" by e.g. reading a protected memory page gave stack dumps, but when I raised the exception by calling RaiseException, it became "flaky" somewhere between 1.17 and 1.18.
I guess your testing confirms that we do have a new problem here. How do I reproduce your "flaky" testing?
I just used a small program calling RaiseException through cgo:
package main

// #include <windows.h>
// #include <stdio.h>
// #include <stdlib.h>
// #include <errhandlingapi.h>
//
// static void raise() {
//     RaiseException(EXCEPTION_ACCESS_VIOLATION, 0, 0, 0);
// }
import "C"

import (
	"time"
)

func main() {
	time.Sleep(time.Second)
	C.raise()
}
It became much less likely to succeed in dumping a stack trace during 1.18 development, but as you can see, it already needed a sleep to work to begin with, so it wasn't watertight from the start.
However, after trying to reproduce now, I think it's a red herring: after @randall77's https://go-review.googlesource.com/c/go/+/364174, the program above consistently works, even without the sleep. In fact, after that change, none of my sometimes-flaky test programs trying to create exceptions in different ways ever fail to dump a stack. That makes no sense to me, looking at the change itself. And it's also the complete opposite of what shows up on the builders: there it rather started happening after that change, or a few days after.
So now I have no idea, not even a bad one.
Thank you.
Alex
Thanks!
Patrik
I just used a small program calling RaiseException through cgo:
runtime.TestRaiseException does not use cgo. I expect the cgo and non-cgo runtimes are very different from each other. If your small cgo program works fine (dumps a stack trace every time), that does not prove anything.
Alex
Yea, probably not. I'll see if I can make it happen on a gomote with the test case itself, to see if there's any way to reproduce.
I managed to get some repros with some logging. Not conclusive in any way, but, as you mostly had concluded already, it does reach exceptionhandler (in signal_windows.go), which does return _EXCEPTION_CONTINUE_SEARCH, but then neither firstcontinuehandler nor lastcontinuehandler is called. As far as I can see they are not crashing; they are not called at all.
I don't know if it's really relevant, but it appears the change I mentioned above, 364174, also makes this error much more likely. I was never able to reproduce it before that change, but by spamming the gomotes, I can reproduce within a couple of minutes after that change. I still have no clue why that change would affect exception handling, of course. It might just change the timing or something.
It seems that in a successful run, exceptionhandler is called first, returning as isgoexception is false, then firstcontinuehandler is called, then lastcontinuehandler is called, which prints the stack trace. In a failed run, though, exceptionhandler is called and returns as usual, but firstcontinuehandler and lastcontinuehandler are not called. Instead, the process exits with error code 2989 (0xbad, instead of the usual 2). I don't know why the continue handlers are sometimes not called.
Yes. I don't know if the trampoline functions in asm are called, though. There is some code there beyond my comprehension in sys_windows_amd64:sigtramp (no surprise there :)), also using TLS, which was very vaguely related to change 364174. It does do some stuff to make sure it's running on the g0 stack; I don't know if that could be where the flakiness comes from?
I think sigtramp is also not called (for firstcontinuehandler and lastcontinuehandler). I added some code to print in sigtramp and I didn't see it print.
Yeah, I also cannot reproduce before CL 364174, or at tip with just CL 364174 reverted. Weird. That code runs very early at startup. I wonder how it could cause a nondeterministic failure.
It also seems to fail more often if I set GOGC=1. I tried disabling async preemption but it doesn't seem to have any effect.
Ah, interesting. I tried (somewhat arbitrarily) to re-register the exception handlers before returning _EXCEPTION_CONTINUE_SEARCH from exceptionhandler; that did not affect what happened at all (not sure what I expected, though). It almost seems like the Windows exception handling mechanism is not seeing the _EXCEPTION_CONTINUE_SEARCH return value, but something else?
It seems that if I increase the 5*8 to 5*8+7 at https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=152 (it was 4*8+7 before that CL), I can no longer reproduce the failure. Reading the code, I don't see where it clobbers anything on the stack, though. Maybe there was something right below SP at the entry of rt0_go that should not be clobbered?
Heh, there's definitely something going on there. If I make that specific change, the cgo program that started emitting stack dumps after 364174 stops emitting them again. The change on that line clearly affects the exception handlers, but I have no idea in which way or why. Could it just be some arbitrary memory corruption going on? The stack being slightly different just makes one outcome more probable than another, but in fact something else is broken?
I'm completely confused by the switching between ABIs in sigtramp, and I don't really get what ABI is used to call the Go code. Is it called using ABI0? What makes it not be called using ABIInternal? Wrappers? I might need an ABI-for-dummies lesson :)
firstcontinuetramp is called using the system ABI. firstcontinuetramp passes the PC of the ABI0 symbol of firstcontinuehandler (as there is no <ABIInternal> marker) to sigtramp, so the ABI0 function is called. That function is actually a wrapper, which calls the actual ABIInternal Go function firstcontinuehandler and returns its value back in ABI0.
I added some print statements and it doesn't seem we clobbered anything. For both a successful run and a failed run, the word right below SP at entry of rt0_go is 0, and in both cases we stored the same value there (this store: https://cs.opensource.google/go/go/+/master:src/runtime/asm_amd64.s;l=155), and the first few words above that are never changed and are the same in both cases. So it is not that we clobbered something on that stack. Still weird.
Thanks for checking and explaining the ABI and the wrapper! At least now I understand sigtramp, which seems to do what's expected. This is indeed a weird one...
I also tried changing sigtramp to always return 0 (which is not always correct, but is all we need for this test) and it can still fail. So it is not the return value.
Yea, I printed out AX at the end of sigtramp and it showed the same thing. So I can't see any way we're returning the wrong value. Unfortunately...
I don't know what's up here. One thing is that the handlers that are not run are continue handlers, and the one that is run is an exception handler. They are just run in sequence according to the docs I could find, but there might be some criterion for not running the continue handlers that we hit. I couldn't find anything, and the docs are sparse. The MS docs I found basically say exception handlers run, then if the last one returns EXCEPTION_CONTINUE_SEARCH, continue handlers run. But maybe there's some other criterion, like the stack having to look a certain way (which would be kind of strange, as they allegedly get their own stack frame). Maybe someone with access to the source could shed some light? @jstarks - do you by any chance have time to look at whether there's some criterion that will skip the continue handlers?
Also, I've had problems reproducing with the latest dev version lately; as far as I can see, there's absolutely nothing checked in recently that should affect this. I will keep digging.
@cherrymui @bufflig
The root cause of the problem may be this scenario:
Ring 3 on Windows x64 has three kinds of exception handling mechanisms: VEH (RtlCallVectoredExceptionHandlers), SEH (RtlpUnwindInternal), and VCH (RtlCallVectoredContinueHandlers). Since isgoexception always returns false, the first chance to run VCH never happens, and SEH is called between VEH and the second VCH (the second call to RtlCallVectoredContinueHandlers in RtlDispatchException). When processing SEH, the function stack needs to be unwound according to the current ContextRecord RIP and RSP, but the goroutine's RSP may not be able to satisfy the Windows unwind operation, and thus nested exceptions occur (RtlpUnwindInternal raises another uncatchable exception), causing the process to crash prematurely in SEH, so VCH is never executed.
Reference: https://doxygen.reactos.org/d6/dea/sdk_2lib_2rtl_2amd64_2except_8c_source.html#l00087
BOOLEAN
NTAPI
RtlDispatchException(
    _In_ PEXCEPTION_RECORD ExceptionRecord,
    _In_ PCONTEXT ContextRecord)
{
    BOOLEAN Handled;

    /* Perform vectored exception handling for user mode */
    if (RtlCallVectoredExceptionHandlers(ExceptionRecord, ContextRecord))
    {
        /* Exception handled, now call vectored continue handlers */
        RtlCallVectoredContinueHandlers(ExceptionRecord, ContextRecord);

        /* Continue execution */
        return TRUE;
    }

    /* Call the internal unwind routine */
    Handled = RtlpUnwindInternal(NULL, // TargetFrame
                                 NULL, // TargetIp
                                 ExceptionRecord,
                                 0, // ReturnValue
                                 ContextRecord,
                                 NULL, // HistoryTable
                                 UNW_FLAG_EHANDLER);

    /* In user mode, call any registered vectored continue handlers */
    RtlCallVectoredContinueHandlers(ExceptionRecord, ContextRecord);

    return Handled;
}
I did the following test: in the exceptionhandler function, after the isgoexception call returns, I modified the RIP, RBP & RSP of the exception's context:
// ADD FOR TEST
// Read StackBase from the TIB (GS points at the TEB on windows/amd64)
TEXT runtime·getStackBase(SB),NOSPLIT,$0
	MOVQ 0x8(GS), AX
	MOVQ AX, ret+0(FP)
	RET

// ADD FOR TEST
// Read StackLimit from the TIB
TEXT runtime·getStackLimit(SB),NOSPLIT,$0
	MOVQ 0x10(GS), AX
	MOVQ AX, ret+0(FP)
	RET

...

func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
	if !isgoexception(info, r) {
		// ADD FOR TEST
		r.rip = 0
		r.rbp = getStackBase()
		r.rsp = getStackLimit()
		return _EXCEPTION_CONTINUE_SEARCH
	}
	...
}
With this change the behavior meets expectations, but this modification is obviously not the right fix for this issue, and I am still working on a repair plan...
I dumped the TIB & PEB and also the VEH & VCH lists before the call to RaiseException(EXCEPTION_ACCESS_VIOLATION, 0, 0, 0); see how stackBase/stackLimit differ greatly from the exception RSP (e.g. RSP=0xc000047f30):
C PEB: 0xbd31221000
C CrossProcessFlags: 0xc
C CrossProcessFlags: 0xc
C exceptionList: 0x0
C stackBase: 0xbd31600000
C stackLimit: 0xbd315fd000
ntdll: 0x00007FF9539B0000
RtlpAddVectoredHandler: 0x00007FF953A31680
RtlpAddVectoredHandler off1: 0x00007FF953A31690
RtlpAddVectoredHandler off2: 0x00007FF953A317A4
VCH&VEH list is: 0x00007FF953B2F3E8
ProcessCookie : 0x00000000F7F2A674
VEH First exception entry: 0x0000022F60D66800
VEH Last exception entry: 0x0000022F60D66800
VCH First continue entry: 0x0000022F60D66830
VCH Last continue entry: 0x0000022F60D66880
Dumping the entries:
VEH : 0x00007FF953B2F3F0
VEH handler = 0x0000022F60D60840 (exceptiontramp)
VCH : 0x00007FF953B2F408
VEH handler = 0x0000022F60D66860 (firstcontinuetramp)
VEH handler = 0x0000022F60D66AC0 (lastcontinuetramp)
Perhaps we can add an additional VEH handler to the current test case that always returns EXCEPTION_CONTINUE_EXECUTION (-1, as in the handler below), which triggers the first call to VCH (before RtlpUnwindInternal is called, preventing the interruption by RtlpUnwindInternal)?
testprog/syscall_windows.go:
//go:linkname stdcall2 runtime.stdcall2
func stdcall2(fn unsafe.Pointer, a0, a1 uintptr) uintptr

//go:linkname _AddVectoredExceptionHandler runtime._AddVectoredExceptionHandler
var _AddVectoredExceptionHandler unsafe.Pointer

func lastexceptiontramp()

func lastexceptiontramphandler(info uintptr, r uintptr, gp uintptr) int32 {
	return -0x1 // EXCEPTION_CONTINUE_EXECUTION
}

func RaiseException() {
	stdcall2(_AddVectoredExceptionHandler, 0, abi.FuncPCABI0(lastexceptiontramp))
	const EXCEPTION_NONCONTINUABLE = 1
	mod := syscall.MustLoadDLL("kernel32.dll")
	proc := mod.MustFindProc("RaiseException")
	proc.Call(0xbad, EXCEPTION_NONCONTINUABLE, 0, 0)
	println("RaiseException should not return")
}
testprog/syscall_windows_amd64.s:
TEXT main·lastexceptiontramp(SB),NOSPLIT|NOFRAME,$0
	PUSH_REGS_HOST_TO_ABI0()
	ADJSP $48
	MOVQ 0(CX), BX // ExceptionRecord*
	MOVQ 8(CX), CX // Context*
	MOVQ BX, 0(SP)
	MOVQ CX, 8(SP)
	MOVQ $0, 16(SP)
	CALL main·lastexceptiontramphandler(SB)
	MOVL 24(SP), AX
done:
	ADJSP $-48
	POP_REGS_HOST_TO_ABI0()
	RET
Ping: this is marked as a release blocker. Any news?
@hanchaoqun, you had some suggestions on a possible fix — do you have the bandwidth to mail a CL for review?
This has still been reproducing at a fairly high rate on the windows-amd64-2012 builder specifically; curiously, it has not reproduced on other Windows builders.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException' --since=2021-11-24
2021-12-02T15:48:58-28ec0fd/windows-amd64-2012
2021-12-02T05:25:23-00dbcb3/windows-amd64-2012
2021-12-01T17:13:34-08ecdf7/windows-amd64-2012
2021-11-29T22:57:42-3ca57c7/windows-amd64-2012
2021-11-29T22:02:50-bc32dd1/windows-amd64-2012
2021-11-24T20:53:54-a3b8f62/windows-amd64-2012
As far as I understand it, this is not really a new problem. It would just have "happened" to work before and now "happens" not to work, so even though I think it's something that should be fixed, I'm not sure it's a release blocker.
@hanchaoqun - do you agree? Is it a correct assumption that this is not really a new problem?
even though I think it's something that should be fixed, I'm not sure it's a release blocker.
I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky
on the windows-amd64-2012
builder to reduce noise on the dashboard in the meantime?
Better still, if the test only happened to work before, is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)
even though I think it's something that should be fixed, I'm not sure it's a release blocker.
I think that's reasonable, but if we go that route should we add a testenv.SkipFlaky on the windows-amd64-2012 builder to reduce noise on the dashboard in the meantime?
Sounds reasonable.
Better still, if the test only happened to work before, is there something we can do to increase the failure rate on the other builders? (IMO a skipped test with a high failure rate is more valuable than a flaky one with a low failure rate.)
I can't come up with a way to increase the probability either, unfortunately. Maybe someone else has some idea?
Change https://golang.org/cl/378254 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012
@bufflig, @bcmills, yes, I agree this is not a new problem. I think it may be caused by the fact that this specific version of Windows is not friendly enough to the Go stack. So I think it is not stable to report the C exception in lastcontinuehandler; maybe we should handle the exception and print the stack immediately when we find the C exception in exceptionhandler, no longer relying on VCH. Do you agree? This problem can be stably reproduced by manually adjusting the stack (position of RIP and RSP):
func exceptionhandler(info *exceptionrecord, r *context, gp *g) int32 {
	if !isgoexception(info, r) {
		sp := unsafe.Pointer(uintptr(r.rsp))
		for {
			vsp := uint64(*((*uintptr)(sp)))
			if vsp < getStackBase() && vsp >= getStackLimit() {
				r.rip = *((*uint64)(unsafe.Pointer(uintptr(sp) - 8)))
				r.rsp = *((*uint64)(sp))
			}
			delta := uintptr(sys.StackAlign)
			sp = add(sp, delta)
			if uint64(uintptr(sp)) > uint64(uintptr(add(unsafe.Pointer(r.sp()), 8*50))) ||
				uint64(uintptr(sp)) >= getStackBase() {
				sp = add(sp, -delta)
				break
			}
		}
		return _EXCEPTION_CONTINUE_SEARCH
	}
	...
}
D:\gobuild\golang\go1.18\src\runtime>go test -run=^TestRaiseException$ -v
=== RUN TestRaiseException
crash_test.go:124: running go build -o C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe
testenv.go:361: [C:\Users\han\AppData\Local\Temp\go-build1862937472\testprog.exe RaiseException] exit status: exit status 2989
syscall_windows_test.go:636: No stack trace:
--- FAIL: TestRaiseException (0.94s)
FAIL
exit status 1
FAIL runtime 0.991s
@hanchaoqun Yea, I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exception handlers and continue handlers, but I assume there's a reason. Does anyone have any knowledge of the history behind that?
@bcmills - I did add skipping the test case as flaky, should we remove the release blocker tag?
@bufflig
... I was wondering if dumping the stack trace directly when !isgoexception was possible. I don't really know the reason we're using both exceptionhandlers and continuehandlers, but I assume there's a reason. Anyone has any knowledge of the history behind that?
Do you mean printing the stack trace and exiting in the exceptionhandler function? I don't see how that is possible.
exceptionhandler receives ALL process exceptions. Some of these exceptions are raised and handled in code external to Go.
For example, a DLL raises some exception, and then the DLL's exception handling code handles this exception and stops passing it further up the exception chain. Currently exceptions like that are ignored by the Go runtime. Do you propose we crash the Go process instead?
I hope that explains it well enough.
Alex
Ah, right - I see how my idea/thought would not work. Thanks!
Change https://go.dev/cl/444616 mentions this issue: runtime: skip TestVectoredHandlerExceptionInNonGoThread
Change https://go.dev/cl/451294 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*
Change https://go.dev/cl/452436 mentions this issue: testing: skip TestVectoredHandlerExceptionInNonGoThread on windows-amd64-2012-*
FYI, this looks like it is happening on the 1.19 release branch as well. Should we back-port 452436?
Yes, absolutely, we won't be able to cut a release otherwise.
@gopherbot please consider for 1.19 backport
Backport issue(s) opened: #56983 (for 1.19).
Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.
Change https://go.dev/cl/454035 mentions this issue: testing: skip flaky TestRaiseException on windows-amd64-2012-*
OK, backport CL is ready to go.
greplogs --dashboard -md -l -e 'FAIL: TestRaiseException'
2021-11-18T02:19:50-f1cc529/windows-amd64-2012
The test dates from 2014. I can find only one such failure in the logs, and it is very recent; marking as release-blocker until we can determine whether this is a regression (CC @bufflig).