kevoreilly / capemon

capemon: CAPE's monitor
GNU General Public License v3.0
97 stars 46 forks source link

Infinite recursion from GetThreadID in Debugger #49

Closed michaelweiser closed 1 year ago

michaelweiser commented 1 year ago

Hi,

I'm seeing stack overflow exceptions on Windows 10 with even the simplest program doing a single API call:

#include <stdio.h>
#include <Windows.h>

int main()
{
    Sleep(10000);
    DWORD threadid = GetCurrentThreadId();
}

call-stack

Unfortunately, I was not able to grab any meaningful backtrace beyond it happening in enter_hook() and operate_on_backtrace(). Through single stepping the code I think to have found the root cause but can only describe it verbally with links into the capemon source code:

This in the context of __called_by_hook() means that enter_hook() was not triggered from another hook. This essentially creates potential for unwanted hook recursion whenever the SRW lock is held or that execution exception occurs during stack unwinding.

This seems to quite reliably be triggered and turned into infinite recursion by the Debugger:

This causes instantaneous inifinite hook recursion on any hooked API call (at the very least if the SRW lock is held), leading to the observed stack overflow.

To recap, the call chain is:

/any API call/ -> [recurse: enter_hook() + __called_by_hook() == 0 -> api_dispatch() -> InitNewThreadBreakpoints() -> CreateThreadBreakpoints() ->GetThreadId() -> NtQueryInformationThread()]

My workaround looks like this:

diff --git a/hooking.c b/hooking.c
index 443ae50..d1af8ef 100644
--- a/hooking.c
+++ b/hooking.c
@@ -178,7 +178,14 @@ int addr_in_our_dll_range(void *unused, ULONG_PTR addr)

 static int __called_by_hook(ULONG_PTR stack_pointer, ULONG_PTR frame_pointer)
 {
-       return operate_on_backtrace(stack_pointer, frame_pointer, NULL, addr_in_our_dll_range);
+       int rc = operate_on_backtrace(stack_pointer, frame_pointer, NULL, addr_in_our_dll_range);
+       if (rc < 0) {
+               // be cautious if we couldn't operate on the backtrace at all. This can be
+               // due to SRW lock being held or exceptions when trying to evaluate the backtrace.
+               return 1;
+       }
+
+       return rc;
 }

 int called_by_hook(void)
diff --git a/hooking_64.c b/hooking_64.c
index 153fba5..04d3230 100644
--- a/hooking_64.c
+++ b/hooking_64.c
@@ -1112,7 +1112,7 @@ BOOL srw_lock_held()
        return FALSE;
 }

-static unsigned int our_stackwalk(ULONG_PTR _rip, ULONG_PTR sp, PVOID *backtrace, unsigned int count)
+static int our_stackwalk(ULONG_PTR _rip, ULONG_PTR sp, PVOID *backtrace, unsigned int count)
 {
        /* derived from http://www.nynaeve.net/Code/StackWalk64.cpp */
        __declspec(align(64)) CONTEXT ctx;
@@ -1124,7 +1124,7 @@ static unsigned int our_stackwalk(ULONG_PTR _rip, ULONG_PTR sp, PVOID *backtrace
        unsigned int frame;

        if (srw_lock_held())
-               return 0;
+               return -1;

        __try
        {
@@ -1149,17 +1149,17 @@ static unsigned int our_stackwalk(ULONG_PTR _rip, ULONG_PTR sp, PVOID *backtrace
        }
        __except(EXCEPTION_EXECUTE_HANDLER)
        {
-               return 0;
+               return -1;
        }
 }

 int operate_on_backtrace(ULONG_PTR sp, ULONG_PTR _rip, void *extra, int(*func)(void *, ULONG_PTR))
 {
-       int ret = 0;
+       int ret = -1;
        PVOID backtrace[HOOK_BACKTRACE_DEPTH];
        lasterror_t lasterror;
-       WORD frames;
-       WORD i;
+       int frames;
+       int i;

        get_lasterrors(&lasterror);

What this does is make our_stackwalk() indicate the inability to walk the stack at all by returning -1. This will still make operate_on_backtrace() not call addr_in_our_dll_range() but the changed return code default of -1 will again indicate that fact to the caller. The only caller evaluating the return code at all is __called_by_hook(). There we now cautiously return 1, meaning "yes, we've been or at least could have been called from a hook". This successfully prevents the infinite recursion and subsequent stack overflow in my tests.

Does any of that make sense?

kevoreilly commented 1 year ago

Yes! It makes perfect sense. I now can't believe I didn't see this before. Thanks for shedding light - again, two issues here.

Looks like the underlying issue has been there for a long time - both from when I added the srw_lock_held() function and before this (a long time ago) when I added the __try() __except() by returning 0 and not considering this possibility. Not sure why it hasn't come up before, and disappointed in myself for not seeing that returning 0 has the potential for stack recursion.

While I think the CreateThreadBreakpoints() function could be improved by potentially avoiding always calling GetThreadId(), I guess the real point is that it is just triggering this underlying bug in enter_hook(), essentially bringing it to the fore where previously it lay hidden.

Your workaround could in fact be called a fix. I will implement it (or a variant thereof) and get it tested and published asap. I may try and improve CreateThreadBreakpoints() while I'm there.

A massive thanks for this work.

kevoreilly commented 1 year ago

Fix now pushed - thank you again!