gctest fails sometimes (List reversal produced incorrect list) with Mingw 32-bit

ivmai commented 5 years ago

Reported by @Hamayama in #260. OS: Windows 8.1 (64bit) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project)) Source: any (reproduced even on v7.4.16) How to reproduce: ./autogen.sh && ./configure --enable-threads=win32 --enable-large-config --disable-gcj-support && CPPFLAGS="-DDONT_ADD_BYTE_AT_END" make check COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done Output: TOTAL COUNT: 200 ERROR COUNT: 4 Reported error (one of):

'List reversal produced incorrect list - collector is broken'
'Wrong finalization data - collector is broken'
'Clobbered a leaf - collector is broken'
'Lost a node at level 3 - collector is broken'
'Lost a node at level 9 - collector is broken'
'Lost a node at level 11 - collector is broken'
'Lost a node at level 13 - collector is broken'

ivmai commented 5 years ago

Not reproduced on Linux (32 and 64 bit)

Hamayama commented 5 years ago

Uhm, I made a local branch and tried to reproduce on AppVeyor.

GC: gc master 8076a0c (latest master) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project))

But, I couldn't reproduce on AppVeyor.

https://github.com/Hamayama/bdwgc/tree/mingw32test

https://github.com/Hamayama/bdwgc/commit/e4e494e9ee35852466b8b6738a5c4d24fdf7282e

https://ci.appveyor.com/project/Hamayama/bdwgc/builds/21721371

-- I could reproduce on my pc only (Windows 8.1 (64bit)) .

TOTAL COUNT: 200 ERROR COUNT: 3 List reversal produced incorrect list - collector is broken

There might be some condition on my pc ...

ivmai commented 5 years ago

I have reproduced in on Win64 with i686-w64-mingw32: gcc.exe -I include -I libatomic_ops/src -O3 -Wall -Wextra -Wpedantic -D GC_THREADS -D THREAD_LOCAL_ALLOC -D PARALLEL_MARK -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D USE_MUNMAP -D NO_EXECUTE_PERMISSION -D GC_ATOMIC_UNCOLLECTABLE -D ENABLE_DISCLAIM -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c

TOTAL COUNT: 800 ERROR COUNT: 2 List reversal produced incorrect list - collector is broken

Most probably it could be reproduced with less number of -D options, I have not tried.

ivmai commented 5 years ago

Not easy to debug because it's observed rarely. I wonder if it could be reproduced without -D THREAD_LOCAL_ALLOC -D PARALLEL_MARK (to simplify debugging). Ideally would be good to reproduce it w/o -D GC_THREADS. Based on your report in #260, it could be reproduced without -D USE_MUNMAP. And I think -D NO_EXECUTE_PERMISSION -D GC_ATOMIC_UNCOLLECTABLE -D ENABLE_DISCLAIM do not influence too.

@Hamayama, it seems you have better luck, could you please try to reproduce it on the following configs (using master, libatomic_ops is not needed):

gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D ALL_INTERIOR_POINTERS tests/test.c extra/gc.c
gcc -g -O0 -I include -D GC_ASSERTIONS -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D ALL_INTERIOR_POINTERS tests/test.c extra/gc.c

If you succeed, please try also with removed -D DONT_ADD_BYTE_AT_END and -D ALL_INTERIOR_POINTERS (one of them and both).

Thank you.

Hamayama commented 5 years ago

OK. I will try them.

Hamayama commented 5 years ago

I tested gc master 8076a0c (latest master) .

OS: Windows 8.1 (64bit) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project))

1. gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 200 ERROR COUNT: 28

2. gcc -g -O0 -I include -D GC_ASSERTIONS -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

1. + remove DONT_ADD_BYTE_AT_END gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 200 ERROR COUNT: 13

1. + remove ALL_INTERIOR_POINTERS gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 200 ERROR COUNT: 19

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 200 ERROR COUNT: 27

2. + remove DONT_ADD_BYTE_AT_END gcc -g -O0 -I include -D GC_ASSERTIONS -D LARGE_CONFIG -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

2. + remove ALL_INTERIOR_POINTERS gcc -g -O0 -I include -D GC_ASSERTIONS -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

2. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS gcc -g -O0 -I include -D GC_ASSERTIONS -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

Hamayama commented 5 years ago

I found that gctest becomes no error if GC_WIN32_PTHREADS is added.

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + add GC_WIN32_PTHREADS gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_WIN32_PTHREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

~~It might be a hint for this issue because there are not so many GC_WIN32_PTHREADS in the source code.~~ (There are many GC_PTHREADS in the source code. still difficult ... )

ivmai commented 5 years ago

Good. Please test also: gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D NO_WRAP_MARK_SOME -o gctest.exe tests/test.c extra/gc.c

Hamayama commented 5 years ago

I tested gc master 8076a0c (latest master) .

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + add NO_WRAP_MARK_SOME gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D NO_WRAP_MARK_SOME -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 200 ERROR COUNT: 30 List reversal produced incorrect list - collector is broken x 20 List too long - collector is broken x 1 Assertion failure: extra/../mark_rts.c:676 x 3 Segmentation fault x 1 Unknown x 5

Hamayama commented 5 years ago

I found the following information ...

https://social.msdn.microsoft.com/Forums/vstudio/ja-JP/aa176c36-6624-4776-9380-1c9cf37a314e/getthreadcontext-returns-stale-register-values-on-wow64?forum=windowscompatibility

ivmai commented 5 years ago

+ @zachsaw @kornelpal

ivmai commented 5 years ago

How could we workaround the race issue of SuspendThread in WOW64? Use CONTEXT_EXCEPTION_REQUEST?

Hamayama commented 5 years ago

I made a following patch. It seems to repair this issue but slow. I will test more and report ...

Also, I found CHECK_NOT_WOW64 in misc.c . It seems to be able to check Wow64 by calling IsWow64Process via GetProcAddress .

patchA

--- win32_threads_orig.c    2019-01-22 07:01:45.000000000 +0900
+++ win32_threads.c 2019-01-25 09:48:11.186068400 +0900
@@ -1201,8 +1201,31 @@
     while (SuspendThread(THREAD_HANDLE(t)) == (DWORD)-1)
       Sleep(10); /* in millis */
 # else
+repeat:
     if (SuspendThread(t -> handle) == (DWORD)-1)
       ABORT("SuspendThread failed");
+
+#  if defined(I386)
+#   define CONTEXT_EXCEPTION_ACTIVE 0x08000000
+#   define CONTEXT_SERVICE_ACTIVE 0x10000000
+#   define CONTEXT_EXCEPTION_REQUEST 0x40000000
+#   define CONTEXT_EXCEPTION_REPORTING 0x80000000
+    /* Wow64 workaround */
+    CONTEXT context;
+    context.ContextFlags = CONTEXT_EXCEPTION_REQUEST;
+    if (!GetThreadContext(THREAD_HANDLE(t), &context))
+      ABORT("GetThreadContext failed");
+    if ((context.ContextFlags & CONTEXT_EXCEPTION_REPORTING)
+        && ((context.ContextFlags & CONTEXT_EXCEPTION_ACTIVE)
+            /* || (context.ContextFlags & CONTEXT_SERVICE_ACTIVE) */
+            )) {
+      if (ResumeThread(THREAD_HANDLE(t)) == (DWORD)-1)
+        ABORT("ResumeThread failed");
+      SwitchToThread();
+      goto repeat;
+    }
+#  endif /* I386 */
+
 # endif /* !MSWINCE */
   t -> suspended = (unsigned char)TRUE;
   GC_release_dirty_lock();

kornelpal commented 5 years ago

This is the correct ContextFlags check that was suggested in the previously referenced discussion: (ContextFlags & CONTEXT_EXCEPTION_REPORTING) != 0 && (ContextFlags & (CONTEXT_EXCEPTION_ACTIVE | CONTEXT_SERVICE_ACTIVE)) == 0

In the above case the thread was executing 32-bit user mode code and GetThreadContext is reliable. All other cases however require workarounds.

Both the issue and the workaround are specific to x64. GetNativeSystemInfo could be used to identify when running on the x64 architecture (and only when compiled for the x86 architecture).

Resuming the thread might seem to be a feasible workaround, but unfortunately will not work as expected when the thread is waiting (using WaitForMultipleObjects or Sleep for example) because that happens in 64-bit kernel mode and is excluded by the above check. This might also lead to a deadlock when a thread is waiting for the GC.

Eliminating waiting threads might be possible by querying the thread state using NtQuerySystemInformation and SYSTEM_THREAD_INFORMATION or using performance counters that might improve the feasibility of the ResumeThread approach.

Unfortunately however I believe that the underlying issue can only be fixed reliably in the WOW64 code because access to the 64-bit context is required and the solution depends on the specifics of the 32-bit context saving code. While the 64-bit context could be obtained using a 64-bit helper process (or using unsupported hacks in a 32-bit process), code that makes assumptions about the 32-bit context saving code is likely to break with newer versions or updates of Windows.

On the other hand, by the time execution transitions to 64-bit mode, relevant 32-bit registers were most likely saved to the stacks because x86 does not have many registers. Based on this assumption I believe that obtaining just the correct stack pointer register might be a highly effective workaround.

To avoid unnecessary complexities, my recommended workaround is performing a full stack scan that eliminates even the need for obtaining the proper value of the stack pointer register.

The NT TIB contains StackBase and StackLimit that define the thread's stack boundaries. On x86 the FS segment of each thread points to its own TIB that can be obtained using the following code:

LDT_ENTRY Selector;
if (!GetThreadSelectorEntry(Thread, Context.SegFs, &Selector))
    return FALSE;

PNT_TIB Tib = (PNT_TIB)(Selector.BaseLow | (Selector.HighWord.Bits.BaseMid << 16) | (Selector.HighWord.Bits.BaseHi << 24));

NT_TIB is defined in winnt.h of Windows SDK. This only works on Windows NT derivatives, but this workaround is not needed on any other Windows architecture either.

Hamayama commented 5 years ago

Thank you for the very important imformation!

I understood that my above patch doesn't work when the thread is waiting.

The worth of the above patch is only that it figured out the cause of this issue ... (GetThreadContext returns stale register values on WOW64)

I tested gc master cbed8d1 (latest master) .

OS: Windows 8.1 (64bit) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project))

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + apply patchA gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

ivmai commented 5 years ago

-g -O0 -I include -D GC_ASSERTIONS

For the sake of a higher chance of catching this kind of bugs (at least you can run more test rounds in the same time), I think it is better to replace the above flags with -O3 -march=native

ivmai commented 5 years ago

To @Hamayama: how often do you observe the case the condition on ContextFlags leads to repeat? I mean how many times you see at least one repeat per thread suspend?

To @kornelpal: Do we need to call ResumeThread and SuspendThread repeatedly while the condition is not satisfied? Isn't just SwitchToThread (yield) between GetThreadContext calls sufficient?

I think we could apply a mix of the proposed approached: check the condition on ContextFlags once, if it fails (provided it typically succeeds) then scan the full stack (technically it is best to be done in GC_push_stack_for in this case). Or, alternatively, call GetThreadContext several times and fallback to the full stack scan.

Of course, we'll limit this algorithm to WOW64 after testing the prototype solution.

kornelpal commented 5 years ago

SwitchToThread yields the execution of the current thread. It is increasing the chance of the other thread gaining execution time, but it is not going to resume any suspended threads. As such, the resume-suspend-capture sequence cannot be avoided and yield makes it more likely to succeed.

Unless you are planning on implementing special support for waiting threads, I would restrict the loop to just a few iterations, because waits are considered to be unsafe by this check. I agree that the optimal number of iterations should be based on testing the actual code.

The loop should immediately fail back to the backup plan when CONTEXT_EXCEPTION_REPORTING is not set, because that means that it is not supported.

In addition to the WOW64 check, I recommend limiting this workaround to x64 only because this bug is possible because x64 has hardware x86 support and is preserving registers when switching from 32-bit compatibility mode to native 64-bit mode that is very unlikely to happen even on a future CPU architecture.

Hamayama commented 5 years ago

I made a following patch and measured counts.

patchB

--- win32_threads_orig.c    2019-01-22 07:01:45.000000000 +0900
+++ win32_threads.c 2019-01-26 12:41:48.551946600 +0900
@@ -1169,6 +1169,11 @@
 # endif
 }

+int total_call = 0;
+int loss_occur = 0;
+int repeat_count = 0;
+int repeat_sum = 0;
+
 /* Suspend the given thread, if it's still active.      */
 STATIC void GC_suspend(GC_thread t)
 {
@@ -1201,8 +1206,40 @@
     while (SuspendThread(THREAD_HANDLE(t)) == (DWORD)-1)
       Sleep(10); /* in millis */
 # else
+repeat:
     if (SuspendThread(t -> handle) == (DWORD)-1)
       ABORT("SuspendThread failed");
+
+#  if defined(I386)
+#   define CONTEXT_EXCEPTION_ACTIVE 0x08000000
+#   define CONTEXT_SERVICE_ACTIVE 0x10000000
+#   define CONTEXT_EXCEPTION_REQUEST 0x40000000
+#   define CONTEXT_EXCEPTION_REPORTING 0x80000000
+    /* Wow64 workaround */
+    CONTEXT context;
+    context.ContextFlags = CONTEXT_EXCEPTION_REQUEST;
+    if (!GetThreadContext(THREAD_HANDLE(t), &context))
+      ABORT("GetThreadContext failed");
+    if ((context.ContextFlags & CONTEXT_EXCEPTION_REPORTING)
+        && (context.ContextFlags & (CONTEXT_EXCEPTION_ACTIVE
+                                    /* | CONTEXT_SERVICE_ACTIVE */
+                                    ))) {
+      if (ResumeThread(THREAD_HANDLE(t)) == (DWORD)-1)
+        ABORT("ResumeThread failed");
+      SwitchToThread();
+      //Sleep(0);
+      //Sleep(1);
+      repeat_count++;
+      goto repeat;
+    }
+    total_call++;
+    if (repeat_count > 0) loss_occur++;
+    repeat_sum += repeat_count;
+    fprintf(stderr, "%d %d %d %d\n", total_call, loss_occur, repeat_sum, repeat_count);
+    repeat_count = 0;
+    //fflush(stderr);
+#  endif /* I386 */
+
 # endif /* !MSWINCE */
   t -> suspended = (unsigned char)TRUE;
   GC_release_dirty_lock();

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + apply patchB gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c ./gctest.exe

result:

1 0 0 0
2 0 0 0
   :
1094 4 2164 0

I repeated to run ./gctest.exe 10 times and result is as follows (last line only) .

1245 2 222 0
1125 0 0 0
1049 1 218 0
1332 3 2933 0
1155 2 1601 0
1150 2 1647 0
1123 3 1150 0
1088 2 511 0
1452 3 1474 0
1183 5 2607 0

So, total call of GC_suspend is about 1000 times and loss occurs about 0-5 times. But, one loss needs about 100-1000 repeats of ResumeThread+SuspendThread.

If I omitted ResumeThread+SuspendThread, GetThreadContext returns the same ContextFlags and gctest goes to infinite loop.

Also, If I didn't comment out CONTEXT_SERVICE_ACTIVE, gctest goes to infinite loop. (This result is different from MS forum's information. I don't know why ...)

If I commented out SwitchToThread, result is as follows.

1332 1 36 0
1151 0 0 0
1278 0 0 0
1162 1 1129 0
1369 3 16052 0
1373 3 13048 0
1140 1 5813 0
1281 3 11063 0
1123 2 10545 0
1288 3 6853 0

If I changed SwitchToThread to Sleep(0), result is as follows.

1510 1 474 0
1306 4 3141 0
1119 4 1513 0
1402 2 599 0
1120 2 281 0
1144 1 1 0
1237 3 802 0
1447 2 2014 0
1325 5 4053 0
1168 1 73 0

If I changed SwitchToThread to Sleep(1), result is as follows.

kornelpal commented 5 years ago

Please see and explanation of my understanding of this WOW64 bug and I hope that it provides some more insight into the bug and improves understanding on why CONTEXT_SERVICE_ACTIVE is needed to be excluded and why ResumeThread might improve chances of capturing a reliable thread context:

GetThreadContext always returns the user-mode context and SetThreadContext always changes the user-mode context. When a thread is in kernel-mode code, GetThreadContext returns and SetThreadContext changes the context that will be used after kernel-mode returns to user-mode that is useful for user-mode debuggers. Transitioning from 32-bit to 64-bit occurs in user-mode in WOW64 on x64. WOW64 hides this transitioning by treating 64-bit user-mode as kernel-mode for GetThreadContext and SetThreadContext purposes.

WOW64 on x64 captures the 32-bit thread context in 64-bit user-mode, and stores it in TLS (thread-local storage). This TLS slot is used by WOW64 GetThreadContext and SetThreadContext whenever the thread is in 64-bit user-mode code, or is in kernel-mode. A race condition occurs when the thread is already in 64-bit user-mod but the thread context capturing has not started or finished yet. Unfortunately this race condition is not addressed by WOW64. This results in GetThreadContext returning captured thread context from the previous WOW64 transitioning (most likely the previous system call), or a mix of the current and previous thread contexts. This also results in the changes by SetThreadContext being fully or partially overwritten by the newly saved thread context.

This race condition can be avoided by discarding GetThreadContext results when the thread is not in 32-bit user-mode. This also excludes all systems calls (64-bit kernel-mode) that is not prone to his race condition. Native x64 GetThreadContext can be used to detect when the thread really is in kernel-mode, so the exclusion can be narrowed down to just 64-bit user-mode that is the WOW64 transitioning layer and thus would minimize the impact of such a workaround. This however requires an x64 helper executable and might introduce too much complexity.

ResumeThread gives the thread a chance to continue execution past the bug-prone window and thus improves the chance of capturing a reliable thread context after suspending the thread again. Unfortunately this potential is severely limited when being in actual kernel-mode is treated as unsafe too, because the thread is more likely to be in actual kernel-mode than in the problematic WOW64 transition code that is also reported as being in kernel-mode by WOW64. In my opinion, the resume-suspend-capture approach has more potential when used with a native x64 helper process providing access to the native x64 GetThreadContext that enables identifying when the thread is in 64-bit user-mode WOW64 code.

CONTEXT_EXCEPTION_REQUEST requests information about kernel-mode state. When supported, CONTEXT_EXCEPTION_REPORTING is set, and relevant information in provided. When CONTEXT_EXCEPTION_REPORTING is not set, the absence of relevant flags does not mean that the tread is in user-mode and it is safe to assume that this feature is not supported. CONTEXT_EXCEPTION_ACTIVE is set when a trap (CPU exception) handler is active in kernel-mode. CONTEXT_SERVICE_ACTIVE is set when a system call (kernel-mode service for user-mode) is active in kernel-mode. The presence of any of these means that the thread is in kernel-mode. On WOW64 these also may mean that the thread is in 64-bit user-mode WOW64 code. Unfortunately I'm not sure whether the absence of these two flags also means that the thread is in user-mode, but according to the reply from Microsoft in the previously referenced forum thread, no other cases are affected by this bug.

By omitting CONTEXT_SERVICE_ACTIVE, the workaround is not applied to a fairly large portion of cases affected by this bug. Because waits are system calls, the experienced infinite loop might indicate that the thread is waiting for the GC to complete and the GC is waiting for the thread to stop waiting that results in a deadlock.

ivmai commented 5 years ago

To @Hamayama:

Also, If I didn't comment out CONTEXT_SERVICE_ACTIVE, gctest goes to infinite loop.

Please check how many times you observe the condition is satisfied after the first call of SuspendThread: (ContextFlags & CONTEXT_EXCEPTION_REPORTING) != 0 && (ContextFlags & (CONTEXT_EXCEPTION_ACTIVE | CONTEXT_SERVICE_ACTIVE)) == 0

I think if this condition is satisfied 99% after the 1st call of SuspendThread then we could just once check the condition (with CONTEXT_SERVICE_ACTIVE) and in case of failure we could fall back to scanning of the whole stack obtained by GetThreadSelectorEntry.

Hamayama commented 5 years ago

I made a following patch and measured counts.

patchC

--- win32_threads_orig.c    2019-01-22 07:01:45.000000000 +0900
+++ win32_threads.c 2019-01-27 09:55:30.917743000 +0900
@@ -1169,6 +1169,9 @@
 # endif
 }

+int total_call = 0;
+int ok_count = 0;
+
 /* Suspend the given thread, if it's still active.      */
 STATIC void GC_suspend(GC_thread t)
 {
@@ -1203,6 +1206,27 @@
 # else
     if (SuspendThread(t -> handle) == (DWORD)-1)
       ABORT("SuspendThread failed");
+
+#  if defined(I386)
+#   define CONTEXT_EXCEPTION_ACTIVE 0x08000000
+#   define CONTEXT_SERVICE_ACTIVE 0x10000000
+#   define CONTEXT_EXCEPTION_REQUEST 0x40000000
+#   define CONTEXT_EXCEPTION_REPORTING 0x80000000
+    /* Wow64 workaround */
+    CONTEXT context;
+    context.ContextFlags = CONTEXT_EXCEPTION_REQUEST;
+    if (!GetThreadContext(THREAD_HANDLE(t), &context))
+      ABORT("GetThreadContext failed");
+    if ((context.ContextFlags & CONTEXT_EXCEPTION_REPORTING) != 0
+        && (context.ContextFlags & (CONTEXT_EXCEPTION_ACTIVE
+                                    | CONTEXT_SERVICE_ACTIVE)) == 0) {
+      ok_count++;
+    }
+    total_call++;
+    fprintf(stderr, "%d %d\n", total_call, ok_count);
+    //fflush(stderr);
+#  endif /* I386 */
+
 # endif /* !MSWINCE */
   t -> suspended = (unsigned char)TRUE;
   GC_release_dirty_lock();

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + apply patchC gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c ./gctest.exe

result:

I repeated to run ./gctest.exe 10 times and result is as follows (last line only) .

If I commented out CONTEXT_SERVICE_ACTIVE, result is as follows.

ivmai commented 5 years ago

1160 88

This means only SuspendThread completes "normally" only in 1% (e.g. because the situation of a thread is waiting for the GC to complete is frequent), it is not guaranteed that the registers are saved in other cases. But the required registers seem to be saved (at least gctest is not crashed) in 99.9%.

ivmai commented 5 years ago

To @Hamayama: could you please also check that we can use GetThreadSelectorEntry-based approach (instead of scanning the registers) in GC_push_stack_for?

Hamayama commented 5 years ago

To @Hamayama: could you please also check that we can use GetThreadSelectorEntry-based approach (instead of scanning the registers) in GC_push_stack_for?

I don't know I can do it but I will try.

By the way, I modified MS forum's sample code for MSYS2/MinGW-w64 32bit environment. https://gist.github.com/Hamayama/da2e08996ce0bc3412562d216172bb05

result is as follows.

total_call=1000000 loss_occur=642 repeat_sum=5315601

If I commented out only #define USE_WOW64_WORKAROUND, result is as follows.

ERROR: Stale ESP ! (Context ESP: 0xa9ff6c, Saved ESP: 0xa9ff5c, Previous ESP: 0xa9ff74)

If I commented out only #define USE_RESUME_SUSPEND_REPEAT, result is infinite loop (using cpu and never finished).

If I commented out only #define USE_CONTEXT_EXCEPTION_ACTIVE, result is as follows.

ERROR: Stale ESP ! (Context ESP: 0x9dff70, Saved ESP: 0x9dff6c, Previous ESP: 0x9dff78)

If I commented out only #define USE_CONTEXT_SERVICE_ACTIVE, result is as follows.

total_call=1000000 loss_occur=278 repeat_sum=509330

So, CONTEXT_SERVICE_ACTIVE seems to be not necessary at least for 1000000 times call of GetThreadContext.

Is there any concrete example that breaks the context if CONTEXT_SERVICE_ACTIVE is omitted?

I understand we can't use ResumeThread+SuspendThread method because it doesn't work when the thread is waiting.

But if we use GetThreadContext only to check the context is available, CONTEXT_SERVICE_ACTIVE seems to affect quite perfomance of GC.

kornelpal commented 5 years ago

Thank you @Hamayama for looking into CONTEXT_SERVICE_ACTIVE and for the detailed explanation, unfortunately I was unaware of this and haven't tested this previously. I did some more testing and I was unable find a case when CONTEXT_SERVICE_ACTIVE is set and stale registers are returned either.

When a thread is in kernel-mode executing a system call, CONTEXT_SERVICE_ACTIVE is supposed to be set and CONTEXT_EXCEPTION_ACTIVE is not supposed to be set. You seem to have uncovered another bug in WOW64. If it gets fixed in a future release without fixing the stale context bug then a workaround only checking for CONTEXT_EXCEPTION_ACTIVE will stop working.

Note that I've tested and waiting threads will correctly have CONTEXT_SERVICE_ACTIVE set and CONTEXT_EXCEPTION_ACTIVE not set, so waits are not going to affect the workaround when only CONTEXT_EXCEPTION_ACTIVE is checked for.

To be fast and future proof at the same time, I would do the following:

Retry only when CONTEXT_EXCEPTION_ACTIVE is set and CONTEXT_SERVICE_ACTIVE is not set.
Have some upper limit on the number retries to avoid infinite loops, because you cannot be sure how much (if any) execution time the thread gets and what it will do between resuming and suspending it again.
Fall back to full stack scan when the loop failed to capture a reliable context that had neither CONTEXT_EXCEPTION_ACTIVE nor CONTEXT_SERVICE_ACTIVE set.
Even when performing a full stack scan, scan all the registers as usual, because they might contain up-to-date values.

If a full stack scan is too expensive then ignoring CONTEXT_SERVICE_ACTIVE might not ever become an issue, because fixing the stale context bug does not seem to be a priority, so chances are that CONTEXT_EXCEPTION_ACTIVE will not be changed to CONTEXT_SERVICE_ACTIVE anytime soon either.

Also note that went through my previous tests, and I found that both ECX and EDX are set to 0 when in WOW64 or in kernel-mode, that might be useful if you want to add some basic support for older Windows versions that have no support for CONTEXT_EXCEPTION_REQUEST in WOW64.

ivmai commented 5 years ago

As understand from the above discussion, we could use the following scenario for I386: if host is WOW64 then check for CONTEXT_EXCEPTION_ACTIVE is set and CONTEXT_SERVICE_ACTIVE is not set just once, if not then use StackLimit instead of SP (i.e. register are scanned any way).

I don't know I can do it but I will try.

@Hamayama, I will try to do it today to (i.e. check whether gctest works if we use StackLimit instead of SP (even without register scan)).

Hamayama commented 5 years ago

OK!

I noticed that ContextFlags needs CONTEXT_SEGMENTS flag to get context.SegFs via GetThreadContext. The code fragment is like as follows. (This might not work ...)

    CONTEXT context;
    context.ContextFlags = CONTEXT_INTEGER|CONTEXT_CONTROL|CONTEXT_SEGMENTS;
    if (!GetThreadContext(THREAD_HANDLE(thread), &context))
      ABORT("GetThreadContext failed");

    LDT_ENTRY selector;
    if (!GetThreadSelectorEntry(THREAD_HANDLE(thread),
                                context.SegFs,
                                &selector))
      ABORT("GetThreadSelectorEntry failed");
    PNT_TIB tib = (PNT_TIB)(selector.BaseLow
                            | (selector.HighWord.Bits.BaseMid << 16)
                            | (selector.HighWord.Bits.BaseHi << 24));
    //fprintf(stderr, "%x %x\n", tib->StackLimit, tib->StackBase);
    //fflush(stderr);
    //ABORT("XXXXX");
    sp = (ptr_t)tib->StackLimit;
    //PUSH4(Edi,Esi,Ebx,Edx), PUSH2(Ecx,Eax), PUSH1(Ebp);
    //sp = (ptr_t)context.Esp;

Hamayama commented 5 years ago

I made a following patch as a prototype.

patchD

--- win32_threads_orig.c    2019-01-25 17:42:43.000000000 +0900
+++ win32_threads.c 2019-01-31 00:58:05.606638400 +0900
@@ -1389,6 +1389,30 @@
           && !(last_info.Protect & PAGE_GUARD);
 }

+#if defined(I386)
+/* Wow64 workaround */
+static int is_wow64()
+{
+  static int wow64_flag = -1;
+
+  if (wow64_flag == -1) {
+    wow64_flag = 0;
+    HMODULE hK32 = GetModuleHandle(TEXT("kernel32.dll"));
+    if (hK32) {
+      FARPROC pfn = GetProcAddress(hK32, "IsWow64Process");
+      BOOL bIsWow64 = FALSE;
+      if (pfn
+          && (*(BOOL (WINAPI*)(HANDLE, BOOL*))pfn)(GetCurrentProcess(),
+                                                   &bIsWow64)
+          && bIsWow64) {
+        wow64_flag = 1;
+      }
+    }
+  }
+  return wow64_flag;
+}
+#endif
+
 STATIC word GC_push_stack_for(GC_thread thread, DWORD me)
 {
   ptr_t sp, stack_min;
@@ -1402,7 +1426,22 @@
               /* Use saved sp value for blocked threads. */
     /* For unblocked threads call GetThreadContext().   */
     CONTEXT context;
-    context.ContextFlags = CONTEXT_INTEGER|CONTEXT_CONTROL;
+#   if defined(I386)
+#    if !defined(CONTEXT_EXCEPTION_ACTIVE)
+#     define CONTEXT_EXCEPTION_ACTIVE    0x08000000
+#     define CONTEXT_SERVICE_ACTIVE      0x10000000
+#     define CONTEXT_EXCEPTION_REQUEST   0x40000000
+#     define CONTEXT_EXCEPTION_REPORTING 0x80000000
+#    endif /* !CONTEXT_EXCEPTION_ACTIVE */
+    /* Wow64 workaround */
+    if (is_wow64())
+      context.ContextFlags = CONTEXT_INTEGER|CONTEXT_CONTROL
+                             | CONTEXT_EXCEPTION_REQUEST
+                             | CONTEXT_SEGMENTS;
+    else
+#   endif /* I386 */
+    /* else */
+      context.ContextFlags = CONTEXT_INTEGER|CONTEXT_CONTROL;
     if (!GetThreadContext(THREAD_HANDLE(thread), &context))
       ABORT("GetThreadContext failed");

@@ -1413,8 +1452,29 @@
 #   define PUSH2(r1,r2) (PUSH1(r1), PUSH1(r2))
 #   define PUSH4(r1,r2,r3,r4) (PUSH2(r1,r2), PUSH2(r3,r4))
 #   if defined(I386)
-      PUSH4(Edi,Esi,Ebx,Edx), PUSH2(Ecx,Eax), PUSH1(Ebp);
-      sp = (ptr_t)context.Esp;
+      /* Wow64 workaround */
+      if (is_wow64()
+          && (context.ContextFlags & CONTEXT_EXCEPTION_REPORTING)
+          && (context.ContextFlags & (CONTEXT_EXCEPTION_ACTIVE
+                                      /* | CONTEXT_SERVICE_ACTIVE */))) {
+        LDT_ENTRY selector;
+        if (!GetThreadSelectorEntry(THREAD_HANDLE(thread),
+                                    context.SegFs,
+                                    &selector))
+          ABORT("GetThreadSelectorEntry failed");
+        PNT_TIB tib = (PNT_TIB)(selector.BaseLow
+                                | (selector.HighWord.Bits.BaseMid << 16)
+                                | (selector.HighWord.Bits.BaseHi << 24));
+        //fprintf(stderr, "stack1: %x %x\n", tib->StackLimit, tib->StackBase);
+        //fprintf(stderr, "stack2: %x %x\n", context.Esp, thread->stack_base);
+        //fprintf(stderr, "data1: %x %x\n", (DWORD*)(tib->StackLimit),   *((DWORD*)(tib->StackLimit)));
+        //fprintf(stderr, "data2: %x %x\n", (DWORD*)(tib->StackLimit)+1, *((DWORD*)(tib->StackLimit)+1));
+        //fflush(stderr);
+        sp = (ptr_t)tib->StackLimit;
+      } else {
+        PUSH4(Edi,Esi,Ebx,Edx), PUSH2(Ecx,Eax), PUSH1(Ebp);
+        sp = (ptr_t)context.Esp;
+      }
 #   elif defined(X86_64)
       PUSH4(Rax,Rcx,Rdx,Rbx); PUSH2(Rbp, Rsi); PUSH1(Rdi);
       PUSH4(R8, R9, R10, R11); PUSH4(R12, R13, R14, R15);

I tested gc master 88338a5 (latest master) .

OS: Windows 8.1 (64bit) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project))

1. + remove DONT_ADD_BYTE_AT_END + remove ALL_INTERIOR_POINTERS + apply patchD gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done

result: TOTAL COUNT: 1000 ERROR COUNT: 0

If I uncommented fprintf and fflush, result is as follows.

./gctest.exe
stack1: 243d000 2440000
stack2: 243fcf0 243ff50
data1: 243d000 0
data2: 243d004 0
stack1: 241a000 2440000
stack2: 243fdb8 243ff50
data1: 241a000 0
data2: 241a004 0
stack1: 52fe000 5300000
stack2: 52fe000 52fff50
data1: 52fe000 74b1b2e4
data2: 52fe004 52fe070
stack1: 289a000 28c0000
stack2: 28bf7b0 28bff50
data1: 289a000 0
data2: 289a004 0
stack1: 267d000 2680000
stack2: 267fe60 267ff50
data1: 267d000 0
data2: 267d004 0

ivmai commented 5 years ago

Thank you, guys, very good! I'll polish the patch, add comments and push to master.

kornelpal commented 5 years ago

This version of the workaround really is simple and efficient, good job!

I would however still PUSH the registers, even when the workaround is activated, because those might be up-to-date.

ivmai commented 5 years ago

I would however still PUSH the registers

Sure. The above patch skips regs pushing for a testing sake only.

ivmai commented 5 years ago

I will push the commit this week.

ivmai commented 5 years ago

To @Hamayama: please test the latest master (with repetitive test "1." and "make check").

Hamayama commented 5 years ago

I tested gc master 9483d5b (latest master) .

OS: Windows 8.1 (64bit) DevTool: MSYS2/MinGW-w64 (32bit) (gcc version 7.3.0 (Rev2, Built by MSYS2 project))

testA:

./autogen.sh
./configure --enable-threads=win32 --enable-large-config
make check

result: TOTAL: 16 PASS: 16

testB:

COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done

result: TOTAL COUNT: 1000 ERROR COUNT: 0

1. gcc -g -O0 -I include -D GC_ASSERTIONS -D GC_THREADS -D GC_BUILTIN_ATOMIC -D LARGE_CONFIG -D DONT_ADD_BYTE_AT_END -D ALL_INTERIOR_POINTERS -o gctest.exe tests/test.c extra/gc.c COUNT=0; ERRCNT=0; while :; do ./gctest.exe || ((ERRCNT++)); ((COUNT++)); echo $COUNT $ERRCNT; done TOTAL COUNT: 1000 ERROR COUNT: 0

It's all OK. Thanks a lot!

ivmai / bdwgc

gctest fails sometimes (List reversal produced incorrect list) with Mingw 32-bit #262