libsdl-org / SDL

Simple Directmedia Layer
https://libsdl.org
zlib License
9.86k stars 1.83k forks source link

[SDL2] testatomic occasionally segfaults #8576

Closed madebr closed 3 months ago

madebr commented 11 months ago

I noticed a segfault while running testatomic.exe on ci, using clang-cl 32-bit. In this run, testatomic.exe started execution approximately 1s after testautomation.exe. testatomic.exe and testautomation.exe were running in parallel.

Copy of the log

icculus commented 11 months ago

Hmm, this didn't even write the SDL_Log at the start of RunBasicTest (but maybe these are buffered and lost when the crash occurs?).

sezero commented 5 months ago

testsem has been occasionally segfaulting in my fork's CI runs lately. This one is the second time I've run into it: https://github.com/sezero/SDL/actions/runs/9215737866/job/25354666540

EDIT: Forgot to mention that it is happening with SDL3 branch, not SDL2. Maybe open a new ticket for it?

madebr commented 4 months ago

https://github.com/libsdl-org/SDL/actions/runs/9707802445/job/26793599201#step:11:1403

14: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT ExceptionCode: 0xc00000fd (EXCEPTION_STACK_OVERFLOW) ExceptionFlags: 0x00000000

Sadly, the generated minidumps were not uploaded, but 339d83bdf should fix that.

madebr commented 3 months ago

There's a minidump available in this ci run, but I don't know how to load it in (wine)gdb. The stacktrace shows something inside SDL3 overflowing the stack.

edit: we currently build mingw in release mode without debug information, so there's no way to do an address->source transformation

madebr commented 3 months ago

I added a for loop to MSVC ci to run the tests 100x until it failed, and catched a failure in torturethread. https://github.com/madebr/SDL/actions/runs/9890966200

But it does not make me any wiser about a SDL issue.

This is the stacktrace:

ntdll.dll!NtWaitForSingleObject()
KERNELBASE.dll!WaitForSingleObjectEx()
SDL3.dll!SDL_SYS_WaitThread(SDL_Thread * thread) Line 186
    at D:\a\SDL\SDL\src\thread\windows\SDL_systhread.c(186)
SDL3.dll!SDL_WaitThread_REAL(SDL_Thread * thread, int * status) Line 419
    at D:\a\SDL\SDL\src\thread\SDL_thread.c(419)
torturethread.exe!SDL_main(int argc, char * * argv) Line 123
    at D:\a\SDL\SDL\test\torturethread.c(123)
SDL3.dll!SDL_RunApp_REAL(int _argc, char * * _argv, int(*)(int, char * *) mainFunction, void * reserved) Line 88
    at D:\a\SDL\SDL\src\main\windows\SDL_sysmain_runapp.c(88)
[External Code]

It fails whilst waiting on the first thread here: https://github.com/libsdl-org/SDL/blob/b5a7b1ca89f4187641aa25e411e69448a56955f3/src/thread/windows/SDL_systhread.c#L183-L187

Complete output of stderr:

2024-07-11T13:19:46.2392229Z test 17
2024-07-11T13:19:46.2392821Z       Start 17: torturethread
2024-07-11T13:19:46.2393168Z 
2024-07-11T13:19:46.2394046Z 17: Test command: D:\a\SDL\SDL\build\SDL\test\Release\sdlprocdump.exe "D:/a/SDL/SDL/build/SDL/test/Release/torturethread.exe"
2024-07-11T13:19:46.2395256Z 17: Working Directory: D:/a/SDL/SDL/build/SDL/test
2024-07-11T13:19:46.2395678Z 17: Environment variables: 
2024-07-11T13:19:46.2395979Z 17:  SDL_AUDIO_DRIVER=dummy
2024-07-11T13:19:46.2396278Z 17:  SDL_VIDEO_DRIVER=dummy
2024-07-11T13:19:46.2396599Z 17:  PATH=D:/a/SDL/SDL/build/SDL/Release
2024-07-11T13:19:46.2397186Z 17: Test timeout computed to be: 30
2024-07-11T13:19:46.2578989Z 17: INFO: Creating Thread 0
2024-07-11T13:19:46.2579711Z 17: INFO: Creating Thread 1
2024-07-11T13:19:46.2584236Z 17: INFO: Creating Thread 2
2024-07-11T13:19:46.2584894Z 17: INFO: Creating Thread 3
2024-07-11T13:19:46.2595480Z 17: INFO: Creating Thread 6
2024-07-11T13:19:46.2596271Z 17: INFO: Creating Thread 5
2024-07-11T13:19:46.2596719Z 17: INFO: Creating Thread 7
2024-07-11T13:19:46.2613363Z 17: INFO: Creating Thread 8
2024-07-11T13:19:46.2621109Z 17: INFO: Creating Thread 9
2024-07-11T13:19:46.2673250Z 17: INFO: Thread '5' waiting for signal
2024-07-11T13:19:46.2738974Z 17: INFO: Thread '5' sending signals to subthreads
2024-07-11T13:19:46.2740214Z 17: INFO: Thread '2' waiting for signal
2024-07-11T13:19:46.2741259Z 17: INFO: Thread '2' sending signals to subthreads
2024-07-11T13:19:46.2741891Z 17: INFO: Thread '7' waiting for signal
2024-07-11T13:19:46.2742933Z 17: INFO: Thread '7' sending signals to subthreads
2024-07-11T13:19:46.2743428Z 17: INFO: Thread '1' waiting for signal
2024-07-11T13:19:46.2743900Z 17: INFO: Thread '1' sending signals to subthreads
2024-07-11T13:19:46.2744346Z 17: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT
2024-07-11T13:19:46.2744990Z 17: [SDLPROCDUMP]        ExceptionCode: 0xc00000fd (EXCEPTION_STACK_OVERFLOW)
2024-07-11T13:19:46.2745707Z 17: [SDLPROCDUMP]       ExceptionFlags: 0x00000000 ()
2024-07-11T13:19:46.2746197Z 17: [SDLPROCDUMP]          FirstChance: 1
2024-07-11T13:19:46.2746633Z 17: [SDLPROCDUMP]     ExceptionAddress: 0x40c67ff4
2024-07-11T13:19:46.2747555Z 17: [SDLPROCDUMP]     (Non-continuable exception debug event)
2024-07-11T13:19:46.2748112Z 17: [SDLPROCDUMP] 
2024-07-11T13:19:46.2748681Z 17: [SDLPROCDUMP] Writing minidump to "minidumps/torturethread_2024-07-11_13-19-46.dmp"

It's curious to see only threads 1, 2, 5 and 7 printed something. The other threads remain silent. This probably does not mean anything.

slouken commented 3 months ago

Ah, thank you!

Here's the stack of the failing thread:

        ...
    SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86  C
    SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
    SDL3.dll!SDL_SYS_GetTLSData() Line 50   C
    SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39   C
    [Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42   C
    SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86  C
    SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
    SDL3.dll!SDL_SYS_GetTLSData() Line 50   C
    SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39   C
    [Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42   C
    SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86  C
    SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
    SDL3.dll!SDL_SYS_GetTLSData() Line 50   C
    SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39   C
    [Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42   C
    SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86  C
    SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
    SDL3.dll!SDL_SYS_GetTLSData() Line 50   C
    [Inline Frame] SDL3.dll!SDL_CleanupTLS_REAL() Line 85   C
    SDL3.dll!SDL_RunThread(SDL_Thread * thread) Line 296    C
    [Inline Frame] SDL3.dll!RunThread(void *) Line 44   C
    SDL3.dll!RunThreadViaBeginThreadEx(void * data) Line 58 C
    torturethread.exe!00007ff6977b9c4a()    Unknown
    kernel32.dll!BaseThreadInitThunk() Unknown
    ntdll.dll!RtlUserThreadStart() Unknown
slouken commented 3 months ago

I'm digging into this, but fixing it turns out to be a pretty big change. I'm working on a PR now.

madebr commented 3 months ago

Is SDL2 also vulnerable to this race condition?

slouken commented 3 months ago

Yep, I'm working on cherry-picking it now.

slouken commented 3 months ago

Done!

madebr commented 3 months ago

I was just trying to hook up sdlprocdump on sdl2, and got a segfault in testatomic.

https://github.com/madebr/SDL/actions/runs/9899075063

I'm not on Windows rn, so I can't immediately look into the minidump.

log:

2024-07-11T22:07:25.2071128Z 2: Test command: D:\a\SDL\SDL\build\SDL\test\Release\sdlprocdump.exe "D:/a/SDL/SDL/build/SDL/test/Release/testatomic.exe"
2024-07-11T22:07:25.2071964Z 2: Working Directory: D:/a/SDL/SDL/build/SDL/test
2024-07-11T22:07:25.2072759Z 2: Environment variables: 
2024-07-11T22:07:25.2073916Z 2:  SDL_AUDIODRIVER=dummy
2024-07-11T22:07:25.2074556Z 2:  SDL_VIDEODRIVER=dummy
2024-07-11T22:07:25.2075473Z 2: Test timeout computed to be: 10
2024-07-11T22:07:25.2422083Z 2: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT
2024-07-11T22:07:25.2472320Z 2: [SDLPROCDUMP]        ExceptionCode: 0xc0000005 (EXCEPTION_ACCESS_VIOLATION)
2024-07-11T22:07:25.2472811Z 2: [SDLPROCDUMP]       ExceptionFlags: 0x00000000
2024-07-11T22:07:25.2473193Z 2: [SDLPROCDUMP]     ExceptionAddress: 0x00f6100a
2024-07-11T22:07:25.2473564Z 2: [SDLPROCDUMP]          FirstChance: 1
2024-07-11T22:07:25.2474061Z 2: [SDLPROCDUMP]     (Non-continuable exception debug event)
2024-07-11T22:07:25.2475739Z 2: [SDLPROCDUMP] 
2024-07-11T22:07:25.2476202Z 2: [SDLPROCDUMP] Writing minidump to "minidumps/testatomic_2024-07-11_22-07-25.dmp"
2024-07-11T22:07:33.7080668Z 2: [SDLPROCDUMP] 

I'm reopening to track this.

slouken commented 3 months ago

It looks like a stack issue again, which is strange since the code that was recursing before no longer exists.

Any chance you can hook it up to MSVC torturethread and get a stack trace with a PDB we can read?

madebr commented 3 months ago

There was a minidump available but I couldn't make sense of the crash location: at the start of SDL_main around SDL_LogSetPriority.

Modifying testatomic to do its work 10x on 10 threads made it immediately crash due to a divide by zero here. This race condition made me realize testatomic does not do SDL_Init! Perhaps that's what's causing issues?

slouken commented 3 months ago

Yes, probably, although the timer init should probably be made atomic. Let me see what I can do here.

madebr commented 3 months ago

Remove SDL_Init in tortureatomic here, and the issue should reproduce reliably.

slouken commented 3 months ago

I fixed the timer race condition in b378bc5c8. I'll do a more complete change for SDL3.

slouken commented 3 months ago

Okay, I have PRs for main and SDL2 that should take care of this. Can you take a look?

slouken commented 3 months ago

So I think this is finally fixed now. :)

sezero commented 3 months ago

Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447

slouken commented 3 months ago

sigh

madebr commented 3 months ago

Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447

That looks the same as my SDL2 crash I'm going to add sdlprocdump to SDL2 git so we at least have dumps available.

sezero commented 3 months ago

And this is in release-2.30.x branch here in this repo: https://github.com/libsdl-org/SDL/actions/runs/9910876675/job/27382301210

slouken commented 3 months ago

Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447

That looks the same as my SDL2 crash I'm going to add sdlprocdump to SDL2 git so we at least have dumps available.

That would be great. It would be awesome to have a stack trace for this. :)

madebr commented 3 months ago

I reproduced a crash here. The stack trace is the same as the previous one: unexpected (I don't know where it goes wrong)

testatomic.exe!SDL_main(int argc, char * * argv) Line 703
    at D:\a\SDL\SDL\test\testatomic.c(703)
testatomic.exe!main_getcmdline() Line 80
    at D:\a\SDL\SDL\src\main\windows\SDL_windows_main.c(80)
[External Code]

It's curious to see the issue only appearing with clang-cl x86

slouken commented 3 months ago

Yeah, I'm not sure what's happening here. This appears to be happening in the setup before the first SDL call in main, and it's saying the stack isn't writable.

Can we strip it down to just run that test and add a log message at the beginning so we can verify that it's getting past that?

madebr commented 3 months ago

So I added a print in SDL2main's main_getcmdline and at the start of main, and got this output:

Before SDL_main

Meaning it crashes at the start of testatomic's start, before it can do anything there.

Looking at the assembly code of testatomic's main, building in Release mode causes all functions to be inlined. It wants to allocate 0x4830 bytes on the stack, which is larger then one page (most commonly 4kiB=0x4000).

When SDL is built with -DSDL_LIBC=OFF, it also adds /Gs1048576 to disable inserting stack probes.

Checking that idea here...

slouken commented 3 months ago

Ohhhhhh, good idea. Maybe we just don't have enough stack space?

madebr commented 3 months ago

Ohhhhhh, good idea. Maybe we just don't have enough stack space?

No, we're skipping the stack probe page. That makes the mmu think we wrote to unallocated memory -> segmentation fault (EXCEPTION_ACCESS_VIOLATION). That also means chkstk for arm platforms might be more important then we realize.

madebr commented 3 months ago

Fixed in bc39952b2 I ran testatomic with this change >500x on ci and it did not crash anymore, whereas the original one crashed very often.

chkstk for SDL is not 100% important, as long as SDL does not allocate things on the stack >4kiB such that the top of the stack hops over the stack probe page.

edit: release-2.30.x

sezero commented 3 months ago

Does sdl3 need any similer attention?

madebr commented 3 months ago

Does sdl3 need any similer attention?

It already received this treatment, back when we bumped the minimum required version to 3.16 and did some rework.

sezero commented 3 months ago

Well, this happened on linux: https://github.com/libsdl-org/SDL/actions/runs/9916715732/job/27399343360

95% tests passed, 1 tests failed out of 22
Errors while running CTest
Output from these tests are in: /home/runner/work/SDL/SDL/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

Total Test time (real) =  22.09 sec

The following tests FAILED:
     13 - testsem (Failed)
Error: Process completed with exit code 8.
madebr commented 3 months ago

This is the first time I see testsem fail on Linux.

INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
WARN: 

Assertion failure at TestOverheadContended (/home/runner/work/SDL/SDL/test/testsem.c:232), triggered 1 time:
 '(loop_count - content_count) == 10000 * 10'

Abort/Break/Retry/Ignore/AlwaysIgnore? [abriA] : 
21/22 Test #13: testsem ..........................***Failed   12.37 sec
madebr commented 3 months ago

I've reproduced it a few times on my machine. I added a check for the return type of SDL_TryWaitSemaphore and it sometimes returns a negative value.

INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
INFO: SDL_TryWaitSemaphore failed (sem_timedwait returned an error: Invalid argument)
INFO: copy_loop_count = 100275
INFO: copy_content_count = 274
WARN: 

Assertion failure at TestOverheadContended (/home/maarten/programming/SDL/test/testsem.c:248), triggered 1 time:
  '(copy_loop_count - copy_content_count) == 10000 * 10'

I applied the following diff to speed up the iterations:

Patch ```patch --- a/test/CMakeLists.txt +++ b/test/CMakeLists.txt @@ -399,7 +399,13 @@ if(WIN32 AND CMAKE_SIZEOF_VOID_P EQUAL 4) endif() add_sdl_test_executable(testrendertarget NEEDS_RESOURCES TESTUTILS SOURCES testrendertarget.c) add_sdl_test_executable(testscale NEEDS_RESOURCES TESTUTILS SOURCES testscale.c) -add_sdl_test_executable(testsem NONINTERACTIVE NONINTERACTIVE_ARGS 10 NONINTERACTIVE_TIMEOUT 30 SOURCES testsem.c) +set(THREAD_COUNT 100) +add_sdl_test_executable(testsem NONINTERACTIVE NONINTERACTIVE_ARGS ${THREAD_COUNT} NONINTERACTIVE_TIMEOUT 30 SOURCES testsem.c) +add_test(NAME testsem2 COMMAND testsem ${THREAD_COUNT}) +add_test(NAME testsem3 COMMAND testsem ${THREAD_COUNT}) +add_test(NAME testsem4 COMMAND testsem ${THREAD_COUNT}) +add_test(NAME testsem5 COMMAND testsem ${THREAD_COUNT}) +add_test(NAME testsem6 COMMAND testsem ${THREAD_COUNT}) add_sdl_test_executable(testsensor SOURCES testsensor.c) add_sdl_test_executable(testshader NEEDS_RESOURCES TESTUTILS SOURCES testshader.c) add_sdl_test_executable(testshape NEEDS_RESOURCES SOURCES testshape.c ${glass_bmp_header}) diff --git a/test/testsem.c b/test/testsem.c index e041b7645..a3c83cc21 100644 --- a/test/testsem.c +++ b/test/testsem.c @@ -166,7 +166,11 @@ ThreadFuncOverheadContended(void *data) if (state->flag) { while (alive) { - if (SDL_TryWaitSemaphore(sem) == SDL_MUTEX_TIMEDOUT) { + int r = SDL_TryWaitSemaphore(sem); + if (r < 0) { + SDL_Log("SDL_TryWaitSemaphore failed (%s)", SDL_GetError()); + } + if (r == SDL_MUTEX_TIMEDOUT) { ++state->content_count; } ++state->loop_count; @@ -174,7 +178,11 @@ ThreadFuncOverheadContended(void *data) } else { while (alive) { /* Timeout needed to allow check on alive flag */ - if (SDL_WaitSemaphoreTimeout(sem, 50) == SDL_MUTEX_TIMEDOUT) { + int r = SDL_WaitSemaphoreTimeout(sem, 1); + if (r < 0) { + SDL_Log("SDL_TryWaitSemaphore failed (%s)", SDL_GetError()); + } + if (r == SDL_MUTEX_TIMEDOUT) { ++state->content_count; } ++state->loop_count; @@ -229,7 +237,13 @@ TestOverheadContended(SDL_bool try_wait) loop_count += thread_states[i].loop_count; content_count += thread_states[i].content_count; } - SDL_assert_release((loop_count - content_count) == NUM_OVERHEAD_OPS * NUM_OVERHEAD_OPS_MULT); + { + int copy_loop_count = loop_count; + int copy_content_count = content_count; + SDL_Log("copy_loop_count = %d", copy_loop_count); + SDL_Log("copy_content_count = %d", copy_content_count); + SDL_assert_release((copy_loop_count - copy_content_count) == NUM_OVERHEAD_OPS * NUM_OVERHEAD_OPS_MULT); + } duration = end_ticks - start_ticks; SDL_Log("Took %" SDL_PRIu64 " milliseconds, threads %s %d out of %d times in total (%.2f%%)\n", @@ -302,6 +316,7 @@ int main(int argc, char **argv) (void)signal(SIGTERM, killed); (void)signal(SIGINT, killed); +#if 0 if (init_sem > 0) { TestRealWorld(init_sem); } @@ -309,6 +324,7 @@ int main(int argc, char **argv) TestWaitTimeout(); TestOverheadUncontended(); +#endif TestOverheadContended(SDL_FALSE); ```

And run the tests as

ctest -R testsem --repeat until-fail:10000 --output-on-failure -j5

But it's still slow :(

madebr commented 3 months ago

@sezero https://github.com/libsdl-org/SDL/pull/10262 should fix the testsem issue

It has a 1/1000000000 (1 in a billion) chance of occurring :)