Closed madebr closed 3 months ago
Hmm, this didn't even write the SDL_Log at the start of RunBasicTest (but maybe these are buffered and lost when the crash occurs?).
testsem has been occasionally segfaulting in my fork's CI runs lately. This one is the second time I've run into it: https://github.com/sezero/SDL/actions/runs/9215737866/job/25354666540
EDIT: Forgot to mention that it is happening with SDL3 branch, not SDL2. Maybe open a new ticket for it?
https://github.com/libsdl-org/SDL/actions/runs/9707802445/job/26793599201#step:11:1403
14: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT ExceptionCode: 0xc00000fd (EXCEPTION_STACK_OVERFLOW) ExceptionFlags: 0x00000000
Sadly, the generated minidumps were not uploaded, but 339d83bdf should fix that.
There's a minidump available in this ci run, but I don't know how to load it in (wine)gdb. The stacktrace shows something inside SDL3 overflowing the stack.
edit: we currently build mingw in release mode without debug information, so there's no way to do an address->source transformation
I added a for loop to MSVC ci to run the tests 100x until it failed, and catched a failure in torturethread. https://github.com/madebr/SDL/actions/runs/9890966200
But it does not make me any wiser about a SDL issue.
This is the stacktrace:
ntdll.dll!NtWaitForSingleObject()
KERNELBASE.dll!WaitForSingleObjectEx()
SDL3.dll!SDL_SYS_WaitThread(SDL_Thread * thread) Line 186
at D:\a\SDL\SDL\src\thread\windows\SDL_systhread.c(186)
SDL3.dll!SDL_WaitThread_REAL(SDL_Thread * thread, int * status) Line 419
at D:\a\SDL\SDL\src\thread\SDL_thread.c(419)
torturethread.exe!SDL_main(int argc, char * * argv) Line 123
at D:\a\SDL\SDL\test\torturethread.c(123)
SDL3.dll!SDL_RunApp_REAL(int _argc, char * * _argv, int(*)(int, char * *) mainFunction, void * reserved) Line 88
at D:\a\SDL\SDL\src\main\windows\SDL_sysmain_runapp.c(88)
[External Code]
It fails whilst waiting on the first thread here: https://github.com/libsdl-org/SDL/blob/b5a7b1ca89f4187641aa25e411e69448a56955f3/src/thread/windows/SDL_systhread.c#L183-L187
Complete output of stderr:
2024-07-11T13:19:46.2392229Z test 17
2024-07-11T13:19:46.2392821Z Start 17: torturethread
2024-07-11T13:19:46.2393168Z
2024-07-11T13:19:46.2394046Z 17: Test command: D:\a\SDL\SDL\build\SDL\test\Release\sdlprocdump.exe "D:/a/SDL/SDL/build/SDL/test/Release/torturethread.exe"
2024-07-11T13:19:46.2395256Z 17: Working Directory: D:/a/SDL/SDL/build/SDL/test
2024-07-11T13:19:46.2395678Z 17: Environment variables:
2024-07-11T13:19:46.2395979Z 17: SDL_AUDIO_DRIVER=dummy
2024-07-11T13:19:46.2396278Z 17: SDL_VIDEO_DRIVER=dummy
2024-07-11T13:19:46.2396599Z 17: PATH=D:/a/SDL/SDL/build/SDL/Release
2024-07-11T13:19:46.2397186Z 17: Test timeout computed to be: 30
2024-07-11T13:19:46.2578989Z 17: INFO: Creating Thread 0
2024-07-11T13:19:46.2579711Z 17: INFO: Creating Thread 1
2024-07-11T13:19:46.2584236Z 17: INFO: Creating Thread 2
2024-07-11T13:19:46.2584894Z 17: INFO: Creating Thread 3
2024-07-11T13:19:46.2595480Z 17: INFO: Creating Thread 6
2024-07-11T13:19:46.2596271Z 17: INFO: Creating Thread 5
2024-07-11T13:19:46.2596719Z 17: INFO: Creating Thread 7
2024-07-11T13:19:46.2613363Z 17: INFO: Creating Thread 8
2024-07-11T13:19:46.2621109Z 17: INFO: Creating Thread 9
2024-07-11T13:19:46.2673250Z 17: INFO: Thread '5' waiting for signal
2024-07-11T13:19:46.2738974Z 17: INFO: Thread '5' sending signals to subthreads
2024-07-11T13:19:46.2740214Z 17: INFO: Thread '2' waiting for signal
2024-07-11T13:19:46.2741259Z 17: INFO: Thread '2' sending signals to subthreads
2024-07-11T13:19:46.2741891Z 17: INFO: Thread '7' waiting for signal
2024-07-11T13:19:46.2742933Z 17: INFO: Thread '7' sending signals to subthreads
2024-07-11T13:19:46.2743428Z 17: INFO: Thread '1' waiting for signal
2024-07-11T13:19:46.2743900Z 17: INFO: Thread '1' sending signals to subthreads
2024-07-11T13:19:46.2744346Z 17: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT
2024-07-11T13:19:46.2744990Z 17: [SDLPROCDUMP] ExceptionCode: 0xc00000fd (EXCEPTION_STACK_OVERFLOW)
2024-07-11T13:19:46.2745707Z 17: [SDLPROCDUMP] ExceptionFlags: 0x00000000 ()
2024-07-11T13:19:46.2746197Z 17: [SDLPROCDUMP] FirstChance: 1
2024-07-11T13:19:46.2746633Z 17: [SDLPROCDUMP] ExceptionAddress: 0x40c67ff4
2024-07-11T13:19:46.2747555Z 17: [SDLPROCDUMP] (Non-continuable exception debug event)
2024-07-11T13:19:46.2748112Z 17: [SDLPROCDUMP]
2024-07-11T13:19:46.2748681Z 17: [SDLPROCDUMP] Writing minidump to "minidumps/torturethread_2024-07-11_13-19-46.dmp"
It's curious to see only threads 1, 2, 5 and 7 printed something. The other threads remain silent. This probably does not mean anything.
Ah, thank you!
Here's the stack of the failing thread:
...
SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86 C
SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
SDL3.dll!SDL_SYS_GetTLSData() Line 50 C
SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39 C
[Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42 C
SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86 C
SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
SDL3.dll!SDL_SYS_GetTLSData() Line 50 C
SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39 C
[Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42 C
SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86 C
SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
SDL3.dll!SDL_SYS_GetTLSData() Line 50 C
SDL3.dll!SDL_GetTLS_REAL(unsigned int id) Line 39 C
[Inline Frame] SDL3.dll!SDL_GetWaitableTimer(...) Line 42 C
SDL3.dll!SDL_DelayNS_REAL(unsigned __int64 ns) Line 86 C
SDL3.dll!SDL_LockSpinlock_REAL(int * lock) Line 168 C
SDL3.dll!SDL_SYS_GetTLSData() Line 50 C
[Inline Frame] SDL3.dll!SDL_CleanupTLS_REAL() Line 85 C
SDL3.dll!SDL_RunThread(SDL_Thread * thread) Line 296 C
[Inline Frame] SDL3.dll!RunThread(void *) Line 44 C
SDL3.dll!RunThreadViaBeginThreadEx(void * data) Line 58 C
torturethread.exe!00007ff6977b9c4a() Unknown
kernel32.dll!BaseThreadInitThunk() Unknown
ntdll.dll!RtlUserThreadStart() Unknown
I'm digging into this, but fixing it turns out to be a pretty big change. I'm working on a PR now.
Is SDL2 also vulnerable to this race condition?
Yep, I'm working on cherry-picking it now.
Done!
I was just trying to hook up sdlprocdump on sdl2, and got a segfault in testatomic.
https://github.com/madebr/SDL/actions/runs/9899075063
I'm not on Windows rn, so I can't immediately look into the minidump.
log:
2024-07-11T22:07:25.2071128Z 2: Test command: D:\a\SDL\SDL\build\SDL\test\Release\sdlprocdump.exe "D:/a/SDL/SDL/build/SDL/test/Release/testatomic.exe"
2024-07-11T22:07:25.2071964Z 2: Working Directory: D:/a/SDL/SDL/build/SDL/test
2024-07-11T22:07:25.2072759Z 2: Environment variables:
2024-07-11T22:07:25.2073916Z 2: SDL_AUDIODRIVER=dummy
2024-07-11T22:07:25.2074556Z 2: SDL_VIDEODRIVER=dummy
2024-07-11T22:07:25.2075473Z 2: Test timeout computed to be: 10
2024-07-11T22:07:25.2422083Z 2: [SDLPROCDUMP] EXCEPTION_DEBUG_EVENT
2024-07-11T22:07:25.2472320Z 2: [SDLPROCDUMP] ExceptionCode: 0xc0000005 (EXCEPTION_ACCESS_VIOLATION)
2024-07-11T22:07:25.2472811Z 2: [SDLPROCDUMP] ExceptionFlags: 0x00000000
2024-07-11T22:07:25.2473193Z 2: [SDLPROCDUMP] ExceptionAddress: 0x00f6100a
2024-07-11T22:07:25.2473564Z 2: [SDLPROCDUMP] FirstChance: 1
2024-07-11T22:07:25.2474061Z 2: [SDLPROCDUMP] (Non-continuable exception debug event)
2024-07-11T22:07:25.2475739Z 2: [SDLPROCDUMP]
2024-07-11T22:07:25.2476202Z 2: [SDLPROCDUMP] Writing minidump to "minidumps/testatomic_2024-07-11_22-07-25.dmp"
2024-07-11T22:07:33.7080668Z 2: [SDLPROCDUMP]
I'm reopening to track this.
It looks like a stack issue again, which is strange since the code that was recursing before no longer exists.
Any chance you can hook it up to MSVC torturethread and get a stack trace with a PDB we can read?
There was a minidump available but I couldn't make sense of the crash location: at the start of SDL_main
around SDL_LogSetPriority
.
Modifying testatomic
to do its work 10x on 10 threads made it immediately crash due to a divide by zero here.
This race condition made me realize testatomic does not do SDL_Init
!
Perhaps that's what's causing issues?
Yes, probably, although the timer init should probably be made atomic. Let me see what I can do here.
Remove SDL_Init
in tortureatomic
here, and the issue should reproduce reliably.
I fixed the timer race condition in b378bc5c8. I'll do a more complete change for SDL3.
Okay, I have PRs for main and SDL2 that should take care of this. Can you take a look?
So I think this is finally fixed now. :)
Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447
sigh
Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447
That looks the same as my SDL2 crash I'm going to add sdlprocdump to SDL2 git so we at least have dumps available.
And this is in release-2.30.x branch here in this repo: https://github.com/libsdl-org/SDL/actions/runs/9910876675/job/27382301210
Just hit this in my fork in SDL2 branch: https://github.com/sezero/SDL/actions/runs/9910933216/job/27382477447
That looks the same as my SDL2 crash I'm going to add sdlprocdump to SDL2 git so we at least have dumps available.
That would be great. It would be awesome to have a stack trace for this. :)
I reproduced a crash here. The stack trace is the same as the previous one: unexpected (I don't know where it goes wrong)
testatomic.exe!SDL_main(int argc, char * * argv) Line 703
at D:\a\SDL\SDL\test\testatomic.c(703)
testatomic.exe!main_getcmdline() Line 80
at D:\a\SDL\SDL\src\main\windows\SDL_windows_main.c(80)
[External Code]
It's curious to see the issue only appearing with clang-cl x86
Yeah, I'm not sure what's happening here. This appears to be happening in the setup before the first SDL call in main, and it's saying the stack isn't writable.
Can we strip it down to just run that test and add a log message at the beginning so we can verify that it's getting past that?
So I added a print in SDL2main's main_getcmdline
and at the start of main,
and got this output:
Before SDL_main
Meaning it crashes at the start of testatomic's start, before it can do anything there.
Looking at the assembly code of testatomic's main, building in Release mode causes all functions to be inlined. It wants to allocate 0x4830 bytes on the stack, which is larger then one page (most commonly 4kiB=0x4000).
When SDL is built with -DSDL_LIBC=OFF
, it also adds /Gs1048576
to disable inserting stack probes.
Checking that idea here...
Ohhhhhh, good idea. Maybe we just don't have enough stack space?
Ohhhhhh, good idea. Maybe we just don't have enough stack space?
No, we're skipping the stack probe page. That makes the mmu think we wrote to unallocated memory -> segmentation fault (EXCEPTION_ACCESS_VIOLATION). That also means chkstk for arm platforms might be more important then we realize.
Fixed in bc39952b2 I ran testatomic with this change >500x on ci and it did not crash anymore, whereas the original one crashed very often.
chkstk for SDL is not 100% important, as long as SDL does not allocate things on the stack >4kiB such that the top of the stack hops over the stack probe page.
edit: release-2.30.x
Does sdl3 need any similer attention?
Does sdl3 need any similer attention?
It already received this treatment, back when we bumped the minimum required version to 3.16 and did some rework.
Well, this happened on linux: https://github.com/libsdl-org/SDL/actions/runs/9916715732/job/27399343360
95% tests passed, 1 tests failed out of 22
Errors while running CTest
Output from these tests are in: /home/runner/work/SDL/SDL/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
Total Test time (real) = 22.09 sec
The following tests FAILED:
13 - testsem (Failed)
Error: Process completed with exit code 8.
This is the first time I see testsem
fail on Linux.
INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
WARN:
Assertion failure at TestOverheadContended (/home/runner/work/SDL/SDL/test/testsem.c:232), triggered 1 time:
'(loop_count - content_count) == 10000 * 10'
Abort/Break/Retry/Ignore/AlwaysIgnore? [abriA] :
21/22 Test #13: testsem ..........................***Failed 12.37 sec
I've reproduced it a few times on my machine.
I added a check for the return type of SDL_TryWaitSemaphore
and it sometimes returns a negative value.
INFO: Doing 100000 contended Post/WaitTimeout operations on semaphore using 10 threads
INFO: SDL_TryWaitSemaphore failed (sem_timedwait returned an error: Invalid argument)
INFO: copy_loop_count = 100275
INFO: copy_content_count = 274
WARN:
Assertion failure at TestOverheadContended (/home/maarten/programming/SDL/test/testsem.c:248), triggered 1 time:
'(copy_loop_count - copy_content_count) == 10000 * 10'
I applied the following diff to speed up the iterations:
And run the tests as
ctest -R testsem --repeat until-fail:10000 --output-on-failure -j5
But it's still slow :(
@sezero https://github.com/libsdl-org/SDL/pull/10262 should fix the testsem issue
It has a 1/1000000000 (1 in a billion) chance of occurring :)
I noticed a segfault while running testatomic.exe on ci, using clang-cl 32-bit. In this run,
testatomic.exe
started execution approximately 1s aftertestautomation.exe
.testatomic.exe
andtestautomation.exe
were running in parallel.Copy of the log