libsdl-org / SDL

Simple Directmedia Layer
https://libsdl.org
zlib License
9.42k stars 1.75k forks source link

deadlock from xerror handler in steam #7975

Open zmike opened 1 year ago

zmike commented 1 year ago

When running the latest steam beta on llvmpipe, it looks like some XError handler inside SDL is causing a deadlock by attempting to call X11_XRRGetScreenResourcesCurrent inside the handler:

#0  0x00007fb11a2ab1d9 in __futex_abstimed_wait_common () from /lib64/libc.so.6
#1  0x00007fb11a2adb79 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libc.so.6
#2  0x00007fb11a9a97b3 in _XReply (dpy=0x270629d83800, rep=0x7fb0feb00f00, extra=0, discard=0) at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/xcb_io.c:696
#3  0x00007fb11a9232b9 in doGetScreenResources () from /lib64/libXrandr.so.2
#4  0x00007fb11af6372e in ?? () from /home/zmike/.local/share/Steam/ubuntu12_64/libSDL3.so.0
#5  0x00007fb11af4282f in ?? () from /home/zmike/.local/share/Steam/ubuntu12_64/libSDL3.so.0
#6  0x00007fb11af67743 in ?? () from /home/zmike/.local/share/Steam/ubuntu12_64/libSDL3.so.0
#7  0x00007fb11a9a78ab in _XError (dpy=dpy@entry=0x270629d83800, rep=rep@entry=0x27062c7a6bd0) at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/XlibInt.c:1503
#8  0x00007fb11a9a79bf in handle_error (dpy=0x270629d83800, err=0x27062c7a6bd0, in_XReply=<optimized out>)
    at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/xcb_io.c:211
#9  0x00007fb11a9a7a7d in handle_response (dpy=0x270629d83800, response=0x27062c7a6bd0, in_XReply=<optimized out>)
    at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/xcb_io.c:403
#10 0x00007fb11a9a97cd in _XReply (dpy=dpy@entry=0x270629d83800, rep=rep@entry=0x7fb0feb01240, extra=extra@entry=0, discard=discard@entry=1)
    at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/xcb_io.c:722
#11 0x00007fb11a9a9b55 in XSync (dpy=0x270629d83800, discard=0) at /usr/src/debug/libX11-1.8.4-1.fc38.x86_64/src/Sync.c:44
#12 0x00007fb11a8d9dea in swrastXPutImage (draw=0x27062c4c8b80, op=3, srcx=0, srcy=0, x=0, y=0, w=1280, h=832, stride=5120, shmid=1638426, 
    data=0x7fb0cffde000 "", loaderPrivate=0x27062c46a200) at ../src/glx/drisw_glx.c:210
#13 0x00007fb11a8d9f35 in swrastPutImageShm2 (draw=0x27062c4c8b80, op=3, x=0, y=0, w=1280, h=832, stride=5120, shmid=1638426, shmaddr=0x7fb0cffde000 "", 
    offset=0, loaderPrivate=0x27062c46a200) at ../src/glx/drisw_glx.c:246
#14 0x00007fb0f08ffed1 in put_image_shm (drawable=0x27062c4c8b80, shmid=1638426, shmaddr=0x7fb0cffde000 "", offset=0, offset_x=0, x=0, y=0, width=1280, 
    height=832, stride=5120) at ../src/gallium/frontends/dri/drisw.c:88
#15 0x00007fb0f0900327 in drisw_put_image_shm (drawable=0x27062c4c8b80, shmid=1638426, shmaddr=0x7fb0cffde000 "", offset=0, offset_x=0, x=0, y=0, width=1280, 
    height=832, stride=5120) at ../src/gallium/frontends/dri/drisw.c:184
#16 0x00007fb0f1515f6d in dri_sw_displaytarget_display (ws=0x27062a271720, dt=0x27062c7a6de0, context_private=0x27062c4c8b80, box=0x0)
    at ../src/gallium/winsys/sw/dri/dri_sw_winsys.c:277
#17 0x00007fb0f154812f in llvmpipe_flush_frontbuffer (_screen=0x27062a737800, _pipe=0x27062c553000, resource=0x27062c617e00, level=0, layer=0, 
    context_private=0x27062c4c8b80, sub_box=0x0) at ../src/gallium/drivers/llvmpipe/lp_screen.c:852
#18 0x00007fb0f090039d in drisw_present_texture (pipe=0x27062c553000, drawable=0x27062c4c8b80, ptex=0x27062c617e00, sub_box=0x0)
    at ../src/gallium/frontends/dri/drisw.c:196
#19 0x00007fb0f09003fd in drisw_copy_to_front (pipe=0x27062c553000, drawable=0x27062c4c8b80, ptex=0x27062c617e00) at ../src/gallium/frontends/dri/drisw.c:212
#20 0x00007fb0f09005a2 in drisw_swap_buffers (drawable=0x27062c4c8b80) at ../src/gallium/frontends/dri/drisw.c:258
#21 0x00007fb0f08ff6af in driSwapBuffers (pdp=0x27062c4c8b80) at ../src/gallium/frontends/dri/dri_util.c:849
#22 0x00007fb11a8daf19 in driswSwapBuffers (pdraw=0x27062c46a200, target_msc=0, divisor=0, remainder=0, flush=1) at ../src/glx/drisw_glx.c:770
#23 0x00007fb11a8dcd76 in glXSwapBuffers (dpy=0x270629d83800, drawable=29360140) at ../src/glx/glxcmds.c:677
#24 0x00007fb11af668c3 in ?? () from /home/zmike/.local/share/Steam/ubuntu12_64/libSDL3.so.0
#25 0x00000000004acb75 in CCompositorGLThread::Redraw() ()
#26 0x00000000004ac938 in CCompositorGLThread::Run() ()
#27 0x00000000007a159a in SteamThreadTools::CThread::ThreadExceptionWrapper(void*) ()
#28 0x000000000079eee8 in CatchAndWriteContext_t::Invoke() ()
#29 0x000000000079eac8 in CatchAndWriteMiniDump_Impl(CatchAndWriteContext_t&) ()
#30 0x000000000079ef89 in CatchAndWriteMiniDumpForVoidPtrFn ()
#31 0x00000000007a1503 in SteamThreadTools::CThread::ThreadProc(void*) ()
#32 0x00007fb11a2ae907 in start_thread () from /lib64/libc.so.6
#33 0x00007fb11a334870 in clone3 () from /lib64/libc.so.6

X11 requests can't be made from the error handler, as this will cause a deadlock.

Mesa issue: https://gitlab.freedesktop.org/mesa/mesa/-/issues/9345

icculus commented 1 year ago

Looks like whatever caught the error then tried to call the previous error handler, tossing us back into SDL, deep inside OpenGL.

That's messy. :/

icculus commented 1 year ago

No, wait, I lied, it looks like a single error triggered in glX/OpenGL, and it's calling us directly from there.

zmike commented 1 year ago

The error is from mesa, but then the SDL handler is called and deadlocks. Messy indeed.

icculus commented 1 year ago

Actually, are X11 error handlers set per-thread? I'm wondering if this background thread triggered an error, and we caught in an unexpected place because a handler we set on the main thread happened to be in effect at the time.

Specifically, this is almost certainly X11_SafetyNetErrHandler...

https://github.com/libsdl-org/SDL/blob/b3861650d359984cb4d1e2472778a8c6a6348305/src/video/x11/SDL_x11video.c#L76-L93

...which lives the entire time SDL is initialized, which is probably not a great idea, considering all our other error handlers protect tiny blocks of code. But that presumably gets set on the main thread (unless Steam did the init on a background thread before anything else got there).

It's there to try to reset the screen mode in case of disaster, but maybe that's not necessary in modern times? It certainly was in the XVidMode days.

zmike commented 1 year ago

I think they're global? Would have to double check to see any TLS usage.

I think in case of disaster the modern thing to do is let things crash and restart, but don't quote me 😬

icculus commented 1 year ago

I think in case of disaster the modern thing to do is let things crash and restart, but don't quote me

Yeah, the problem was if SDL changed the X11 display's resolution for the game and then crashed without restoring it, it would screw up the desktop (sometimes to the point where you couldn't navigate to the system settings UI to put it back manually, if the resolution was extremely low).

But again, we don't support XVidMode in SDL3, and XRandR might handle this significantly better, so this error handler might be needlessly aggressive now...if so, we should simply remove it.

@slouken, you have an opinion on this, or have an X11 expert near your desk that might have an opinion on this?

lostgoat commented 1 year ago

Maybe the event could be captured and then processed somewhere else in the message loop outside be the X11 error handler? Since X11 is async it shouldn't matter if if the event is processed slightly later.

This may be annoying if Xlib is running in synchronous mode though. I often use that + a breakpoint in steam's X11 error handler to catch the function call that is causing the error. But we could have a hint to avoid the delay and we'd set that hint when we make the call to make X11 synchronous.

Alternatively, we could just just skip this bit of code for steam. We could either stomp the error handler after we call SDL init, or we could have some mechanism in SDL to avoid these calls. Either via a hint or by keeping this safetynet dormant until a video mode is set via SDL.

Some extra context in case it is useful:

durongze commented 2 months ago

X11_XRRGetScreenResourcesCurrent 什么时候会返回空呢? 怎样才能让它不返回空?