Closed smcv closed 2 years ago
Here's the patch I tried to revert this change, which makes this test pass again: https://github.com/smcv/SDL/commit/6677d176dc5b5885700af353d437d63d4b4860a7
The osk-sdl event loop as of 0.66 appears to be a busy-loop, which doesn't seem great to me (but it's not my code and maybe there is a good reason why it has to be):
while (luksDev.isLocked() && !done) {
show_osk = !keyboardToggle.isVisible();
if (SDL_WaitEvent(&event)) {
// an event was found
switch (event.type) {
// handle the keyboard
case SDL_KEYDOWN:
...
This seems to avoid the regression, although I don't know whether it's correct. (Indentation deliberately not adjusted to highlight what the significant change is)
diff --git a/src/events/SDL_events.c b/src/events/SDL_events.c
index f61075eeb..84d7b51af 100644
--- a/src/events/SDL_events.c
+++ b/src/events/SDL_events.c
@@ -954,6 +954,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
Uint32 expiration = 0;
/* First check for existing events */
+ if (timeout == 0) {
switch (SDL_PeepEvents(event, 1, SDL_GETEVENT, SDL_FIRSTEVENT, SDL_LASTEVENT)) {
case -1:
return 0;
@@ -967,6 +968,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
/* Has existing events */
return 1;
}
+ }
if (timeout > 0) {
start = SDL_GetTicks();
I think I have an idea of what is happening:
SDL_PushEvent
pushing a custom "render event" type into the queue.
SDL_PumpEvents
(which adds the input events to the queue) was only called after the render event was added (from SDL_WaitEvent
).WaitEvents
doesn't need to call SDL_PumpEvents
at all.SDL_PumpEvents
. Therefore render events pushed by the user are the only events ever pushed, and keep SDL 2.0.18 from pumping in SDL_WaitEvent
.I don't think this use case aligns with the intention of SDL's current API, though the discrepancies should probably be documented:
SDL_PushEvent
queues the event behind queued events, which are only guaranteed as recent as the last call to SDL_PumpEvents
.
SDL_PushEvent
(which is currently thread-safe) to trigger a call to SDL_PumpEvents
(which should currently only be called by the "main" thread that initialized the video subsystem).SDL_WaitEvents
and SDL_WaitEventsTimeout
only call SDL_PumpEvents
when the event queue is empty.
SDL_WaitEventTimeout
with timeout != 0
(which leaves SDL_Poll
incompatible for this usage scenario).EDIT: created downstream issue 138.
EDIT 2: On second thought, maybe it would just work if we generalized the polling sentinel solution from timeout == 0
to timeout <= 0
, when peeping and inserting.
Additionally for timeout <= 0
, we wouldn't want to return early but fall through to the later sections that implement the waiting for events.
But maybe I'm missing some issue that would introduce.
EDIT 3: edit 2 would address a separate issue and doesn't yet solve this. We would need to repurpose the polling sentinel even more:
WaitEventTimeout
, call PumpEvents
, set a "did_pump_events" flag, and insert the sentinel eventWaitEventTimeout
, check the "did_pump_events" flag and don't insert the sentinel (so we don't flood the queue with them)There are three modes:
SDL_PollEvent
)SDL_WaitEvent
)I think perhaps ideally we only want to enter the First check for existing events
block if there is a sentinel already queued, and not if the queue contains only non-sentinels?
My opinion is that we should just document this, especially if this is the only known application affected.
The optimization of avoiding SDL_PumpEvents()
is a nice performance uplift across the board, regardless of whether an application is doing SDL_PollEvent()
or SDL_WaitEvent()
. Doing a bunch of syscalls each time the application wants to get an event sucks, especially for high frequency input devices (a main target of #4794). It would be a shame to lose it in some/all cases due to a single broken application.
Normally I'd want to strongly avoid breaking applications, but in this particular case the application is using a pretty unusual event processing technique and the optimization has benefits for all SDL users.
I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents()
into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN
and SDL_MOUSEBUTTONUP
events still do not trigger. Some others, such as SDL_WINDOWEVENT
and SDL_KEYDOWN
seem to work fine.
This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).
Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.
Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event" instead of "WaitEvents doesn't need to call SDL_PumpEvents again until the event queue is empty."
This still solves the original problem (calling PumpEvents every time you poll/block for the next event in the queue), but should also fix this issue (be in a position where PumpEvents doesn't get called at all).
Am I way oversimplifying (or simply misunderstanding) the problem?
Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"
I think so, yes.
If I'm understanding the intention of #4794 correctly, the problem was that high-frequency input devices would enqueue events every time SDL_PumpEvents
is called, starving other event sources; and so you don't want to call SDL_PumpEvents
until all the events generated by the previous SDL_PumpEvents
have been dispatched to the library user (as indicated by reaching the sentinel). Is that right?
What's happening in osk-sdl is almost the other way round: every time osk-sdl processes one of its own "render events", it enqueues another render event, starving the event source that should be talking to X11. Like this:
SDL_PumpEvents
is never reachedIf we conditionally skipped the same block as in https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687, but instead of the condition for running it being if (timeout == 0)
, the condition was if (there is a sentinel event in the queue)
, then I think that would solve osk-sdl's problem:
SDL_PumpEvents
polls X11 and enqueues 0 or more input events, etc.To achieve that, I think it might be necessary to adjust some of the locking, so that the checks in the block affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 are all done within a single lock/unlock.
Also, if there is a sentinel but it is at the head of the queue, I think probably we want to fall through to starting a new poll cycle immediately, rather than considering the next event? Otherwise, osk-sdl could do this:
SDL_PumpEvents
polls X11 and enqueues 0 or more input events, etc.Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"
I need to look at this a bit more carefully, but before the poll sentinel was added, wouldn't the event queue just continue to grow as the application falls further and further behind? Also the poll sentinel commit did add a hint to disable them (which just uses the regular API for disabling events).
Also the poll sentinel commit did add a hint to disable them
That hint doesn't disable the fast-path when the queue is non-empty, only the sentinel event, so it doesn't help osk-sdl.
I don't quite follow how the scenario in @Undef-a 's report above manifested:
I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to
SDL_PumpEvents()
into it's main loop. This resolved the deadlock, butSDL_MOUSEBUTTONDOWN
andSDL_MOUSEBUTTONUP
events still do not trigger. Some others, such asSDL_WINDOWEVENT
andSDL_KEYDOWN
seem to work fine.This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).
If manual pumping doesn't fix it, I don't see how adapting SDL_WaitEventTimeout
would fix it either (unless I'm missing something) - is there another separate issue at play?
@smcv Have you attempted patching osk-sdl and encountered this as well?
Can you try this approach? https://github.com/libsdl-org/SDL/pull/5187
This is fixed in the latest commits.
After updating the SDL package in Debian from 2.0.16 to 2.0.18 (+ selected patches recommended by @slouken), an osk-sdl unit test that previously passed has started failing, blocking the "migration" process in Debian (the effect is similar to blocking the SDL update's inclusion in Debian betas).
I am a packager/maintainer of SDL in Debian, but I have no special knowledge of osk-sdl, which I installed for the first time in a test VM today to try to get an easier reproducer / more information.
Bisecting SDL indicates that commit 8bf32e12 "Improved SDL_PollEvent usage (#4794)" is the first one to fail. Reverting that commit (which requires a bit of conflict resolution to account for subsequent changes) makes the test pass again.
To reproduce:
osk-sdl
and SDL, plusXvfb
,xauth
and Debian'sxvfb-run
script (not part of Xvfb upstream, but many distros package it)git clone -b reproduce-sdl-regression https://gitlab.com/smcv/osk-sdl.git
osd.conf
to be correct for Debian/Ubuntu (please adjust as necessary for your distro)cd osk-sdl
meson _build && ninja -C _build
export OSK_SDL_EXE_PATH=$(pwd)/_build/osk-sdl
export OSK_SDL_CONF_PATH=$(pwd)/osk.conf
export TEST_DEBUG=1
export TEST_RECORD=$(pwd)/test.webm
export LD_LIBRARY_PATH=$HOME/SDL/build/.libs
(or wherever you built SDL)xvfb-run -a -s '-ac -screen 0 480x800x24' ./test/test_functional.sh test_keyscript_mouse_toggle_osk
Expected result:
Success!
and exits with status 0INFO: keyboard toggle hit x: 460, y: 775
Actual result:
Please note that when the test fails, it leaks an osk-sdl process which will occupy 100% of a CPU. This is probably a bug in the test script and/or osk-sdl, but it's not my bug, and probably not SDL's bug either; it's just something to watch out for while debugging, because these processes will build up across multiple failures.
Setting the
SDL_POLL_SENTINEL
environment variable to either 0 or 1 does not seem to have any effect on this.This is https://bugs.debian.org/1001809 downstream. At the moment I don't know whether it is a SDL bug, or whether osk-sdl is doing something wrong.