smcv commented 2 years ago

After updating the SDL package in Debian from 2.0.16 to 2.0.18 (+ selected patches recommended by @slouken), an osk-sdl unit test that previously passed has started failing, blocking the "migration" process in Debian (the effect is similar to blocking the SDL update's inclusion in Debian betas).

I am a packager/maintainer of SDL in Debian, but I have no special knowledge of osk-sdl, which I installed for the first time in a test VM today to try to get an easier reproducer / more information.

Bisecting SDL indicates that commit 8bf32e12 "Improved SDL_PollEvent usage (#4794)" is the first one to fail. Reverting that commit (which requires a bit of conflict resolution to account for subsequent changes) makes the test pass again.

To reproduce:

have build-dependencies for osk-sdl and SDL, plus Xvfb, xauth and Debian's xvfb-run script (not part of Xvfb upstream, but many distros package it)
build SDL itself
git clone -b reproduce-sdl-regression https://gitlab.com/smcv/osk-sdl.git
- this branch is based on upstream release 0.66, and adds a few commits to add more debugging to the test script, and adjust a font path in osd.conf to be correct for Debian/Ubuntu (please adjust as necessary for your distro)
cd osk-sdl
meson _build && ninja -C _build
export OSK_SDL_EXE_PATH=$(pwd)/_build/osk-sdl
export OSK_SDL_CONF_PATH=$(pwd)/osk.conf
export TEST_DEBUG=1
export TEST_RECORD=$(pwd)/test.webm
export LD_LIBRARY_PATH=$HOME/SDL/build/.libs (or wherever you built SDL)
xvfb-run -a -s '-ac -screen 0 480x800x24' ./test/test_functional.sh test_keyscript_mouse_toggle_osk

Expected result:

The test passes, logs Success! and exits with status 0
In particular, when the test "clicks" on the button to open the on-screen keyboard, osk-sdl logs INFO: keyboard toggle hit x: 460, y: 775
good.log, good.webm in https://people.debian.org/~smcv/bug1001809/

Actual result:

The test fails
In particular, when the test "clicks" on the button to open the on-screen keyboard, osk-sdl does not log anything and the OSK does not open
bad.log, bad.webm in https://people.debian.org/~smcv/bug1001809/

Please note that when the test fails, it leaks an osk-sdl process which will occupy 100% of a CPU. This is probably a bug in the test script and/or osk-sdl, but it's not my bug, and probably not SDL's bug either; it's just something to watch out for while debugging, because these processes will build up across multiple failures.

Setting the SDL_POLL_SENTINEL environment variable to either 0 or 1 does not seem to have any effect on this.

This is https://bugs.debian.org/1001809 downstream. At the moment I don't know whether it is a SDL bug, or whether osk-sdl is doing something wrong.

smcv commented 2 years ago

Here's the patch I tried to revert this change, which makes this test pass again: https://github.com/smcv/SDL/commit/6677d176dc5b5885700af353d437d63d4b4860a7

smcv commented 2 years ago

The osk-sdl event loop as of 0.66 appears to be a busy-loop, which doesn't seem great to me (but it's not my code and maybe there is a good reason why it has to be):

    while (luksDev.isLocked() && !done) {
        show_osk = !keyboardToggle.isVisible();
        if (SDL_WaitEvent(&event)) {
            // an event was found
            switch (event.type) {
            // handle the keyboard
            case SDL_KEYDOWN:
...

smcv commented 2 years ago

This seems to avoid the regression, although I don't know whether it's correct. (Indentation deliberately not adjusted to highlight what the significant change is)

diff --git a/src/events/SDL_events.c b/src/events/SDL_events.c
index f61075eeb..84d7b51af 100644
--- a/src/events/SDL_events.c
+++ b/src/events/SDL_events.c
@@ -954,6 +954,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
     Uint32 expiration = 0;

     /* First check for existing events */
+    if (timeout == 0) {
     switch (SDL_PeepEvents(event, 1, SDL_GETEVENT, SDL_FIRSTEVENT, SDL_LASTEVENT)) {
     case -1:
         return 0;
@@ -967,6 +968,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
         /* Has existing events */
         return 1;
     }
+    }

     if (timeout > 0) {
         start = SDL_GetTicks();

rohlem commented 2 years ago

I think I have an idea of what is happening:

The application's main event loop is filled with calls to SDL_PushEvent pushing a custom "render event" type into the queue.
- In particular, the code handling the render event may continuously push new render events into the queue.
- The programmer's assumption must have been that "it will be added to the end of the event queue, behind new input events".
- This is not true, because SDL_PumpEvents (which adds the input events to the queue) was only called after the render event was added (from SDL_WaitEvent).
- In 2.0.16, this resulted in one stale frame, but an otherwise working program.
The optimization in commit 8bf32e1 is incompatible with this usage scenario.
- Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.
- However, the evnt queue keeps getting filled without any calls to SDL_PumpEvents. Therefore render events pushed by the user are the only events ever pushed, and keep SDL 2.0.18 from pumping in SDL_WaitEvent.

I don't think this use case aligns with the intention of SDL's current API, though the discrepancies should probably be documented:

SDL_PushEvent queues the event behind queued events, which are only guaranteed as recent as the last call to SDL_PumpEvents.
- Changing this would require SDL_PushEvent (which is currently thread-safe) to trigger a call to SDL_PumpEvents (which should currently only be called by the "main" thread that initialized the video subsystem).
SDL_WaitEvents and SDL_WaitEventsTimeout only call SDL_PumpEvents when the event queue is empty.
- The posted patch only sidesteps the optimization (effectively reverting 8bf32e1 ) for calls to SDL_WaitEventTimeout with timeout != 0 (which leaves SDL_Poll incompatible for this usage scenario).

EDIT: created downstream issue 138.

EDIT 2: On second thought, maybe it would just work if we generalized the polling sentinel solution from timeout == 0 to timeout <= 0, when peeping and inserting. Additionally for timeout <= 0, we wouldn't want to return early but fall through to the later sections that implement the waiting for events. But maybe I'm missing some issue that would introduce.

EDIT 3: edit 2 would address a separate issue and doesn't yet solve this. We would need to repurpose the polling sentinel even more:

on the start of the first WaitEventTimeout, call PumpEvents, set a "did_pump_events" flag, and insert the sentinel event
on subsequent WaitEventTimeout, check the "did_pump_events" flag and don't insert the sentinel (so we don't flood the queue with them)
on encountering the sentinel, reset the "did_pump_events" flag for the next polling cycle

smcv commented 2 years ago

There are three modes:

polling (timeout == 0, SDL_PollEvent)
waiting until event or timeout (timeout > 0)
waiting until event, no timeout (timeout < 0, SDL_WaitEvent)

4794 seems to have been intended to help the polling case, but osk-sdl is using the waiting mode (although because of the way it re-queues events, there is usually an event available, so it rarely actually waits).

I think perhaps ideally we only want to enter the First check for existing events block if there is a sentinel already queued, and not if the queue contains only non-sentinels?

cgutman commented 2 years ago

My opinion is that we should just document this, especially if this is the only known application affected.

The optimization of avoiding SDL_PumpEvents() is a nice performance uplift across the board, regardless of whether an application is doing SDL_PollEvent() or SDL_WaitEvent(). Doing a bunch of syscalls each time the application wants to get an event sucks, especially for high frequency input devices (a main target of #4794). It would be a shame to lose it in some/all cases due to a single broken application.

Normally I'd want to strongly avoid breaking applications, but in this particular case the application is using a pretty unusual event processing technique and the optimization has benefits for all SDL users.

Undef-a commented 2 years ago

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

icculus commented 2 years ago

Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event" instead of "WaitEvents doesn't need to call SDL_PumpEvents again until the event queue is empty."

This still solves the original problem (calling PumpEvents every time you poll/block for the next event in the queue), but should also fix this issue (be in a position where PumpEvents doesn't get called at all).

Am I way oversimplifying (or simply misunderstanding) the problem?

smcv commented 2 years ago

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

I think so, yes.

If I'm understanding the intention of #4794 correctly, the problem was that high-frequency input devices would enqueue events every time SDL_PumpEvents is called, starving other event sources; and so you don't want to call SDL_PumpEvents until all the events generated by the previous SDL_PumpEvents have been dispatched to the library user (as indicated by reaching the sentinel). Is that right?

What's happening in osk-sdl is almost the other way round: every time osk-sdl processes one of its own "render events", it enqueues another render event, starving the event source that should be talking to X11. Like this:

start with one render event in the queue
the block that is affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 sees that the event queue is non-empty, pops the render event from the queue and does an early-return
osk-sdl processes the render event and enqueues another one
we effectively go back to the beginning
the code that talks to X11 in SDL_PumpEvents is never reached
osk-sdl never escapes from the "currently animating" state

If we conditionally skipped the same block as in https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687, but instead of the condition for running it being if (timeout == 0), the condition was if (there is a sentinel event in the queue), then I think that would solve osk-sdl's problem:

start with one render event in the queue
the block that is affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 is skipped
SDL_PumpEvents polls X11 and enqueues 0 or more input events, etc.
also enqueue 1 sentinel
dispatch the first event, which is the render event
- osk-sdl enqueues another one, but it is enqueued after the sentinel
keep dispatching events until we reach the sentinel
- the block that is affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 sees that the event queue contains a sentinel and does an early-return, so we still avoid the issue fixed by #4794
after the sentinel has been removed from the queue, we effectively go back to the beginning

To achieve that, I think it might be necessary to adjust some of the locking, so that the checks in the block affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 are all done within a single lock/unlock.

smcv commented 2 years ago

Also, if there is a sentinel but it is at the head of the queue, I think probably we want to fall through to starting a new poll cycle immediately, rather than considering the next event? Otherwise, osk-sdl could do this:

start with one render event in the queue
the early-return block is skipped
SDL_PumpEvents polls X11 and enqueues 0 or more input events, etc.
also enqueue 1 sentinel
dispatch the first event, which is the render event
- osk-sdl enqueues another one [A], but it is enqueued after the sentinel
keep dispatching events until we reach the sentinel
when the sentinel is the head of the queue, the early-return block checks the new head of the queue
the new head of the queue is the render event added at [A]
we early-return and dispatch the render event
osk-sdl enqueues another render event
oops, we loop forever without polling X11 again

smcv commented 2 years ago

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

5184 implements this.

0x1F9F1 commented 2 years ago

I need to look at this a bit more carefully, but before the poll sentinel was added, wouldn't the event queue just continue to grow as the application falls further and further behind? Also the poll sentinel commit did add a hint to disable them (which just uses the regular API for disabling events).

smcv commented 2 years ago

Also the poll sentinel commit did add a hint to disable them

That hint doesn't disable the fast-path when the queue is non-empty, only the sentinel event, so it doesn't help osk-sdl.

rohlem commented 2 years ago

I don't quite follow how the scenario in @Undef-a 's report above manifested:

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

If manual pumping doesn't fix it, I don't see how adapting SDL_WaitEventTimeout would fix it either (unless I'm missing something) - is there another separate issue at play? @smcv Have you attempted patching osk-sdl and encountered this as well?

slouken commented 2 years ago

Can you try this approach? https://github.com/libsdl-org/SDL/pull/5187

slouken commented 2 years ago

This is fixed in the latest commits.

libsdl-org / SDL

2.0.18 regression: osk-sdl unit test fails #5136

4794 seems to have been intended to help the polling case, but osk-sdl is using the waiting mode (although because of the way it re-queues events, there is usually an event available, so it rarely actually waits).

5184 implements this.