libsdl-org / SDL

Simple Directmedia Layer
https://libsdl.org
zlib License
9.86k stars 1.83k forks source link

2.0.18 regression: osk-sdl unit test fails #5136

Closed smcv closed 2 years ago

smcv commented 2 years ago

After updating the SDL package in Debian from 2.0.16 to 2.0.18 (+ selected patches recommended by @slouken), an osk-sdl unit test that previously passed has started failing, blocking the "migration" process in Debian (the effect is similar to blocking the SDL update's inclusion in Debian betas).

I am a packager/maintainer of SDL in Debian, but I have no special knowledge of osk-sdl, which I installed for the first time in a test VM today to try to get an easier reproducer / more information.

Bisecting SDL indicates that commit 8bf32e12 "Improved SDL_PollEvent usage (#4794)" is the first one to fail. Reverting that commit (which requires a bit of conflict resolution to account for subsequent changes) makes the test pass again.

To reproduce:

Expected result:

Actual result:

Please note that when the test fails, it leaks an osk-sdl process which will occupy 100% of a CPU. This is probably a bug in the test script and/or osk-sdl, but it's not my bug, and probably not SDL's bug either; it's just something to watch out for while debugging, because these processes will build up across multiple failures.

Setting the SDL_POLL_SENTINEL environment variable to either 0 or 1 does not seem to have any effect on this.

This is https://bugs.debian.org/1001809 downstream. At the moment I don't know whether it is a SDL bug, or whether osk-sdl is doing something wrong.

smcv commented 2 years ago

Here's the patch I tried to revert this change, which makes this test pass again: https://github.com/smcv/SDL/commit/6677d176dc5b5885700af353d437d63d4b4860a7

smcv commented 2 years ago

The osk-sdl event loop as of 0.66 appears to be a busy-loop, which doesn't seem great to me (but it's not my code and maybe there is a good reason why it has to be):

    while (luksDev.isLocked() && !done) {
        show_osk = !keyboardToggle.isVisible();
        if (SDL_WaitEvent(&event)) {
            // an event was found
            switch (event.type) {
            // handle the keyboard
            case SDL_KEYDOWN:
...
smcv commented 2 years ago

This seems to avoid the regression, although I don't know whether it's correct. (Indentation deliberately not adjusted to highlight what the significant change is)

diff --git a/src/events/SDL_events.c b/src/events/SDL_events.c
index f61075eeb..84d7b51af 100644
--- a/src/events/SDL_events.c
+++ b/src/events/SDL_events.c
@@ -954,6 +954,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
     Uint32 expiration = 0;

     /* First check for existing events */
+    if (timeout == 0) {
     switch (SDL_PeepEvents(event, 1, SDL_GETEVENT, SDL_FIRSTEVENT, SDL_LASTEVENT)) {
     case -1:
         return 0;
@@ -967,6 +968,7 @@ SDL_WaitEventTimeout(SDL_Event * event, int timeout)
         /* Has existing events */
         return 1;
     }
+    }

     if (timeout > 0) {
         start = SDL_GetTicks();
rohlem commented 2 years ago

I think I have an idea of what is happening:


I don't think this use case aligns with the intention of SDL's current API, though the discrepancies should probably be documented:

EDIT: created downstream issue 138.


EDIT 2: On second thought, maybe it would just work if we generalized the polling sentinel solution from timeout == 0 to timeout <= 0, when peeping and inserting. Additionally for timeout <= 0, we wouldn't want to return early but fall through to the later sections that implement the waiting for events. But maybe I'm missing some issue that would introduce.

EDIT 3: edit 2 would address a separate issue and doesn't yet solve this. We would need to repurpose the polling sentinel even more:

smcv commented 2 years ago

There are three modes:

4794 seems to have been intended to help the polling case, but osk-sdl is using the waiting mode (although because of the way it re-queues events, there is usually an event available, so it rarely actually waits).

I think perhaps ideally we only want to enter the First check for existing events block if there is a sentinel already queued, and not if the queue contains only non-sentinels?

cgutman commented 2 years ago

My opinion is that we should just document this, especially if this is the only known application affected.

The optimization of avoiding SDL_PumpEvents() is a nice performance uplift across the board, regardless of whether an application is doing SDL_PollEvent() or SDL_WaitEvent(). Doing a bunch of syscalls each time the application wants to get an event sucks, especially for high frequency input devices (a main target of #4794). It would be a shame to lose it in some/all cases due to a single broken application.

Normally I'd want to strongly avoid breaking applications, but in this particular case the application is using a pretty unusual event processing technique and the optimization has benefits for all SDL users.

Undef-a commented 2 years ago

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

icculus commented 2 years ago

Its reasoning was that "as long as there's something in the event queue, WaitEvents doesn't need to call SDL_PumpEvents at all.

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event" instead of "WaitEvents doesn't need to call SDL_PumpEvents again until the event queue is empty."

This still solves the original problem (calling PumpEvents every time you poll/block for the next event in the queue), but should also fix this issue (be in a position where PumpEvents doesn't get called at all).

Am I way oversimplifying (or simply misunderstanding) the problem?

smcv commented 2 years ago

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

I think so, yes.

If I'm understanding the intention of #4794 correctly, the problem was that high-frequency input devices would enqueue events every time SDL_PumpEvents is called, starving other event sources; and so you don't want to call SDL_PumpEvents until all the events generated by the previous SDL_PumpEvents have been dispatched to the library user (as indicated by reaching the sentinel). Is that right?

What's happening in osk-sdl is almost the other way round: every time osk-sdl processes one of its own "render events", it enqueues another render event, starving the event source that should be talking to X11. Like this:

If we conditionally skipped the same block as in https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687, but instead of the condition for running it being if (timeout == 0), the condition was if (there is a sentinel event in the queue), then I think that would solve osk-sdl's problem:

To achieve that, I think it might be necessary to adjust some of the locking, so that the checks in the block affected by https://github.com/libsdl-org/SDL/issues/5136#issuecomment-1000939687 are all done within a single lock/unlock.

smcv commented 2 years ago

Also, if there is a sentinel but it is at the head of the queue, I think probably we want to fall through to starting a new poll cycle immediately, rather than considering the next event? Otherwise, osk-sdl could do this:

smcv commented 2 years ago

Perhaps the plan should be "WaitEvents doesn't need to call SDL_PumpEvents again until the previous SDL_PumpEvents call has completely drained, as indicated by the sentinel event"

5184 implements this.

0x1F9F1 commented 2 years ago

I need to look at this a bit more carefully, but before the poll sentinel was added, wouldn't the event queue just continue to grow as the application falls further and further behind? Also the poll sentinel commit did add a hint to disable them (which just uses the regular API for disabling events).

smcv commented 2 years ago

Also the poll sentinel commit did add a hint to disable them

That hint doesn't disable the fast-path when the queue is non-empty, only the sentinel event, so it doesn't help osk-sdl.

rohlem commented 2 years ago

I don't quite follow how the scenario in @Undef-a 's report above manifested:

I've just tested the proposed fix for the problematic application (osk-sdl) by adding a call to SDL_PumpEvents() into it's main loop. This resolved the deadlock, but SDL_MOUSEBUTTONDOWN and SDL_MOUSEBUTTONUP events still do not trigger. Some others, such as SDL_WINDOWEVENT and SDL_KEYDOWN seem to work fine.

This is the case when built on Debian Sid (SDL 2.0.18) but works fine when built on Debian Bullseye (2.0.16).

If manual pumping doesn't fix it, I don't see how adapting SDL_WaitEventTimeout would fix it either (unless I'm missing something) - is there another separate issue at play? @smcv Have you attempted patching osk-sdl and encountered this as well?

slouken commented 2 years ago

Can you try this approach? https://github.com/libsdl-org/SDL/pull/5187

slouken commented 2 years ago

This is fixed in the latest commits.