smcameron / space-nerds-in-space

Multi-player spaceship bridge simulator game. Captain your starship through adventures with your friends. See https://smcameron.github.io/space-nerds-in-space
GNU General Public License v2.0
724 stars 73 forks source link

Sometimes snis_client crashes on startup #320

Closed smcameron closed 2 years ago

smcameron commented 2 years ago

When this happens, it looks like this:

snis-client-start-bug

I think it may have something to do with the splash screen.

smcameron commented 2 years ago

I saw this in snis_client_log.txt when this happen (compiled without optimization, this time I got the same behavior on the screen, but snis_client didn't exit (or crash) as it usually does in this situation).

X Error of failed request:  BadWindow (invalid Window parameter)
  Major opcode of failed request:  3 (X_GetWindowAttributes)
  Resource id in failed request:  0x5600016
  Serial number of failed request:  567
  Current serial number in output stream:  568
smcameron commented 2 years ago

Got a backtrace:

(gdb) bt
#0  0x00007f3bfca6a695 in XNextEvent () from /lib/x86_64-linux-gnu/libX11.so.6
#1  0x00007f3bfcfaefed in ?? () from /lib/x86_64-linux-gnu/libSDL2-2.0.so.0
#2  0x00007f3bfcfb0834 in ?? () from /lib/x86_64-linux-gnu/libSDL2-2.0.so.0
#3  0x00007f3bfcf1662b in ?? () from /lib/x86_64-linux-gnu/libSDL2-2.0.so.0
#4  0x00007f3bfcf166a6 in ?? () from /lib/x86_64-linux-gnu/libSDL2-2.0.so.0
#5  0x000055fc0a956ceb in process_events (window=0x55fc0f136e50) at snis_client.c:23638
#6  0x000055fc0a9578bf in main (argc=4, argv=0x7ffd559ce6b8) at snis_client.c:23967
smcameron commented 2 years ago

Seems to be crashing inside SDL_PollEvent().

(gdb) list
23633   
23634       /* Our SDL event placeholder. */
23635       SDL_Event event;
23636   
23637       /* Grab all the events off the queue. */
23638       while (SDL_PollEvent(&event)) {
23639   
23640           if (ui_element_list_event(uiobjs, &event))
23641               break;
23642   

Maybe some race between SDL_ShowWindow() and SDL_PollEvent() ?

smcameron commented 2 years ago

A theory:

The splash screen is created and destroyed in its own thread...

static void *splash_screen_fn(__attribute__((unused)) void *arg)
{
   ... code omitted ...

       } while (splash_screen_still_needed);

out:
        if (image)
                SDL_DestroyTexture(image);
        if (renderer)
                SDL_DestroyRenderer(renderer);
        if (splash_window)
                SDL_DestroyWindow(splash_window);
        return NULL;

What if there are unconsumed X events for the splash screen window when the splash screen is unceremoniously destroyed?

SDL_PollEvent() does not take a window param, so presumably it's getting events for all windows of the app, and one of the windows disappears out from under it.

Maybe SDL_PollEvent is not expecting SDL_DestroyWindow to get called from another thread.

smcameron commented 2 years ago

Tentative workaround:

The work around is just to call SDL_HideWindow() instead of SDL_DestroyWindow() for the splash screen so as not to rip the window out from under SDL_PollEvent(). Hopefully SDL_HideWindow doesn't do anything that will bother SDL_PollEvent() if done concurrently. I wasn't able to repro the problem with this fix, but it is somewhat difficult to repro in any case. Time will tell. I will leave this open for now in case the problem persists.

smcameron commented 2 years ago

I think the "right" way to do this would be to create the splash window in the main thread, have the progress bar updated in the splash thread, and in the main loop, call something like maybe_destroy_splash_screen() which would pthread_join() the splash thread, then destroy the window. Since it would be in the main loop, it would not be concurrent with SDL_PollEvent(). However, coordinating the destruction of the splash thread then becomes onerous, introducing another lock/unlock in the main thread for synchronization and ensuring this destruction happens only once, which, not a huge deal, but somehow leaves a bad taste in my mouth, because it introduces per-frame computation for something that happens exactly once, and to top it off, it's more complicated. I think this HideWindow solution, though very hacky in its own way, at least doesn't put more computation into the main loop for something that happens exactly once, so I think I prefer it, if it proves to work, hacky as it is.