libretro / RetroArch

Cross-platform, sophisticated frontend for the libretro API. Licensed GPLv3.
http://www.libretro.com
GNU General Public License v3.0
10.21k stars 1.82k forks source link

Retroachievement causes RetroArch crash after load 2nd game #14742

Open Jobima1st opened 1 year ago

Jobima1st commented 1 year ago

Description

RetroArch consistently crashes when loading a second game only when achievements are enabled. When achievements are disabled it loads content perfectly fine after having already loaded content.

Steps to reproduce the bug

  1. login with retroachievements
  2. load any game with any core
  3. load another game

Environment information

Jamiras commented 1 year ago

I can reproduce on the stable 1.14.0 Windows x64 downloaded from libretro.com. I cannot reproduce with a msvc2019 release build of the v1.14.0 tag.

Attaching a debugger shows this for the stacktrace:

Unhandled exception at 0x00007FFA6E644673 (mswsock.dll) in retroarch.exe: 0xC0000005: Access violation writing location 0x0000000000000020.
>   mswsock.dll!00007ffa6e644673()  Unknown
    mswsock.dll!00007ffa6e64d691()  Unknown
    mswsock.dll!00007ffa6e6410d8()  Unknown
    kernel32.dll!00007ffa70347614() Unknown
    ntdll.dll!00007ffa717e26a1()    Unknown

Clarified steps to reproduce:

  1. Set up achievements
  2. Start a game with achievements
  3. When the game banner shows up (you have earned X of Y achievements), close the content
  4. Reload the same game
  5. Crash

The issue occurs even if I leave the game running for several minutes to ensure the background initialization task completes.

As I can't reproduce in the debugger, I'm not sure how to diagnose/repair. Also, as the problem doesn't occur when I build it myself, I can't bisect.

bizarf commented 1 year ago

I've been having this issue for a while now, and forgot to report it. I just grabbed stable 1.10.0 Windows 64bit build right now because I remember not having this issue somewhere back. I did the steps and it's not crashing when reloading or swapping games. I'm using the same config files from 1.14, so comparing the 1.10 build might help find the issue.

Jamiras commented 1 year ago

That's a good idea. I do have older releases lying around. The behavior starts appearing in 1.11.0. It does not occur in 1.10.3.

Unfortunately, there's over 1000 commits over five months between those versions: https://github.com/libretro/RetroArch/compare/v1.10.3...v1.11.0

bizarf commented 1 year ago

I accidentally discovered something strange on 1.1.4. I loaded up a NES game with the Mesen core and decided to swap to another NES game. It loaded up fine. Starting a new session with an arcade games with the FinalBurn Neo core is also letting me reload it, or swap to games on other cores just fine.

I didn't mention above, but my crashing was happening with GBA games on the mGBA core. Oddly, GBA games are loading fine after I started a NES game with the Mesen core on my current session. If I shut down Retroarch, then load a GBA game with mGBA, then RA will crash if I reload it. The crashing will also happen with GB and GBC games with the Gambatte core, and with Genesis games with the Genesis Plus GX core. I also tried loading a NES game with the FCEUmm core and it's crashing when trying to reload it or swap to a different game. I swapped the core back to Mesen for that NES game and it's not crashing.

I'm not sure if it works only for me, but loading a NES game with Mesen, or an arcade game with FinalBurn Neo on a new session might be a temporary workaround for some reason.

retroNUC commented 1 year ago

Seen over on Reddit too:

https://www.reddit.com/r/RetroArch/comments/1085v4m/retroachievements_crashing_retroarch_when_closing/

retroNUC commented 1 year ago

Tested some different builds myself, definitely some sort of undefined behaviour difference between MinGW (broken) vs MSVC (working).

Suppose I'd have to get the MinGW debugger up and running to pinpoint the exact spot, but running an ASan MSVC build certainly shows some runtime problems in rc_libretro.c that could be cleaned up - Lots of memcmp usage for string comparisons instead of strcmp/strncmp.

retroNUC commented 1 year ago
Thread 23 received signal SIGSEGV, Segmentation fault.
0x00007ffa6a8db277 in WSPStartup () from C:\Windows\system32\mswsock.dll

Caused by the rcheevos_async_begin_request in rcheevos_client_identify_game on the second game boot.

First request: image

Second request (causes segfault): image

My best guess is that data/userdata/datacopy is being freed early, then being dereferenced on the network task.

Side note - Spotted that "rcheevos_locals.load_info.hashes_tried" isn't being reset between sessions, that could probably lead to some funkiness.

retroNUC commented 1 year ago

Bisected to PR https://github.com/libretro/RetroArch/pull/14351 / Commit https://github.com/libretro/RetroArch/commit/e45958b25a140cbd549cfb8771395f3bcd922e48

(Network) Get rid of the timeout_enable parameter for socket_connect

retroNUC commented 1 year ago

The difference between MinGW crashing and MSVC working is that the latter doesn't actually have SSL implemented.

On the second boot, something is calling WSAStartup() (can't tell what's spawning this on a new thread) at the same time that WSAPoll() is being called in socket_connect_with_timeout(), which is what the above commit changed the http socket code to instead socket_connect(). On _WIN32 platforms through that last function, it wasn't actually doing any timeouts at all, despite the request in the parameter.

In any case, both SSL and non-SSL now go through socket_connect_with_timeout() and only the SSL one crashes, which makes me think something's not being reset/shutdown correctly for the next session.

retroNUC commented 1 year ago

Honestly can't tell whether I hate debugging network code or multithreaded code more.

It's a dangling async task for CHEEVOS_ASYNC_FETCH_HARDCORE_USER_UNLOCKS from the first run that's causing the problems once socket_connect_with_timeout() starts being used from the second run's rcheevos async tasks.

Actually, all of the first run's async tasks seem to be sticking around until the game shutdown, which manages to clean up all apart from the hardcore data. Would have though all of those threads would have cleaned themselves up as soon as HTTP requests were finished.

LibretroAdmin commented 1 year ago

@retroNUC How do we proceed from here? When you revert that PR locally does it work properly then? If so, we can start figuring out a way we can fundamentally fix this.

Jamiras commented 1 year ago

It's the addition of this code that's causing the problem (thanks for bisecting): https://github.com/libretro/RetroArch/blob/e45958b25a140cbd549cfb8771395f3bcd922e48/libretro-common/net/net_socket_ssl_mbed.c#L118-L126 which was added by #14351: https://github.com/libretro/RetroArch/pull/14351/files#diff-e68eaa44ad3da41536e0a0bc37e9e75f13750d7d4c29fd6cb54305b98e736f9fR225-R237

I'm not a socket programmer, so I don't know what the actual issue is, but passing false for timeout_enable in net_http_new_socket does seem to fix the issue (as it avoids the referenced new code):

#ifdef _WIN32
      if (ssl_socket_connect(conn->sock_state.ssl_ctx, addr, false, true)
            < 0)
#else
      if (ssl_socket_connect(conn->sock_state.ssl_ctx, addr, true, true)
            < 0)
#endif

Note that it was passing true prior to the changes in #14351. The function was just ignoring the parameter.

I don't know if this change would have any unintended side effects. Perhaps it's better to continue passing true and just make net_socket_ssl_mbed.c ignore the parameter as it had been before.

retroNUC commented 1 year ago

Submitted the temp fix after speaking to @LibretroAdmin, pretty much along the lines of what @Jamiras suggested.

Keep this issue open until we fix properly, though?

Immersion95 commented 1 year ago

Same thing here