libsdl-org / sdl12-compat

An SDL-1.2 compatibility layer that uses SDL 2.0 behind the scenes.
Other
191 stars 40 forks source link

makes SDL_perl regress when it (incorrectly) frees a former video surface #305

Closed smcv closed 11 months ago

smcv commented 1 year ago

Since trying to switch from classic SDL 1.2 to sdl12-compat in Debian, we've been seeing crashes in the test perl t/core_video.t in the test suite of the SDL Perl bindings (packaged as libsdl-perl). On 32-bit platforms it seems to segfault repeatably. On 64-bit platforms it usually succeeds, but using valgrind or AddressSanitizer reveals that there is still a use-after-free, it's just not fatal for whatever reason.

I was able to cut down the failing test to this C code, which I think is equivalent to what the relevant part of t/core_video.t is doing:

/* gcc -ot t.c `pkg-config --cflags --libs sdl`
 * SDL_VIDEODRIVER=dummy valgrind --leak-check=no ./t
 */
#include <SDL/SDL.h>

int main(void)
{
    SDL_Surface *one;
    SDL_Surface *two;

    one = SDL_SetVideoMode(640, 480, 32, SDL_SWSURFACE);
    two = SDL_SetVideoMode(100, 100, 16, SDL_SWSURFACE);

    /* I'm not sure what order the Perl code is really freeing them in, it's using automatic memory management */
    SDL_FreeSurface(two);
    SDL_FreeSurface(one);
    return 0;
}

With classic SDL 1.2 (1.2.15, from Debian 12) this succeeds, and valgrind --leak-check=no reports no memory errors.

With sdl12-compat (1.2.64 plus the fixes I contributed since then, from Debian unstable) this fails:

==17851== Invalid read of size 4
==17851==    at 0x4865EEC: SDL_FreeSurface (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x1091FA: main (in /home/user/t)
==17851==  Address 0x4c9b680 is 56 bytes inside a block of size 60 free'd
==17851==    at 0x4841EA7: free (vg_replace_malloc.c:872)
==17851==    by 0x4F4BE17: ??? (in /usr/lib/i386-linux-gnu/libSDL2-2.0.so.0.2800.1)
==17851==    by 0x4F4C170: ??? (in /usr/lib/i386-linux-gnu/libSDL2-2.0.so.0.2800.1)
==17851==    by 0x4865F29: SDL_FreeSurface (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x4866083: ??? (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x486B554: SDL_SetVideoMode (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x1091DB: main (in /home/user/t)
==17851==  Block was alloc'd at
==17851==    at 0x483F660: malloc (vg_replace_malloc.c:381)
==17851==    by 0x4F4BE97: ??? (in /usr/lib/i386-linux-gnu/libSDL2-2.0.so.0.2800.1)
==17851==    by 0x4F4C054: ??? (in /usr/lib/i386-linux-gnu/libSDL2-2.0.so.0.2800.1)
==17851==    by 0x485E177: ??? (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x4865C3F: SDL_CreateRGBSurface (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x486BD88: SDL_SetVideoMode (in /usr/lib/i386-linux-gnu/libSDL-1.2.so.1.2.64)
==17851==    by 0x1091C8: main (in /home/user/t)

I tried to understand what the ownership model is for these surfaces, but found it quite confusing.

smcv commented 1 year ago

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041211 downstream.

sezero commented 1 year ago

valgrind report on i686-linux:

==9153== Invalid read of size 4
==9153==    at 0x401FD79: SDL_FreeSurface (SDL12_compat.c:5182)
==9153==    by 0x8048554: main (1.c:16)
==9153==  Address 0x406ad58 is 56 bytes inside a block of size 60 free'd
==9153==    at 0x4006CAF: free (vg_replace_malloc.c:446)
==9153==    by 0x44FBEBC: real_free (SDL_malloc.c:5199)
==9153==    by 0x44FBD21: SDL_free_REAL (SDL_malloc.c:5320)
==9153==    by 0x448453D: SDL_free (SDL_dynapi_procs.h:408)
==9153==    by 0x401FDDB: SDL_FreeSurface (SDL12_compat.c:5190)
==9153==    by 0x4020A2E: EndVidModeCreate (SDL12_compat.c:5508)
==9153==    by 0x4022DE8: SetVideoModeImpl (SDL12_compat.c:5982)
==9153==    by 0x4023DC8: SDL_SetVideoMode (SDL12_compat.c:6329)
==9153==    by 0x8048538: main (1.c:12)
==9153== 
==9153== Invalid write of size 4
==9153==    at 0x401FD82: SDL_FreeSurface (SDL12_compat.c:5182)
==9153==    by 0x8048554: main (1.c:16)
==9153==  Address 0x406ad58 is 56 bytes inside a block of size 60 free'd
==9153==    at 0x4006CAF: free (vg_replace_malloc.c:446)
==9153==    by 0x44FBEBC: real_free (SDL_malloc.c:5199)
==9153==    by 0x44FBD21: SDL_free_REAL (SDL_malloc.c:5320)
==9153==    by 0x448453D: SDL_free (SDL_dynapi_procs.h:408)
==9153==    by 0x401FDDB: SDL_FreeSurface (SDL12_compat.c:5190)
==9153==    by 0x4020A2E: EndVidModeCreate (SDL12_compat.c:5508)
==9153==    by 0x4022DE8: SetVideoModeImpl (SDL12_compat.c:5982)
==9153==    by 0x4023DC8: SDL_SetVideoMode (SDL12_compat.c:6329)
==9153==    by 0x8048538: main (1.c:12)
==9153== 
==9153== Invalid read of size 4
==9153==    at 0x401FD88: SDL_FreeSurface (SDL12_compat.c:5183)
==9153==    by 0x8048554: main (1.c:16)
==9153==  Address 0x406ad58 is 56 bytes inside a block of size 60 free'd
==9153==    at 0x4006CAF: free (vg_replace_malloc.c:446)
==9153==    by 0x44FBEBC: real_free (SDL_malloc.c:5199)
==9153==    by 0x44FBD21: SDL_free_REAL (SDL_malloc.c:5320)
==9153==    by 0x448453D: SDL_free (SDL_dynapi_procs.h:408)
==9153==    by 0x401FDDB: SDL_FreeSurface (SDL12_compat.c:5190)
==9153==    by 0x4020A2E: EndVidModeCreate (SDL12_compat.c:5508)
==9153==    by 0x4022DE8: SetVideoModeImpl (SDL12_compat.c:5982)
==9153==    by 0x4023DC8: SDL_SetVideoMode (SDL12_compat.c:6329)
==9153==    by 0x8048538: main (1.c:12)
==9153== 
==9153== 
==9153== HEAP SUMMARY:
==9153==     in use at exit: 125,634 bytes in 236 blocks
==9153==   total heap usage: 384 allocs, 148 frees, 3,816,824 bytes allocated
==9153== 
==9153== LEAK SUMMARY:
==9153==    definitely lost: 0 bytes in 0 blocks
==9153==    indirectly lost: 0 bytes in 0 blocks
==9153==      possibly lost: 80,040 bytes in 2 blocks
==9153==    still reachable: 45,594 bytes in 234 blocks
==9153==         suppressed: 0 bytes in 0 blocks
==9153== Rerun with --leak-check=full to see details of leaked memory
==9153== 
==9153== For counts of detected and suppressed errors, rerun with: -v
==9153== ERROR SUMMARY: 3 errors from 3 contexts (suppressed: 30 from 13)
smcv commented 1 year ago

The larger test-case that I condensed to get this should work on any recent-ish Debian/Ubuntu system, if you happen to have one available:

apt install libsdl-perl
apt source libsdl-perl
cd libsdl-perl-*
perl ./t/core_video.t

plus an appropriate LD_LIBRARY_PATH to get sdl12-compat loaded, if your distro is old enough to default to classic SDL 1.2, which at the time of writing applies to everything except Debian unstable.

I believe fixing this libsdl-perl test suite regression is the last blocker for getting sdl12-compat used as the replacement for classic SDL 1.2 in Debian 13, and hopefully Ubuntu 23.10 as well.

If my reproducer (and therefore libsdl-perl) is doing something that was never meant to work, then fixing it from the API-user side is likely also an option.

icculus commented 1 year ago

So sdl12-compat generally tries to be bug-for-bug compatible with classic 1.2, but to be clear: calling SDL_FreeSurface on the return value of SDL_SetVideoMode is illegal.

Doubly-so for freeing a surface that was possibly already freed by a second call to SDL_SetVideoMode. That this test works in classic 1.2 is pure luck.

icculus commented 1 year ago

Yeah, it's the second FreeSurface that's upset; both SDL 1.2 and sdl12-compat see the first one is the current screen surface and treat this as a no-op. The second call isn't a valid pointer anymore, hence the problem.

This works in SDL 1.2 because SDL_SetVideoMode happens to return the same surface pointer with the dummy video backend.

But to be fair, the X11 backend preserves the pointer, too, in this case. Ugh. Okay, let me think on this.

smcv commented 1 year ago

Having it be invalid to free the result of SDL_SetVideoMode fits my limited understanding of the memory model in use here. In the terms I'm used to using in the GLib world, I think SDL_SetVideoMode returns a pointer "borrowed" from SDL-internal infrastructure (the caller must not free it, and it may become invalid at a subsequent video mode change), while things like SDL_CreateRGBSurface return a pointer that becomes "owned" by the caller (which may free it whenever convenient). Is that right?

SDL_FreeSurface() special-cases the current video surface to be ignored (you can try as hard as you like to free it in client code, but it will not happen), but doesn't have that special case for a former video surface. Historically, classic SDL 1.2 had a similar setup but with a couple of different variations of what "the current video surface" meant.

In the Perl bindings, the SDL_FreeSurface call is not explicit in the Perl source code: the binding registers a destructor function surface_DESTROY for the SDL::Surface Perl wrappers around C SDL_Surface objects, which calls SDL_FreeSurface internally.

If calling SDL_FreeSurface on the return value of SDL_SetVideoMode is considered to be a programming error (invalid/illegal/undefined behaviour), then I think the Perl binding would have to distinguish between functions where the caller owns the result, like SDL_CreateRGBSurface (which would return a Perl SDL::Surface flagged as "OK to free the surface when the wrapper is destroyed"), and functions where the caller "borrows" the result from SDL, like SDL_SetVideoMode (which would return a Perl SDL::Surface flagged as "don't free the surface when the wrapper is destroyed"). Does that sound about right?

If sdl12-compat aims to be bug-for-bug compatible with classic SDL 1.2 but libsdl-perl is doing something wrong, then it's probably best to fix both sdl12-compat and libsdl-perl.

icculus commented 1 year ago

So here's why this is happening in sdl12-compat and not SDL-1.2:

5981        } else if (VideoSurface12 && (VideoSurface12->surface20->format->format != appfmt)) {
5982            EndVidModeCreate();  /* rebuild the window if changing pixel format */

The internal 2.0 surface format is different than what the app is requesting, so we nuke the surface and start from scratch, even in the dummy driver. appfmt=0x15151002 and VideoSurface12->surface20->format->format=0x16161804. I need to dig further still, but if those match when we hit this point (or can fix the conditional), this test will work.

smcv commented 1 year ago

Is it also considered to be undefined behaviour (invalid, illegal, programming error, whatever) to free the result of SDL_GetVideoSurface()? I think probably yes, for the same reasons?

slouken commented 1 year ago

So here's why this is happening in sdl12-compat and not SDL-1.2:

5981      } else if (VideoSurface12 && (VideoSurface12->surface20->format->format != appfmt)) {
5982          EndVidModeCreate();  /* rebuild the window if changing pixel format */

The internal 2.0 surface format is different than what the app is requesting, so we nuke the surface and start from scratch, even in the dummy driver. appfmt=0x15151002 and VideoSurface12->surface20->format->format=0x16161804. I need to dig further still, but if those match when we hit this point (or can fix the conditional), this test will work.

Or we can just hang onto all the old video surfaces (maybe just free the pixels) until SDL_Quit(). It would take a small amount of memory more, but it wouldn't crash. :)

smcv commented 1 year ago

Or we can just hang onto all the old video surfaces (maybe just free the pixels) until SDL_Quit().

If the Perl bindings are using automatic memory management, but have an explicit SDL_Quit() call, might they end up trying to free a video surface after SDL_Quit()?

(Assuming that real-world programs only call SDL_Quit() during exit, there isn't really much difference between holding onto a resource until SDL_Quit() and intentionally leaking it, so perhaps intentionally leaking it would be pragmatic?)

slouken commented 1 year ago

Hmm, true...

smcv commented 1 year ago

calling SDL_FreeSurface on the return value of SDL_SetVideoMode is illegal

I reported https://github.com/PerlGameDev/SDL/issues/305.

Is it also considered to be undefined behaviour (invalid, illegal, programming error, whatever) to free the result of SDL_GetVideoSurface()? I think probably yes, for the same reasons?

Was that guess correct?

slouken commented 1 year ago

Is it also considered to be undefined behaviour (invalid, illegal, programming error, whatever) to free the result of SDL_GetVideoSurface()? I think probably yes, for the same reasons?

Was that guess correct?

Yep!

icculus commented 1 year ago

Was that guess correct?

Yep, it's the same pointer.

icculus commented 1 year ago

Or we can just hang onto all the old video surfaces (maybe just free the pixels) until SDL_Quit(). It would take a small amount of memory more, but it wouldn't crash. :)

The codepath is set up for this, there's just a block of conditions where we flush everything and start over (going from a software video mode to SDL_OPENGL, etc)...this is probably just overaggressive in this case.

icculus commented 1 year ago

I need to dig further still, but if those match when we hit this point (or can fix the conditional), this test will work.

Also, I'm not a careful reader: obviously this is going from 32 bit to 16 bit between calls. :)

smcv commented 1 year ago

Since libsdl-perl is doing something undefined, I'm now aiming to get this fixed in Debian via libsdl-perl as a higher priority than in sdl12-compat, with sdl12-compat bug-for-bug compatibility as more of a nice-to-have (and if that means we temporarily lose libsdl-perl from Debian, I'll be sad to lose frozen-bubble but it's still a net benefit).

smcv commented 1 year ago

So here's a stupid idea: what if we keep the same global struct representing the SDL 1.2 video surface at all times, and just reallocate the SDL 2 object and pixel buffer that it points to? (see #306)

We can even avoid it being a one-per-process leak (harmless, but an annoying false-positive for valgrind/asan) by having it never be malloc'd in the first place: it can just be global data.

slouken commented 1 year ago

That is a great idea! 😊