Optimizations which could be useful, from an sdl1.2 fork

libsdl-org / SDL

Simple Directmedia Layer

https://libsdl.org

zlib License

9.92k stars 1.84k forks source link

Optimizations which could be useful, from an sdl1.2 fork #8178

Closed Lord-Nightmare closed 1 year ago

Lord-Nightmare commented 1 year ago

Are any of the optimizations from https://github.com/aronson/SDL-1.2/ worthwhile (or even possible) to be merged upstream into sdl12-compat? The modifications were apparently done with the game 'Cogmind' in mind, but after finding that fork, I was wondering if they would be useful in general.

icculus commented 1 year ago

sdl12-compat just uses the SDL2 blitters, so I've moved this bug over to SDL's issue tracker in case they are useful for SDL2 and/or SDL3.

The primary wins are going to be the AVX2 and SSE4.1 blitters, but I haven't looked at the fork yet to comment intelligently on it. It might even make sense to merge them directly into SDL-1.2, but that would only be for people cloning from GitHub: we aren't going to do another SDL 1.2.x release, since we have moved on to sdl12-compat.

We also have an issue pending in https://github.com/libsdl-org/sdl12-compat/issues/303 (from @aronson, the owner of that SDL-1.2 fork) to make Cogmind work directly with sdl12-compat, but we've been slammed and haven't had a chance to work on it yet.

icculus commented 1 year ago

Oh, also, if we do merge these into SDL2/SDL3, we'd need permission to license them under the zlib license; SDL-1.2 uses the LGPL license. This might not be a problem, but we'd definitely need confirmation on that before any merge could go forward, if it makes sense to merge.

aronson commented 1 year ago

Hello, I understand these well as I was involved in writing them as part of a collaborative community effort. Permission for the zlib license is not an issue.

The only interesting performance win in my fork is the blitters you mention. I don't know how to properly integrate them into the existing assembly implementation configuration within that file, but I could learn. I believe that configuration is currently broken, as I must use --disable-assembly to get a working build these days.

There are problems with implementing these blitters as-is that can be worked around with additional effort I'm willing to put in. Right now the channel order is hardcoded, and set to workaround a "bug" in Cogmind.

The performance gains can be massive. On my Ryzen 7 5900X they don't make a difference at 1080p, but in Cogmind, a game that relies entirely on the blitting API, the game goes from 15FPS in 4k under stress to 60FPS with the AVX2 blitter. On my 10" ASUS Transformer netbook with an Intel Cherry Trail Atom, I am able to render in 4k 60FPS from 7FPS -> 15FPS compared to stock, using the SSE 4.1 blitter. The gains are only apparent when the game is under significant stress.

The blitters are designed to use 2-wide and 4-wide (in terms of output) pixel pipelines for SSE 4.1 and AVX2 respectively. The alpha blend algorithm is implemented with SIMD operations and its most complex component involves taking the 8-bit integer packed vector pixels and widening them to 16-bit packed integer format to perform the multiply operation, and then reducing back down to 8-bit color channel members. There is a consideration for AVX2 in terms of an (impossible) cross-lane shuffle that makes the SIMD code a bit more unwieldy to reason with but we found a way to perform it quickly without such a shuffle. I found through performance profiling that this alpha blend operation was the hottest path in Cogmind's engine under stress.

I also have ARM NEON implementations of these blitters.

I have performed an extensive amount of work testing and perfecting these blitters and would love to see my work included in the upstream! I'm happy to do any work required to get this ready for upstream. Some sort of guidelines on how to do so would be immensely helpful. I'll also have to cut out the Cogmind-specific hacks.

I would also like to add I have tested and experimental implementations of these blitters for ISPC that can be more performant in some situations. There is a lot of room at play here and I'd be happy to work within the SDL2/3 space if there is interest.

EDIT: I will work on opening a PR for you, stay tuned! EDIT2: I'm working within my new blitters branch based on main if you want to review minimal up-to-date changes. I still need to resolve the channel swapping issue, as a game like Cogmind that uses a non-standard SDL_PixelFormat will have color channels swapped otherwise. I'd prefer to do this in a SIMD way, which is why it will take some time. EDIT3: I completed the arbitrary color channel swap. Just need to write up a PR if the intent is to perhaps merge this into SDL-1.2. ~~I will look into porting forward to SDL2/3.~~ It will be a trivial port and likely have the same effect as the SDL2/3 code is scalar. EDIT4: I have a port for SDL2 in my blitters-sdl2 branch. Tested and working with basic tests, but I want to collect profiling data on real x86_64 machines before opening a PR. Unfortunately I'm not seeing speedups but I've only tested on very fast machines. I will test on much older machines to see if it has an effect.

EDIT5: PR to SDL-1.2 opened as I have provable performance gains there. Still working on assessing if SDL2/3 can make use of these in a meaningful way. Going to craft a test of blitting white noise as fast as possible to stress the algorithm.

aronson commented 1 year ago

Double-commenting instead of editing as I have a significant update:

I have tested and discovered the bltting API internally is different in SDL2/3. There are routines that get autovectorized for well-known blends like ARGB->ARGB and ABGR->ARGB at 4bpp. I created my minimal test program and didn't get speedups at first, but very minor slowdowns! I realized that the color-channel swap approach I recently added was slower compared to these autovectorized functions. Removing it (no additional color swap) I finally saw a ~100% speedup in my benchmark of the blitting API compared to stock on an AVX2 machine! I got 18FPS unstable -> 40FPS rock-solid in a test of drawing random rectangles of white noise with random alpha as fast as possible in 16x16 chunks.

It would seem that the correct approach is for me to handle the color channel swap in the shuffle we already do within the actual blitting methods instead of as a pre-init step, with shuffle masks that are selected based on the input SDL_PixelFormat. I believe a programmatic approach to build the masks based on the SDL_PixelFormat is appropriate. This should handle all possible color channel formats in a matter faster than stock. It will take a bit of time, but I've finally proven that my blitters are useful to SDL2/3. My benchmark is as such if you wish to review and tell me of problems in my theory. Note I did this in SDL2, but I will open a PR to 3 first. I'm using SDL2 because I can also test with Cogmind through sdl12-compat, and I saw a ~10FPS speedup at 4k (22->34FPS under stress).

#include <SDL.h>
#include <stdlib.h>
#include <time.h>

float calculateFPS(Uint32 frameStart, Uint32 frameEnd) {
    float frameTime = (float)(frameEnd - frameStart);
    if (frameTime > 0) {
        return 1000.0f / frameTime;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    SDL_Init(SDL_INIT_VIDEO);

    SDL_Window* window = SDL_CreateWindow("White Noise",
                                          SDL_WINDOWPOS_UNDEFINED,
                                          SDL_WINDOWPOS_UNDEFINED,
                                          1800, 1200,
                                          SDL_WINDOW_SHOWN);

    SDL_Surface* screenSurface = SDL_GetWindowSurface(window);

    srand(time(NULL));

    int quit = 0;
    SDL_Event event;

    Uint32 frameStart, frameEnd;
    float fps;
    int frameCounter = 0;
    Uint32 titleUpdateTimer = SDL_GetTicks();

    while (!quit) {

        frameStart = SDL_GetTicks();
        while (SDL_PollEvent(&event)) {
            if (event.type == SDL_QUIT) {
                quit = 1;
            }
        }

        for (int i = 0; i < 25000; ++i) {
            int x = rand() % 1777;
            int y = rand() % 1177;
            int alpha = rand() % 256;

            // Fill a 16x16 rectangle for the blit with white noise and random alpha
            SDL_Surface* whiteRect = SDL_CreateRGBSurface(0, 16, 16, 32, 0x00FF0000, 0x0000FF00, 0x000000FF, 0xFF000000);
            SDL_FillRect(whiteRect, NULL, SDL_MapRGBA(whiteRect->format, 255, 255, 255, alpha));

            SDL_Rect dstrect;
            dstrect.x = x;
            dstrect.y = y;
            SDL_BlitSurface(whiteRect, NULL, screenSurface, &dstrect);

            SDL_FreeSurface(whiteRect);
        }

        SDL_UpdateWindowSurface(window);
        frameEnd = SDL_GetTicks();
        fps = calculateFPS(frameStart, frameEnd);
        frameCounter++;

        if (SDL_GetTicks() - titleUpdateTimer > 500) {
            char title[64];
            snprintf(title, sizeof(title), "White Noise - FPS: %.2f", fps);
            SDL_SetWindowTitle(window, title);
            printf("FPS: %.2f\n", fps);
            titleUpdateTimer = SDL_GetTicks();
            frameCounter = 0;
        }
    }

    SDL_DestroyWindow(window);
    SDL_Quit();

    return 0;
}

To run this benchmark yourself, check out my SDL2 branch I mentioned previously and remove the three calls to convertPixelFormatsx4 in the new functions. I think I also mistakenly broke the CPUID import on MinGW so you may need to comment it out and just hack the booleans (hasAVX2 and hasSSE41) yourself based on your CPU's features.

slouken commented 1 year ago

You can actually chain cogmind through sdl12-compat -> sdl2-compat -> SDL3 and in theory it should run. There will be a lot of variables, but I'm curious what the performance difference is between cogmind on your SDL 1.2 and cogmind chained through to SDL3.

icculus commented 1 year ago

So faster blitters are always good, so I want to get these into revision control regardless of what Cogmind needs, but I'm wondering if there's a value in making sdl12-compat optionally treat all surfaces as textures.

If all the game is doing is blitting existing surfaces to the screen, getting the blits onto the GPU is the actual optimization win.

(Right now sdl12-compat only uses a texture for the screen surface and uses software blitters for everything else.)

slouken commented 1 year ago

There's precedence for this, that's what the SDL_HWSURFACE flag is for...

aronson commented 1 year ago

I'm wondering if there's a value in making sdl12-compat optionally treat all surfaces as textures

This is something I have explored for many hours myself, but never made much progress on as all I know of SDL internals are what I've reverse engineered for myself. Talking to a colleague long ago who heavily optimized Minecraft's renderer in OGL, we both think this could have a significant impact on applications such as Cogmind and is the correct approach. I presumed this was a Herculean effort due to how the game was designed, but perhaps it's not!

If all the game is doing is blitting existing surfaces to the screen

Unfortunately Cogmind blits new surfaces to the screen every frame. There is no re-use. I'm not sure if this has implications for the texture strategy other than stressing the memory transport to the GPU. Allow me to explain how it works. Cogmind is a terminal emulator underneath with square tiles of 16px (more or less depending on resolution) with a fixed minimum size of something around 80x60 tiles. Every glyph in the game starts as a bitmap on disk and is considered part of a "font" that is rendered in the terminal emulator. In addition to pure glyphs, there are color gradients applied to the entire tile for some effects. Cogmind identifies which tiles in its engine need an update every frame, and dispatches a blit request to SDL for each tile. Technically, several are dispatched per tile in order for the logical graphical layers (background, optional item, optional enemy, optional color gradient, or text glyph) and blended when the dev calls the flip at the end of his frame update dispatch. In many other cases there is minimal blending, as the blit simply replaces the entire tile with a new glyph. Both situations are where the alpha blend blitter becomes relevant. Each blit request and origin surface is constructed anew every frame. Panning the map, forcing a re-render of ~85% of the tiles, is an effective means to stress the engine and is barely 15 FPS in 4k on a very fast gaming desktop. While this worked great at resolutions up to 1080p (60 FPS panning) 4k is where it begins to break down. GPUs were certainly made for this kind of computational work.

As an aside, I've thought about a fork to identify input surfaces in a common form (e.g. "this is a level 1 grunt enemy at tile 50x67") and convert them to managed textures to facilitate re-use and shift the blits to the GPU, but it quickly became a nightmare.

I'm curious what the performance difference is

I should have data for you within the day, stay tuned!

icculus commented 1 year ago

There's precedence for this, that's what the SDL_HWSURFACE flag is for...

My memory is hazy...I think there was an early attempt at this, but so many games misused/misunderstood this flag that we ended up abandoning it and using software surfaces for everything, uploading the final scene to a texture.

But as a quirk, we might be able to make it work for specific games that only ever create a bunch of static surfaces and blit them to the screen (which might not be Cogmind, so we can take this side discussion to sdl12-compat's tracker later).

Unfortunately Cogmind blits new surfaces to the screen every frame. There is no re-use.

So he has the glyph bitmaps in memory somewhere, and when he needs to blit, he creates a new SDL_Surface with that bitmap data, blits that new surface somewhere (the screen? Some final compositing surface that eventually goes to the screen?), and then destroys that new surface? Just making sure I'm understanding correctly here.

aronson commented 1 year ago

when he needs to blit, he creates a new SDL_Surface with that bitmap data, blits that new surface somewhere

Indeed, this is what happens. He blits directly to an intermediate surface the size of the window with many new small blit surfaces every frame and then flips this into the screen buffer, if I remember correctly. He might skip the intermediate buffer and go straight to the screen surface but given my conversations with him about it I think there's an intermediate surface involved with a flip. Each frame can be anywhere from 0 to ~4800 individual blits. And he does destroy the surfaces.

icculus commented 1 year ago

...I don't suppose he'd be interested in, like, optimizing this in the app that he's still developing instead of everyone trying to hook in workarounds...? :)

aronson commented 1 year ago

I've talked with him about it at length! The game isn't even shipped with optimizations itself (a dream for a reverse engineer).

Cogmind is 60 FPS+ stable on ~99% of users' machines. It's exotic platforms and situations like Apple Silicon or 4k support that bring out these "problems". Cogmind has been in development for 10 years with no sign of slowing down, and Kyzrati's official stance is a port to SDL2/3 when the features are complete.

So maybe in five years!

As for my blitters I think am literally the only person to play at 4k and notice and complain. The dev added an FPS counter just for me. I've seen one other user play in 2.5k and they didn't notice the frame drops.

I want to make cogmind run better on my mac and desktop so I started researching all this stuff. I'm happy to continue helping and bring these to everyone if useful.

aronson commented 1 year ago

@slouken I put together naive ports of my blitters to SDL3 and have numbers for you using the compatibility layers.

Cogmind performance statistics at 4k on a Ryzen 7 5800X:

Idle in a rich scene:

SDL 1.2
- stock: 163 FPS
- new: (identical)
SDL2
- stock: 99 FPS
- new: 101 FPS
SDL3
- stock: 140 FPS
- new: (identical)

Under medium stress (mashing a menu re-render):

SDL 1.2
- stock: 55 FPS
- new: 135 FPS
SDL2
- stock: 43 FPS
- new: 47 FPS
SDL3
- stock: 46 FPS
- new: 53 FPS

Under high stress (panning the map):

SDL 1.2
- stock: 16 FPS
- new: 117 FPS (baffling but consistent)
SDL2
- stock: 24 FPS
- new: 34 FPS
SDL3
- stock: 30 FPS
- new: 34 FPS

I also ran my benchmark through sdl2-compat as a quick test and saw an improvement of 6.9 FPS -> 8 FPS, both rock-solid stable. I will figure out a native SDL3 build of the benchmark. I'm not sure why the effective performance difference goes down as we go into more modern projects. I did my best to filter out variance in these numbers through sampling for 20s at a time and keeping the machine idle.

EDIT: SDL3 branch here. Still need to figure out the module pattern to support all builds.

aronson commented 1 year ago

Another significant update.

I ported it to SDL3 with enhancements I derived, ran my synthetic benchmark, and saw ~1300% speedups under AVX2 and ~380-1000% speedups under SSE4.1 compared to the original implementation. See diff here

I tested on four machines:

2020 i7 1068NG7 MBP (AVX2): 0.79 FPS -> 11.2 FPS
2014 i7-4790k Win10 laptop (AVX2): 1 FPS -> 14 FPS
2009 Intel Core2Duo (P8700) MBP (first processor family with SSE4.1): 0.51 FPS -> 5.74 FPS
2021 M1 Max (ARM emulating SSE4.1 under Rosetta/Wine): 4.20 FPS -> 20.41 FPS

I figured out the module pattern and conditional compile. Builds work on nearly all platforms now, but GCC 9.4.0 needs a little more work from me to fix. Also having an issue in three Windows builds.

I made further optimizations to the algorithm. I now shuffle 4-wide under SSE 4.1 within a buffer for color channel alignment to ARGB in both SSE4.1 and AVX2 and do so in a batch up-front instead of weaved in with each 2-wide blit for SSE4.1. The AVX2 routine features a similar buffer pattern. It has proven quite effective in profiling.

I get the best results with modern MinGW and compiler optimizations enabled. I have attached the source to the benchmark I crafted. I intentionally picked a color format that requires shuffling and thus would benefit from BlitNtoNPixelAlpha_(SSE4_1/AVX2). I have not tested under pure Linux or macOS yet, but will have numbers in my PR. EDIT: ^^ looks like ARGB8888 format surfaces see nearly the same speedup even without skipping the effective no-op color shuffle.

Once I have solved the build issues I will open a PR to the project.

SDL3 Benchmark:

#include <SDL3/SDL.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

float calculateFPS(Uint32 frameStart, Uint32 frameEnd) {
    float frameTime = (float)(frameEnd - frameStart);
    if (frameTime > 0) {
        return 1000.0f / frameTime;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    SDL_Init(SDL_INIT_VIDEO);

    SDL_Window* window = SDL_CreateWindow("White Noise",
                                          800, 600, 0);

    SDL_Surface* screenSurface = SDL_GetWindowSurface(window);

    srand(time(NULL));

    int quit = 0;
    SDL_Event event;

    Uint32 frameStart, frameEnd;
    float fps;
    int frameCounter = 0;
    Uint32 titleUpdateTimer = SDL_GetTicks();

    while (!quit) {

        frameStart = SDL_GetTicks();
        while (SDL_PollEvent(&event)) {
            if (event.type == SDL_EVENT_QUIT) {
                quit = 1;
            }
        }

        for (int i = 0; i < 25000; ++i) {
            int x = rand() % 780;
            int y = rand() % 580;
            int alpha = rand() % 256;

            SDL_Surface* whiteRect = SDL_CreateSurface(16, 16, SDL_PIXELFORMAT_BGRA8888);
            SDL_FillSurfaceRect(whiteRect, NULL, SDL_MapRGBA(whiteRect->format, 255, 255, 255, alpha));

            SDL_Rect dstrect;
            dstrect.x = x;
            dstrect.y = y;
            SDL_BlitSurface(whiteRect, NULL, screenSurface, &dstrect);

            SDL_DestroySurface(whiteRect);
        }

        SDL_UpdateWindowSurface(window);
        frameEnd = SDL_GetTicks();
        fps = calculateFPS(frameStart, frameEnd);
        frameCounter++;

        if (SDL_GetTicks() - titleUpdateTimer > 500) {
            char title[64];
            snprintf(title, sizeof(title), "White Noise - FPS: %.2f", fps);
            SDL_SetWindowTitle(window, title);
            printf("FPS: %.2f\n", fps);
            titleUpdateTimer = SDL_GetTicks();
            frameCounter = 0;
        }
    }

    SDL_DestroyWindow(window);
    SDL_Quit();

    return 0;
}

slouken commented 1 year ago

Nice! You're doing great work here! :)

aronson commented 1 year ago

Thanks!

PR to SDL3 opened

slouken commented 1 year ago

I'll close this and wait for your PR to be complete.