libsdl-org / SDL-1.2

Simple Directmedia Layer, 1.2 branch ... ***DEPRECATED***, please use https://github.com/libsdl-org/SDL for new projects!
https://libsdl.org
GNU Lesser General Public License v2.1

Implement AVX2 and SSE4.1 blitters for BlitNtoNPixelAlpha #877

Closed aronson closed 1 year ago

aronson commented 1 year ago

Hello, this is a project I've worked on for over a year now: two intrinsics-based blitters for modern x86(_64) systems. These blitters promise ~100% and ~300% speedups for SSE4.1 and AVX2 respectively (within this specific hot path, of course).

I have tested and iterated on these blitters for a long time with input from many knowledgeable community members. They have been tested on at least a dozen x86(_64) machines of all kinds for many, many hours of use, and I ship them within a game community as performance mods.

These blitters have a significant effect in software that uses the blitting API to render to the screen. Cogmind is such a game for Win32. In this modern era of 4k screens, on a Ryzen 7 5900X, Cogmind will drop from a stable 60 FPS to 15 FPS when panning a fully-explored map with the mouse -- an operation that redraws most of the screen over and over. With my mod, which uses the AVX2 blitter on that machine, it stays at 60 FPS. On an Intel Cherry Trail Atom netbook, I was able to go from ~7 FPS rendered in 4k to ~14 FPS with the SSE4.1 blitter, as an exaggerated demonstration of the performance.

It's been a while since I counted up the instructions/cycles, but I believe the reason we see this 100% to 300% speedup in the hot path is that these blitters use roughly as many cycles to draw 2 or 4 pixels as the original scalar code did for 1 pixel (2x and 4x throughput, i.e. +100% and +300%), so the gain scales linearly with the number of pixels processed per iteration.
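To illustrate the "two pixels for the price of one" idea, here is a minimal sketch, not the code from this PR. It assumes little-endian ARGB8888 pixels and the approximation `(s*a + d*(255-a)) >> 8`, which is not necessarily the formula SDL uses. The names `blend_one`, `blend_two` and `blend_row` are hypothetical. The vector path is only compiled when the file is built with SSE4.1 enabled (e.g. `-msse4.1`); otherwise the scalar loop handles everything.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h> /* SSE4.1, provides _mm_cvtepu8_epi16 */
#endif

/* Scalar reference: blend one ARGB8888 pixel, one channel at a time.
 * Formula is an assumption for this sketch: (s*a + d*(255-a)) >> 8. */
static uint32_t blend_one(uint32_t src, uint32_t dst)
{
    uint32_t a = src >> 24;
    uint32_t out = 0;
    int shift;
    for (shift = 0; shift < 32; shift += 8) {
        uint32_t s = (src >> shift) & 0xFF;
        uint32_t d = (dst >> shift) & 0xFF;
        out |= (((s * a + d * (255 - a)) >> 8) & 0xFF) << shift;
    }
    return out;
}

#if defined(__SSE4_1__)
/* Vector version: the same arithmetic applied to two adjacent pixels at once. */
static void blend_two(const uint32_t *src, uint32_t *dst)
{
    __m128i s, d, a, inv, sum, res;
    /* Load 2 pixels (8 bytes) and widen every byte to a 16-bit lane:
     * lanes are B0 G0 R0 A0 B1 G1 R1 A1. */
    s = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)src));
    d = _mm_cvtepu8_epi16(_mm_loadl_epi64((const __m128i *)dst));
    /* Broadcast each pixel's source alpha (lanes 3 and 7) over its 4 lanes. */
    a = _mm_shufflelo_epi16(s, _MM_SHUFFLE(3, 3, 3, 3));
    a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
    inv = _mm_sub_epi16(_mm_set1_epi16(255), a);
    /* s*a + d*(255-a) is at most 65025, so it fits an unsigned 16-bit lane. */
    sum = _mm_add_epi16(_mm_mullo_epi16(s, a), _mm_mullo_epi16(d, inv));
    res = _mm_srli_epi16(sum, 8);
    /* Narrow back to 8-bit channels and store the 2 blended pixels. */
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(res, _mm_setzero_si128()));
}
#endif

/* Blend a row: pairs go through the vector path when available,
 * any leftover pixel (or everything, without SSE4.1) through the scalar path. */
static void blend_row(const uint32_t *src, uint32_t *dst, size_t n)
{
    size_t i = 0;
#if defined(__SSE4_1__)
    for (; i + 2 <= n; i += 2)
        blend_two(src + i, dst + i);
#endif
    for (; i < n; i++)
        dst[i] = blend_one(src[i], dst[i]);
}
```

Both paths compute the same per-channel result, so a row blended with `blend_row` should match a pixel-by-pixel call to `blend_one`. An AVX2 variant would follow the same shape with 256-bit registers and four pixels per iteration.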

I can provide Intel VTune outputs demonstrating these gains if necessary.

I'm opening this PR because someone mentioned it might be possible to merge new changes into upstream for those cloning from GitHub. I think it's mostly ready to use, but the intrinsic header includes I added need to be conditional to avoid breaking systems that don't have them. There could also be a better way to determine whether SSE4.1 and AVX2 are available at runtime. I'm not sure my code meets the conventions for the SDL project. I recently added a color-channel swapping function for arbitrary SDL_PixelFormats to get the data into the right ARGB form for bpp==4 systems, as my original fork had this hardcoded for Cogmind's format specifically.
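On the runtime-detection question, one option on GCC and Clang is the `__builtin_cpu_supports` builtin. The sketch below assumes GCC 4.8 or newer (or Clang) targeting x86; the names `CPU_SSE41`, `CPU_AVX2`, `detect_cpu_features` and `pick_blitter` are hypothetical and not SDL API. MSVC is not covered here. SDL2 exposes `SDL_HasSSE41()` and `SDL_HasAVX2()` for the same purpose.

```c
/* Only use the builtin where it is assumed to exist: x86 with Clang or GCC >= 4.8. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__clang__) || (defined(__GNUC__) && \
     (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 8))))
#define HAVE_BUILTIN_CPU_SUPPORTS 1
#else
#define HAVE_BUILTIN_CPU_SUPPORTS 0
#endif

/* Hypothetical feature flags and blitter ids for this sketch. */
enum { CPU_SSE41 = 1 << 0, CPU_AVX2 = 1 << 1 };
enum blitter { BLIT_SCALAR, BLIT_SSE41, BLIT_AVX2 };

/* Ask the CPU which instruction sets it supports; returns 0 when
 * the compiler or architecture gives us no way to ask. */
static int detect_cpu_features(void)
{
    int flags = 0;
#if HAVE_BUILTIN_CPU_SUPPORTS
    if (__builtin_cpu_supports("sse4.1"))
        flags |= CPU_SSE41;
    if (__builtin_cpu_supports("avx2"))
        flags |= CPU_AVX2;
#endif
    return flags;
}

/* Choose the widest blitter the CPU can run, falling back to scalar. */
static enum blitter pick_blitter(int flags)
{
    if (flags & CPU_AVX2)
        return BLIT_AVX2;
    if (flags & CPU_SSE41)
        return BLIT_SSE41;
    return BLIT_SCALAR;
}
```

A caller would run `pick_blitter(detect_cpu_features())` once when the blit is set up and cache the result, so the check stays out of the per-pixel hot path.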

I am happy to make any changes necessary for this PR, just ask! I'm also happy to answer any questions you may have. I hope my extensive comments make it easier to reason about this complex intrinsic code.

sezero commented 1 year ago

I don't think we want this in SDL-1.2: @icculus, @slouken? This can (should) be ported to SDL3 and possibly to SDL2, though.

aronson commented 1 year ago

I would like to mention I've already ported it to both SDL2 and 3, I just don't have performance data for you yet. I'm coming from this issue in SDL2/3 someone opened. It was a trivial port, as the function is nearly identical in scalar form in 2/3. I can open a PR to that project, but if you're aware of any software that makes significant use of the blitting API that I can test with, it would be immensely helpful.

icculus commented 1 year ago

Yeah, there's a separate issue open for SDL2/3.

I haven't tried these yet but I don't mind them going into revision control for 1.2, even if I'd rather people experience them when SDL2 uses them through sdl12-compat.

sezero commented 1 year ago

> Yeah, there's a separate issue open for SDL2/3.
>
> I don't mind them going into revision control for 1.2

Not as they are: The patch assumes recent gcc or msvc, but SDL-1.2 supports old gcc / binutils versions that don't necessarily support target attributes and instead need the -msse4.1 or -mavx2 switch to succeed. There are possibly similar concerns for MSVC versions. At the very least, the sse4.1 and avx2 code can be moved into their own exclusive modules, built with the corresponding compiler switches, with corresponding checks added to the configure scripts, for example.
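As a rough illustration of the target-attribute concern, the sketch below only enables a per-function `target("avx2")` path when the compiler is assumed to support it (GCC 4.9 or newer, or Clang, on x86) and otherwise compiles the scalar code alone. The macro and function names are hypothetical, and the saturating byte add is just a stand-in kernel, not one of the blitters. Older toolchains that fail the check would need the separate-module approach, built with `-mavx2`, instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: per-function target attributes work on x86 with Clang or GCC >= 4.9. */
#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__clang__) || (defined(__GNUC__) && \
     (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 9))))
#define HAVE_TARGET_ATTR 1
#define TARGET_AVX2 __attribute__((target("avx2")))
#include <immintrin.h>
#else
#define HAVE_TARGET_ATTR 0
#endif

/* Scalar fallback: saturating add of n bytes, always compiled. */
static void add_sat_scalar(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned sum = (unsigned)dst[i] + src[i];
        dst[i] = (uint8_t)(sum > 255 ? 255 : sum);
    }
}

#if HAVE_TARGET_ATTR
/* AVX2 path: only this function is compiled with AVX2 enabled.
 * Returns the number of bytes it handled (a multiple of 32). */
TARGET_AVX2
static size_t add_sat_avx2(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 32 <= n; i += 32) {
        __m256i d = _mm256_loadu_si256((const __m256i *)(dst + i));
        __m256i s = _mm256_loadu_si256((const __m256i *)(src + i));
        _mm256_storeu_si256((__m256i *)(dst + i), _mm256_adds_epu8(d, s));
    }
    return i;
}
#endif

/* Dispatcher: use the AVX2 path only if it was compiled in and the
 * CPU reports AVX2 at runtime; the scalar code finishes the remainder. */
static void add_sat(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t done = 0;
#if HAVE_TARGET_ATTR
    if (__builtin_cpu_supports("avx2"))
        done = add_sat_avx2(dst, src, n);
#endif
    add_sat_scalar(dst + done, src + done, n - done);
}
```

The rest of the file is built without any `-m` switch, so the compiler cannot emit AVX2 instructions outside the attributed function, which is the property the exclusive-module layout gives on compilers without target attributes.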

> I'd rather people experience them when SDL2 uses them through sdl12-compat.

That would be more convenient: IMO, we should leave SDL-1.2 as is with regard to these new features, and review and merge new code to SDL2/3.

slouken commented 1 year ago

> IMO, we should leave SDL-1.2 as is with regard to these new features, and review and merge new code to SDL2/3.

Agreed.

aronson commented 1 year ago

Sounds good to me, closing this one out. I hear you on the maintenance burden and understand this would break existing builds -- the last thing I'd want to do. My fork remains out there for those who want to manually hack it in, if it makes sense to do so.

I will open a PR that includes these blitters in the modern project. I will try to implement the module pattern you mentioned if you think that's relevant for the modern project. The blitting function here is a pure function of sorts and well understood, so I'm not shy about porting it to the modern project as-is. I just ask that you give me some time to collect performance profiling data and test it with example SDL2 and SDL3 applications. Stay tuned, and thank you for your interest in my feature :)

sezero commented 1 year ago

Thanks!