Open itzpr3d4t0r opened 5 months ago
I don't really understand the proposal, but piping in on this:
Create a register cache (of __m256i) and load the Surface’s pixels once.
It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png
XMM are the SSE registers, YMM are the AVX registers, ZMM (not shown in that diagram) are the AVX512 registers.
The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some are routines have more intrinsics variables than there are physical registers.
I don't really understand the proposal, but piping in on this:
Create a register cache (of __m256i) and load the Surface’s pixels once.
It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png
The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some are routines have more intrinsics variables than there are physical registers.
I'm aware of all of this but it looked like it was working anyway. But with further experimentation i found out that using memcpy directly yielded similar if not better results while not requiring any heap allocation. This is true for blitcopy alone tho as i still need to check for other blendmodes.
Had a little peek at this.
Are you intending to add alpha blending support, or special blend mode support, in this PR or in a future one?
Optimizing Multi Image Blitting
In a typical blit sequence, we often need to draw the same surface at multiple positions. The current approach is:
This method is straightforward but not optimal. It requires running the same checks for each surface separately, even if it’s the same surface. This checking process takes time, invalidating the destination’s pixel cache before the actual blit.
The current process involves checking the type of blit we want, then reading/writing the pixels. However, by the time we get to the next surface in the sequence, the temporal caching mechanisms are invalidated due to the time elapsed.
I propose a new method for blitting multiple copies of the same Surface onto another Surface. This method allows for faster iteration, reduced checking overhead, and improved temporal coherency, resulting in higher performance.
The proposed approach involves a new format in fblits:
This format takes a tuple containing a reference to the surface and a list of positions we want to draw to.
Inside the function, we can use memcpy for blitting the same surface more efficiently and run the checks only once, leading to significant performance improvements.
Conclusions
How to use: To implement this, pass a tuple in the format
(Surface_To_Draw, List_Of_Positions_To_Draw)
:TODO:
The results so far are very promising: (TO BE UPDATED)
Here is a little program I've been experimenting with to get these numbers: