Faster `fblits()` for multiple blits of the same Surface

itzpr3d4t0r commented 5 months ago

Optimizing Multi Image Blitting

In a typical blit sequence, we often need to draw the same surface at multiple positions. The current approach is:

screen.fblits([ (surf, pos1), (surf, pos2), (surf, pos3), ... ])

This method is straightforward but not optimal. It requires running the same checks for each surface separately, even if it’s the same surface. This checking process takes time, invalidating the destination’s pixel cache before the actual blit.

The current process involves checking the type of blit we want, then reading/writing the pixels. However, by the time we get to the next surface in the sequence, the temporal caching mechanisms are invalidated due to the time elapsed.

I propose a new method for blitting multiple copies of the same Surface onto another Surface. This method allows for faster iteration, reduced checking overhead, and improved temporal coherency, resulting in higher performance.

The proposed approach involves a new format in fblits:

screen.fblits([ (surf, [pos1, pos2, pos3]), ... ])

This format takes a tuple containing a reference to the surface and a list of positions we want to draw to.

Inside the function, we can use memcpy for blitting the same surface more efficiently and run the checks only once, leading to significant performance improvements.

Conclusions

How to use: To implement this, pass a tuple in the format (Surface_To_Draw, List_Of_Positions_To_Draw):

import pygame
pygame.init()

screen = pygame.display.set_mode((500, 500))

surf = pygame.Surface((20, 20))
surf.fill("red")

positions = [(0, 0), (100, 100), (30, 30)]

screen.fblits([(surf, positions)], 0, True)  # sets the flag to 0 and cache=True to actually use the caching

TODO:

[x] Respect destination clip rect
[x] Respect blitting to a subsurface
[x] Investigate self blitting (ensure it works)
[x] Ability to partially draw a surface
[x] Support for Rect/Frect and more sequences as blit positions
[x] (FUTURE WORK) Support blend flags, alpha and colorkey.

The results so far are very promising: (TO BE UPDATED)

Here is a little program I've been experimenting with to get these numbers:

from random import randint, random

import pygame

pygame.init()

screen_size = 1000

screen = pygame.display.set_mode((screen_size, screen_size))

size = 50
flag = 0

s = pygame.Surface((size, size))
pygame.draw.circle(s, (100, 0, 255), (size // 2, size // 2), size // 2)

class Particle:
    def __init__(self, x, y, vx, vy):
        self.x = x
        self.y = y
        self.vx = vx
        self.vy = vy

    def update(self, dt):
        self.x += self.vx * dt
        self.y += self.vy * dt

        if self.x < 0:
            self.x = 0
            self.vx *= -1
        elif self.x > screen_size - size - 1:
            self.x = screen_size - size - 1
            self.vx *= -1

        if self.y < 0:
            self.y = 0
            self.vy *= -1
        elif self.y > screen_size - size - 1:
            self.y = screen_size - size - 1
            self.vy *= -1

        return self.x, self.y

particles = [
    Particle(randint(0, screen_size - size), randint(0, screen_size - size),
             random() * 2 - 1, random() * 2 - 1)
    for _ in range(10000)
]

clock = pygame.time.Clock()

font = pygame.font.SysFont("Arial", 25, True)

modes = ["fblits", "blits", "fblits cached"]
mode_index = 0
mode = modes[mode_index]

while True:
    dt = clock.tick_busy_loop(1000) * 60 / 1000
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            pygame.quit()
            exit()
        elif event.type == pygame.MOUSEWHEEL:
            mode_index = (mode_index + event.y) % len(modes)
            mode = modes[mode_index]
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_0:
                flag = 0

    screen.fill(0)

    if mode == "fblits":
        screen.fblits([(s, p.update(dt)) for p in particles], flag)
    elif mode == "blits":
        screen.blits([(s, p.update(dt), None, flag) for p in particles])
    elif mode == "fblits cached":
        screen.fblits(
            [
                (s, [p.update(dt) for p in particles])
            ],
            flag
        )

    fps_text = font.render(
        "FPS: " + f"{int(clock.get_fps())}" + f"| N: {len(particles)}",
        True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))
    screen.blit(fps_text, fps_text.get_rect(center=(screen_size / 2, screen_size / 2)))

    mode_text = font.render(f"{mode}", True, (255, 255, 255))
    pygame.draw.rect(screen, (0, 0, 0),
                     mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))
    screen.blit(mode_text,
                mode_text.get_rect(center=(screen_size / 2, screen_size / 2 + 25)))

    pygame.display.flip()

Starbuck5 commented 5 months ago

I don't really understand the proposal, but piping in on this:

Create a register cache (of __m256i) and load the Surface’s pixels once.

It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png

XMM are the SSE registers, YMM are the AVX registers, ZMM (not shown in that diagram) are the AVX512 registers.

The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some are routines have more intrinsics variables than there are physical registers.

itzpr3d4t0r commented 5 months ago

I don't really understand the proposal, but piping in on this:

Create a register cache (of __m256i) and load the Surface’s pixels once.

It sounds like you want to store an arbitrary amount of data in SIMD registers-- that's not possible. There are a set amount of these registers in each processor core, like 16. http://www.infophysics.net/amd64regs.png

The intrinsics look like they're giving us registers, but it doesn't compile as 1-for-1 as it seems. I think some are routines have more intrinsics variables than there are physical registers.

I'm aware of all of this but it looked like it was working anyway. But with further experimentation i found out that using memcpy directly yielded similar if not better results while not requiring any heap allocation. This is true for blitcopy alone tho as i still need to check for other blendmodes.

MyreMylar commented 4 months ago

Had a little peek at this.

Are you intending to add alpha blending support, or special blend mode support, in this PR or in a future one?

pygame-community / pygame-ce

Faster `fblits()` for multiple blits of the same Surface #2825

Optimizing Multi Image Blitting

Conclusions