Performance analysis of surface.fill

Starbuck5 commented 2 days ago

Introduction

Last week I went and profiled a handful of random games, mostly from pygame community game jams. One thing I noticed is that fill was often one of the higher ranked pygame-ce functions, in terms of runtime. Not as expensive as blits or display.flip or anything, but it tends to be up there. Which surprised me. It doesn't seem like it would be that expensive.

I wondered if pygame-ce's internal fill routines could do a better job than SDL's, so I came up with my own alternative to surface.fill():

# Normal
screen.fill("purple")

# Weird
screen.fill("white", special_flags=pygame.BLEND_SUB)
screen.fill("purple", special_flags=pygame.BLEND_ADD)

And I found that even doing these 2 function calls was faster than a normal fill! Amazing! If going through twice is faster, we could easily make our special_flags=0 routine to take over fill() and do it more efficiently and show a noticeable speed improvement to a typical game.

However, the story is not that simple. I tested a larger surface and SDL was now faster. What gives? Why is SDL better at large surfaces and we are better at small ones, and is there anything we can contribute to them or learn for ourselves from this?

Data

fill() seconds taken (20k repetitions), different strategies

fill() nanoseconds per pixel, different strategies

Raw data: https://docs.google.com/spreadsheets/d/1WBCVvzkL9HAZJ7Yo1N86-tAFhAP0d2J72Wcp8mveCl4/edit?gid=2144097095#gid=2144097095

Benchmarking script

```py import time import pygame pygame.init() size = 1550 screen = pygame.Surface((size,size)) print(screen, screen.get_pitch()) def fill_purple_normal(): print("Normal fill") screen.fill("purple") print(screen.get_at((0,0))) # Color(160, 32, 240, 255) start = time.time() for _ in range(10000): screen.fill("purple") print(time.time() - start) # Takes about 0.19 seconds on my system def fill_purple_weird(): print("SUB-ADD fill") screen.fill("white", special_flags=pygame.BLEND_SUB) screen.fill("purple", special_flags=pygame.BLEND_ADD) print(screen.get_at((0,0))) # Color(160, 32, 240, 255) start = time.time() for _ in range(10000): screen.fill("white", special_flags=pygame.BLEND_SUB) screen.fill("purple", special_flags=pygame.BLEND_ADD) print(time.time() - start) # Takes about 0.07 seconds on my system fill_purple_normal() fill_purple_weird() ```

Analysis

SDL performance is more bumpy, they seem to like extremely round widths. Width=1200 is significantly faster for them than widths 1050, 1100, 1150, 1250, 1300. I don't think this is an outlier, widths 400 and 800 are also favored by low runtimes in that way
I'm running this on an x86_64 system w/AVX2, so our fills are using our AVX2 routines, they are using SSE.
Our performance gradient is more smooth overall but after 1000x1000 surfaces our time to process each pixel goes up 4-5x to a new stable threshold.
A 1000x1000 32 bpp surface takes up 4MB of memory. My L3 cache is 12MB. Would the amount of cache change the performance threshold?
SDL is using aligned, non temporal stores (they use _mm_stream_ps)
We are using unaligned, normal stores (_mm_storeu_si128, _mm256_storeu_si256)
Source code of what I believe is the SDL routine used here: https://github.com/libsdl-org/SDL/blob/e924f12a7b33678bc71dc96acf3c44142c72c553/src/video/SDL_fillrect.c#L30-L95
They have accelerators for ARM as well, so I'm curious to see if on ARM they would be faster than us in every scenario. Like there could be a world where we make our own internal FillRect implementation for x86 but still call theirs on ARM.

This is a very open ended issue, I mainly want to bring up what I've found to those who might also be interested. Potentially we can contribute something to SDL or learn something from their strategy to improve our own.

@MyreMylar @itzpr3d4t0r

Starbuck5 commented 2 days ago

For all the data I gathered, I haven't actually tried to make a replacement for SDL_FillRect() using our SIMD strategies yet. I see that all our filler code has to load and unpack the source surface, which a replacement for SDL_FillRect() wouldn't need to do, it would just need to broadcast. Maybe the loading and unpacking is what throws our performance off a cliff with large surfaces versus theirs (which does no loading and unpacking because it doesn't need it!)

itzpr3d4t0r commented 1 day ago

Cool to see some research being done in this field. I've done some similar research in the past:

This has everything to do with the cache size and internal cache organization, but mainly size. I used your very same program but with bigger surfaces and got this:

As you can see the weird strategy is faster up until around 2000X2000 surfaces, which is 4M pixels, or in other words 16MB, multiply that by 2 calls to .fill() and you get 32MB which is exactly my L3 cache size. In your case it's not exactly that but cache is also used by other programs and your system could've had more programs open at the time that consumed a bit more cache. SDL's implementation uses nontemporal memory accesses for this very reason, your processor won't care about your data being reused later and simply uses it in a calculation and potentially removes it from the cache the following clock, yielding better cache use (which is good for big surfaces). This is an interesting read about nontemporal memory access. Also this is another very good video explaining how algorithms and caches work: here

Starbuck5 commented 14 hours ago

The program should write entire cache lines at a time using non-temporal stores. Writing partial cache lines may lead to severe performance degradations.

(From your link) Well this answers why SDL's implementation is doing four 128 bit stores per batch! 4 * 128 bits = 64 bytes, the typical size of a cache line. And maybe it also answers why the performance dips so much in less well aligned regions!

I bet ya it is doing partial cache lines because it's only running inside each row of the surface individually. If the cache line crosses the row it's doing normal writes to catch up instead of being able to broadcast across rows.

Starbuck5 commented 14 hours ago

Here is updated data with a quick AVX2 implementation that follows all of our typical strategies. I also realize that before I thought I was running 20k iterations when I was actually running 10k, so that is fixed as well.

fill() seconds taken (10k repetitions), different strategies fill() nanoseconds per pixel, different strategies (1)

This AVX2 implementation is consistently faster than the standard implementation until 1100 by 1100 pixel Surfaces, but it follows the same trend as the prototype SUB-ADD demonstrated.

Also with cache in mind and having looked into the resources you linked @itzpr3d4t0r it also is true that running something in a tight loop may not be the most representative, since typical programs will have lots of other things coming in and out of memory as well, with other blits and things. So maybe that can be simulated by adding more fills() of different surfaces to the loop.

pygame-community / pygame-ce