Optimized `transform.scale2x()`

pygame-community / pygame-ce

🐍🎮 pygame - Community Edition is a FOSS Python library for multimedia applications (like games). Built on top of the excellent SDL library.

https://pyga.me

849 stars 140 forks source link

Optimized `transform.scale2x()` #2859

Closed itzpr3d4t0r closed 3 months ago

itzpr3d4t0r commented 4 months ago

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp): Figure_1

Starbuck5 commented 4 months ago

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

itzpr3d4t0r commented 4 months ago

How would switching from a macro to a static inline function have any impact on performance? They both compile into the caller.

I'm not sure but it should be a combination of factors, especially because the macros repeated ternary ops and other operations whereas the functions do not.

Anyways I tested for this and no macros = 30-35% performance improvement:

Figure_1

Take a READINT24 macro for example, it expands to this code:

((srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[0] << 16 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[1] << 8 |
 (srcpix + ((((0) > (looph - 1)) ? (0) : (looph - 1)) * srcpitch) + (3 * loopw))[2])

So it does the same calculation thrice. With a function we pass in that pointer with the calculations in once and just index it so we save 2 calculations. So since we have 5 calls, we get 20 calculations with macros and 5 without.

This still applies for the WRITEINT24 macro.

Gabryel-lima commented 4 months ago

This PR optimizes all bpp cases for the transform.scale2x() function. It does so through better use of pointers / switch to static inline functions instead of macros that repeat some calculations and one optimization suggestion in https://www.scale2x.it/algorithm.

Performance improvements vary on a case by case basis depending on suface size and bpp, so I'll just leave my results here (top one is 32bpp):

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

ankith26 commented 4 months ago

It is interesting to see that pre-this-pr the performance of scale2x is actually worse than the generic scale for your usage.

Starbuck5 commented 4 months ago

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

MyreMylar commented 4 months ago

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

itzpr3d4t0r commented 4 months ago

I have to admit restrict is sort of a weird optimization path, I observed some minor-medium benefits but only in specific size ranges so I guess it's not worth the concern.

Gabryel-lima commented 4 months ago

What are these so-called “macros”? And by the way, the dash-board is really cool 😁

@Gabryel-lima macros are a concept in C and C++. https://www.geeksforgeeks.org/macros-and-its-types-in-c-cpp/

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

@Starbuck5 Wow, very interesting! Thanks!

Starbuck5 commented 4 months ago

@itzpr3d4t0r I don't think we can use restrict here, because technically there could be other instances of things accessing the surface pixels at the same time right? Like with surfarray? I'd be interested to see if there's actually a significant difference between using restrict and not using restrict.

How exactly does this happen? Shouldn't locking/unlocking by surfarray and this function prevent multiple threads accessing the memory at the same time. Is there another way that this memory could be accessed simultaneously, outside of a multithreaded context, that I'm blanking on? My computer science educational background is sometimes spotty.

It should, I'm just being cautious... Maybe if you constructed a Surface with shared memory from another object you could edit it while running something else?

I don't have a solid scenario in mind.