pygame / pygame

🐍🎮 pygame (the library) is a Free and Open Source python programming language library for making multimedia applications like games built on top of the excellent SDL library. C, Python, Native, OpenGL.
https://www.pygame.org
7.43k stars 3.31k forks

New alpha blitter much slower on Raspberry PI #2370

Closed. ghost closed this issue 3 years ago

ghost commented 3 years ago

Config

Raspberry Pi 3B+, latest official Raspberry Pi OS (32-bit), everything default. pygame installed from pip (and the latest dev version compiled to make sure this wasn't fixed since 2.0).

Problem

I was trying some pygame stuff on a Raspberry Pi and it seems that some strange things are going on with the new alpha blitter (#2243).

Anything using per-pixel alpha blitting is so much slower than with pygame 1 that it's sometimes unusable (10 to 20x slower).

It looks like it's related to the new alpha blitter (#2243).

Tests

Running the tests from #2243 posted by @MyreMylar (only the even blit width results, because there is not much difference between even and odd).

pygame 1.9.6

tested Blit no alpha              : 1809.057ms
tested Blit surface alpha         : 5245.058ms
tested Blit pixel to opaque alpha : 2610.661ms
tested Blit pixel to pixel alpha  : 40830.153ms
tested Blit pixel and surf alpha  : 2603.336ms

pygame 2.0.1.dev1 (SDL 2.0.9, python 3.7.3) With PYGAME_BLEND_ALPHA_SDL2='1'

tested Blit no alpha              : 1791.832ms
tested Blit surface alpha         : 4833.745ms
tested Blit pixel to opaque alpha : 2610.402ms
tested Blit pixel to pixel alpha  : 2667.952ms
tested Blit pixel and surf alpha  : 13076.423ms
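
For reference, a minimal sketch of how the SDL2 blitter run above can be selected from Python rather than from the shell. It assumes the variable needs to be in the environment before pygame performs its first alpha blit, so it is set before the import; `using_sdl2_alpha_blitter` is a hypothetical helper for illustration, not part of pygame.

```python
import os

# Assumption: set the variable before importing pygame so it is already in
# the environment when the first alpha blit happens.
os.environ["PYGAME_BLEND_ALPHA_SDL2"] = "1"

try:
    import pygame  # optional: the sketch still runs where pygame is missing
except ImportError:
    pygame = None


def using_sdl2_alpha_blitter() -> bool:
    """Hypothetical helper: report whether the SDL2 override is requested."""
    return os.environ.get("PYGAME_BLEND_ALPHA_SDL2") == "1"


print("SDL2 alpha blitter requested:", using_sdl2_alpha_blitter())
```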

pygame 2.0.1.dev1 (SDL 2.0.9, python 3.7.3) New blitter

tested Blit no alpha              : 1807.6ms
tested Blit surface alpha         : 4887.749ms
tested Blit pixel to opaque alpha : 66528.607ms
tested Blit pixel to pixel alpha  : 59373.177ms
tested Blit pixel and surf alpha  : 66369.716ms

The results for no alpha and surface alpha are similar, but the others are much, much worse. The biggest problem is pixel to opaque. It's similar between pygame 1 and pygame 2 with the SDL2 blitter, but with the new alpha blitter it's about 25 times slower (on other platforms, like my Mac, the new blitter is actually faster than the SDL2 one).
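
The slowdown figures can be checked directly from the timings above. This is only a small arithmetic sketch: the dictionaries copy the even-width numbers for pygame 1.9.6 and the new blitter, and `slowdown` is a hypothetical helper name.

```python
# Even-width timings in milliseconds, copied from the results above.
PYGAME_1_9_6 = {
    "no alpha": 1809.057,
    "surface alpha": 5245.058,
    "pixel to opaque alpha": 2610.661,
    "pixel to pixel alpha": 40830.153,
    "pixel and surf alpha": 2603.336,
}
NEW_BLITTER = {
    "no alpha": 1807.6,
    "surface alpha": 4887.749,
    "pixel to opaque alpha": 66528.607,
    "pixel to pixel alpha": 59373.177,
    "pixel and surf alpha": 66369.716,
}


def slowdown(baseline, candidate):
    """Return candidate/baseline time ratios; values above 1 mean slower."""
    return {name: candidate[name] / baseline[name] for name in baseline}


ratios = slowdown(PYGAME_1_9_6, NEW_BLITTER)
for name, ratio in ratios.items():
    print(f"{name:<22}: {ratio:5.1f}x")
```

The "pixel to opaque" and "pixel and surf" rows both come out at roughly 25x, while the first two rows stay close to 1x.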

Anyway, this alpha blit optimization is way over my head 🤔 ... but hopefully this makes sense to someone 😄

Thanks

illume commented 3 years ago

Thanks for the report.

Oops. I think we still need to handle optimization flags automatically.

You'd probably get better results with this... I wonder what you get?

CFLAGS=-mfpu=neon python3 setup.py build

MyreMylar commented 3 years ago

The blitter being 10 times slower on pi is pretty much what would be expected without enabling NEON as it will just use the non-neon normal CPU math path.

Probably we could quick fix this by just disabling the new blitter path on Pi. If I recall the original problem with Pi is that if you want to maintain compatibility with Raspberry Pi 1 you can't use the NEON/SIMD registers because the Pi 1 didn't have them.

The only problem with going that way is that the SDL2 alpha blitters are kind of bug ridden in SDL2 and written in assembly that nobody else understands.

I think we were never able to test if code built with neon optimisations enabled still worked on the Pi 1 because nobody actually had a Pi 1. There were also some ideas floated about putting all the SIMD code into a separate C file IIRC, but I don't remember exactly what that was going to help with.
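
As a rough illustration of the Pi 1 compatibility concern, here is a sketch of checking at runtime whether a Linux ARM board advertises NEON. It is not what pygame's build does; it assumes the kernel lists `neon` (32-bit) or `asimd` (64-bit) on the `Features` line of `/proc/cpuinfo`, and both function names are hypothetical.

```python
def cpu_has_neon(cpuinfo_text: str) -> bool:
    """Return True if a /proc/cpuinfo dump lists NEON (or AArch64 asimd)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "features":
            flags = value.split()
            if "neon" in flags or "asimd" in flags:
                return True
    return False


def read_cpuinfo(path: str = "/proc/cpuinfo") -> str:
    """Read the cpuinfo file, returning an empty string where it is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return handle.read()
    except OSError:
        return ""


print("NEON available:", cpu_has_neon(read_cpuinfo()))
```

A check like this only helps if the binary carries both a NEON and a non-NEON code path, which may be what the separate C file idea was aiming at.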

ghost commented 3 years ago

CFLAGS=-mfpu=neon python3 setup.py build

This doesn't change anything, at least on my 3B+. I guess it's a choice between:

illume commented 3 years ago

This doesn't change anything, at least on my 3B+.

Ok, thanks for trying it out. I guess it needs some debugging.

MyreMylar commented 3 years ago

I believe that you need to do something like:

python setup.py install -enable-arm-neon

But it has been a while since I added it so I may have it slightly wrong.

On Thu, 3 Dec 2020, 09:54 René Dudfield, notifications@github.com wrote:

This doesn't change anything, at least on my 3B+.

Ok, thanks for trying it out. I guess it needs some debugging.


ghost commented 3 years ago

I believe that you need to do something like: python setup.py install -enable-arm-neon But it has been a while since I added it so I may have it slightly wrong.

Yes, it looks like it is -enable-arm-neon in setup.py. I will try it when I have time in the next few days. 😄

ghost commented 3 years ago

It looks like there is an error when trying to compile the new alpha blitter with neon on arm.

src_c/include/sse2neon.h:1427:17: error: incompatible types when assigning to type '__m128i' from type 'int'
             ret = a;

src_c/alphablit.c:2618:47: error: incompatible type for argument 1 of 'vreinterpretq_u16_s64'
                 mm_sub_alpha = _mm_srli_epi16(_mm_mulhi_epu16(mm_sub_alpha,
                                               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                           _mm_set1_epi16((short)0x8081)), 7);
                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

MyreMylar commented 3 years ago

Looks like sse2neon hasn't implemented _mm_mulhi_epu16 for neon which makes it generate this error.

There is a pull request from 20 days ago that implements it, but it hasn't been merged yet: https://github.com/DLTcollab/sse2neon/pull/221

I'll keep an eye on that thread and pull over an updated version if it gets merged.
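
For context, the expression in the failing line, a high 16-bit multiply by 0x8081 followed by a right shift of 7, matches a common way to divide a 16-bit value by 255 without a division instruction. The scalar Python model below is a sketch of one lane only. That alphablit.c uses it for exactly this purpose is an assumption, but the arithmetic identity itself is checked exhaustively.

```python
def mulhi_epu16(a: int, b: int) -> int:
    """Scalar model of one _mm_mulhi_epu16 lane: high 16 bits of a*b."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) >> 16


def srli_epi16(a: int, count: int) -> int:
    """Scalar model of one _mm_srli_epi16 lane: logical right shift."""
    return (a & 0xFFFF) >> count


def div255(x: int) -> int:
    """Divide a 16-bit value by 255 using multiply-high and shift."""
    return srli_epi16(mulhi_epu16(x, 0x8081), 7)


# Check every unsigned 16-bit input against plain floor division.
mismatches = [x for x in range(0x10000) if div255(x) != x // 255]
print("mismatches:", len(mismatches))
```

Without `_mm_mulhi_epu16` in the translation header, there is no NEON equivalent for the first step, which is why the build stops at that line.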

illume commented 3 years ago

Interesting. Looks like the PR is buggy, so we can't really use it as is.

simdeverywhere seems to have had that instruction since 2017.

MyreMylar commented 3 years ago

Hmm, looks like this repo merged sse2neon into it at some point so perhaps it is generally more complete than the fork we are using.

However, it's a much larger repo (10x in just code size and it has many more files) than sse2neon which was just a single large header file we were able to drop in. I'm not sure we could use simdeverywhere in the same way. The code changes in pygame wouldn't be too bad, but it seems like it would be bad practice to drop a repo of that size into a subfolder of pygame somewhere.

howjmay commented 3 years ago

@MyreMylar The PR has been merged. Sorry for waiting

MyreMylar commented 3 years ago

@HowJMay thank you, no worries about the wait. I've not had much time for coding recently anyway, so I wasn't being held up.

I will look at getting this change over to pygame tomorrow though and see if it will make the intrinsic blitter path compile for Arm.

MyreMylar commented 3 years ago

Got it to compile on my Raspberry Pi with the updated header and the -enable-arm-neon flag. Here are the performance test results:

Pygame new alpha blitter
----------------------------------

Even blit width:
-----------
tested Blit no alpha              : 643.417ms
tested Blit surface alpha         : 2220.227ms
tested Blit pixel to opaque alpha : 3813.791ms
tested Blit pixel to pixel alpha  : 6221.064ms
tested Blit pixel and surf alpha  : 15450.189ms

Odd blit width:
-----------
tested Blit no alpha              : 667.032ms
tested Blit surface alpha         : 2225.092ms
tested Blit pixel to opaque alpha : 5881.047ms
tested Blit pixel to pixel alpha  : 9227.48ms
tested Blit pixel and surf alpha  : 15300.087ms

Versus -

SDL2 alpha blitter
------------------------

Test 0: 1000 blits of image size 750, 1050

Even blit width:
-----------
tested Blit no alpha              : 757.062ms
tested Blit surface alpha         : 2161.29ms
tested Blit pixel to opaque alpha : 1424.41ms
tested Blit pixel to pixel alpha  : 1467.83ms
tested Blit pixel and surf alpha  : 9620.407ms

Odd blit width:
-----------
tested Blit no alpha              : 730.243ms
tested Blit surface alpha         : 2111.845ms
tested Blit pixel to opaque alpha : 1407.704ms
tested Blit pixel to pixel alpha  : 1404.866ms
tested Blit pixel and surf alpha  : 9539.085ms

That is a lot closer, generally, and about what I'd expect when trading SDL's hand-tuned-for-Pi (but visually wrong) assembly for multi-platform ported intrinsic functions. Those have to do a bit more work, both because they are converted from SSE2 and because they are doing the correct calculations.
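
To put the totals in per-blit terms, here is a small arithmetic sketch. It assumes the "1000 blits of image size 750, 1050" line applies to both result sets and that each total covers only the blits; the helper names are hypothetical.

```python
BLITS = 1000
WIDTH, HEIGHT = 750, 1050

# Even-width totals in milliseconds, copied from the two result sets above.
NEW_BLITTER_NEON = {
    "pixel to opaque alpha": 3813.791,
    "pixel to pixel alpha": 6221.064,
    "pixel and surf alpha": 15450.189,
}
SDL2_BLITTER = {
    "pixel to opaque alpha": 1424.41,
    "pixel to pixel alpha": 1467.83,
    "pixel and surf alpha": 9620.407,
}


def ms_per_blit(total_ms: float, blits: int = BLITS) -> float:
    """Average time of one blit in milliseconds."""
    return total_ms / blits


def megapixels_per_second(total_ms: float, blits: int = BLITS) -> float:
    """Blitted pixels per second, in millions."""
    pixels = WIDTH * HEIGHT * blits
    return pixels / (total_ms / 1000.0) / 1e6


for name in SDL2_BLITTER:
    new_ms = ms_per_blit(NEW_BLITTER_NEON[name])
    sdl_ms = ms_per_blit(SDL2_BLITTER[name])
    print(f"{name:<22}: new {new_ms:6.2f} ms/blit, SDL2 {sdl_ms:6.2f} ms/blit")
```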

pygame's 'surface_test' module also runs through successfully on my Pi.

Oddly, I do have to comment out a new pragma section at the start of the sse2neon header to make the header work on my Pi. I'll attach the PR updating sse2neon in a minute once I've switched back to PC.

MyreMylar commented 3 years ago

I linked the PR. I don't think it is a final resolution to this issue but it should at least improve the situation.