Thanks for the report.
Oops. I think we still need to handle optimization flags automatically.
You would probably get better results with this... I wonder what you get?
CFLAGS=-mfpu=neon python3 setup.py build
The blitter being 10 times slower on the Pi is pretty much what would be expected without enabling NEON, as it will just use the normal, non-NEON CPU math path.
Probably we could quick-fix this by just disabling the new blitter path on the Pi. If I recall, the original problem with the Pi is that if you want to maintain compatibility with the Raspberry Pi 1 you can't use the NEON/SIMD registers, because the Pi 1 didn't have them.
The only problem with going that way is that the SDL2 alpha blitters are kind of bug-ridden and written in assembly that nobody else understands.
I think we were never able to test if code built with neon optimisations enabled still worked on the Pi 1 because nobody actually had a Pi 1. There were also some ideas floated about putting all the SIMD code into a separate C file IIRC, but I don't remember exactly what that was going to help with.
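One way a source build could decide for itself whether NEON is available is to look at the CPU feature flags that Linux reports. The following is only a minimal sketch of that idea, assuming the flags are listed on a "Features" line in /proc/cpuinfo; the function name cpu_has_neon and the two sample lines are hypothetical and are not part of pygame's setup.py.

```python
def cpu_has_neon(cpuinfo_text):
    """Return True if any 'Features' line of /proc/cpuinfo-style text lists NEON."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "features":
            flags = value.split()
            # 32-bit ARM kernels report "neon"; 64-bit ARM kernels report "asimd".
            if "neon" in flags or "asimd" in flags:
                return True
    return False


# Illustrative, shortened feature lines (assumed, not captured from real boards).
PI1_LIKE = "Features\t: half thumb fastmult vfp edsp java tls"
PI3_LIKE = "Features\t: half thumb fastmult vfp edsp neon vfpv3 tls vfpv4"

print(cpu_has_neon(PI1_LIKE))  # False
print(cpu_has_neon(PI3_LIKE))  # True

# On a real device the text would come from the file itself:
# with open("/proc/cpuinfo") as f:
#     print(cpu_has_neon(f.read()))
```

A check like this only helps when building on the device that will run the code. It does not solve the compatibility problem for a single prebuilt binary that has to run on both a Pi 1 and later models.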
CFLAGS=-mfpu=neon python3 setup.py build
This doesn't change anything, at least on my 3B+. I guess it's a choice between:
This doesn't change anything, at least on my 3B+.
Ok, thanks for trying it out. I guess it needs some debugging.
I believe that you need to do something like:
python setup.py install -enable-arm-neon
But it has been a while since I added it so I may have it slightly wrong.
On Thu, 3 Dec 2020, 09:54 René Dudfield, notifications@github.com wrote:
This doesn't change anything, at least on my 3B+.
Ok, thanks for trying it out. I guess it needs some debugging.
I believe that you need to do something like:
python setup.py install -enable-arm-neon
But it has been a while since I added it so I may have it slightly wrong.
Yes, it looks like it is -enable-arm-neon in setup.py. I will try it when I have time in the next few days.
It looks like there is an error when trying to compile the new alpha blitter with neon on arm.
src_c/include/sse2neon.h:1427:17: error: incompatible types when assigning to type '__m128i' from type 'int'
ret = a;
src_c/alphablit.c:2618:47: error: incompatible type for argument 1 of 'vreinterpretq_u16_s64'
mm_sub_alpha = _mm_srli_epi16(_mm_mulhi_epu16(mm_sub_alpha,
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
_mm_set1_epi16((short)0x8081)), 7);
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Looks like sse2neon hasn't implemented _mm_mulhi_epu16 for NEON, which makes it generate this error.
There is a pull request to implement it here: https://github.com/DLTcollab/sse2neon/pull/221
It was opened 20 days ago but hasn't been merged yet. I'll keep an eye on that thread and pull over an updated version if it gets merged.
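For readers wondering why the blitter needs that intrinsic at all: the failing line in alphablit.c takes the high 16 bits of a multiplication by 0x8081 and then shifts right by 7, which is a common way to divide a 16-bit value by 255 without a division instruction. Below is a scalar Python model of one 16-bit lane. It is a sketch for illustration, and the helper names are made up here.

```python
def mulhi_epu16(a, b):
    """High 16 bits of an unsigned 16x16-bit product (one lane of _mm_mulhi_epu16)."""
    return ((a & 0xFFFF) * (b & 0xFFFF)) >> 16


def div255(x):
    """Scalar model of _mm_srli_epi16(_mm_mulhi_epu16(x, 0x8081), 7)."""
    return mulhi_epu16(x, 0x8081) >> 7


# Exhaustive check over every 16-bit input: the trick matches integer division.
mismatches = [x for x in range(1 << 16) if div255(x) != x // 255]
print(len(mismatches))  # 0
```

The check passes because 0x8081 * 255 = 2**23 + 127, so the rounding error stays below one whole step for every input under 65536.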
Interesting. Looks like the PR is buggy, so we can't really use it as is.
simdeverywhere seems to have had that instruction since 2017.
Hmm, looks like this repo merged sse2neon into it at some point so perhaps it is generally more complete than the fork we are using.
However, it's a much larger repo (10x in just code size and it has many more files) than sse2neon which was just a single large header file we were able to drop in. I'm not sure we could use simdeverywhere in the same way. The code changes in pygame wouldn't be too bad, but it seems like it would be bad practice to drop a repo of that size into a subfolder of pygame somewhere.
@MyreMylar The PR has been merged. Sorry for waiting
@HowJMay thank you, no worries about the wait. I've not had much time for coding recently anyway, so I wasn't being held up.
I will look at getting this change over to pygame tomorrow though, and see if it will make the intrinsic blitter path compile for Arm.
Got it to compile on my Raspberry Pi with the updated header and the -enable-arm-neon flag. Here are the performance test results:
Pygame new alpha blitter
----------------------------------
Even blit width:
-----------
tested Blit no alpha : 643.417ms
tested Blit surface alpha : 2220.227ms
tested Blit pixel to opaque alpha: 3813.791ms
tested Blit pixel to pixel alpha: 6221.064ms
tested Blit pixel and surf alpha: 15450.189ms
Odd blit width:
-----------
tested Blit no alpha : 667.032ms
tested Blit surface alpha : 2225.092ms
tested Blit pixel to opaque alpha: 5881.047ms
tested Blit pixel to pixel alpha: 9227.48ms
tested Blit pixel and surf alpha: 15300.087ms
Versus -
SDL2 alpha blitter
------------------------
Test 0: 1000 blits of image size 750, 1050
Even blit width:
-----------
tested Blit no alpha : 757.062ms
tested Blit surface alpha : 2161.29ms
tested Blit pixel to opaque alpha: 1424.41ms
tested Blit pixel to pixel alpha: 1467.83ms
tested Blit pixel and surf alpha: 9620.407ms
Odd blit width:
-----------
tested Blit no alpha : 730.243ms
tested Blit surface alpha : 2111.845ms
tested Blit pixel to opaque alpha: 1407.704ms
tested Blit pixel to pixel alpha: 1404.866ms
tested Blit pixel and surf alpha: 9539.085ms
Which is a lot closer, generally, and about what I'd expect from trading SDL's hand-tuned-for-Pi (but visually wrong) assembly for multi-platform ported intrinsic functions. Those have to do a bit more work, both because they are converted from SSE2 and because they are doing the correct calculations.
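To put numbers on "a lot closer", the even-width timings from the two runs above can be turned into ratios. This is just arithmetic on the figures already posted; the dictionary names are invented for the example.

```python
# Even blit width timings in milliseconds, copied from the two result listings above.
NEW_BLITTER_MS = {
    "no alpha": 643.417,
    "surface alpha": 2220.227,
    "pixel to opaque alpha": 3813.791,
    "pixel to pixel alpha": 6221.064,
    "pixel and surf alpha": 15450.189,
}
SDL2_BLITTER_MS = {
    "no alpha": 757.062,
    "surface alpha": 2161.29,
    "pixel to opaque alpha": 1424.41,
    "pixel to pixel alpha": 1467.83,
    "pixel and surf alpha": 9620.407,
}

# A ratio above 1.0 means the new blitter took longer than SDL2's for that test.
ratios = {
    name: NEW_BLITTER_MS[name] / SDL2_BLITTER_MS[name]
    for name in NEW_BLITTER_MS
}

for name, ratio in ratios.items():
    print(f"{name:<24} new/SDL2 = {ratio:.2f}x")
```

With NEON enabled the per-pixel cases come out at roughly 1.6x to 4.2x slower than SDL2's blitter, and the no-alpha case is slightly faster.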
pygame's 'surface_test' module also runs through successfully on my Pi.
Oddly, I do have to comment out a new pragma section at the start of the sse2neon header to make the header work on my Pi. I'll attach the PR updating sse2neon in a minute once I've switched back to PC.
I linked the PR. I don't think it is a final resolution to this issue but it should at least improve the situation.
Config
Raspberry Pi 3B+, latest official Raspberry Pi OS (32-bit), everything default, pygame installed from pip (and the latest dev version compiled to make sure this wasn't fixed since 2.0)
Problem
I was trying some pygame stuff on a Raspberry Pi and it seems that some strange things are going on with the new alpha blitter #2243
Anything using per-pixel alpha blitting is so much slower than with pygame 1 that it's sometimes unusable (10 to 20x slower)
It looks like it's related to the new alpha blitter #2243.
Tests
Running the tests (#2243) posted by @MyreMylar. (Only the even blit width results, because there is not much difference between even and odd.)
pygame 1.9.6
pygame 2.0.1.dev1 (SDL 2.0.9, python 3.7.3) With PYGAME_BLEND_ALPHA_SDL2='1'
pygame 2.0.1.dev1 (SDL 2.0.9, python 3.7.3) New blitter
The results for no alpha and surface alpha are similar, but the others are much, much worse. The biggest problem is pixel to opaque. It's similar with pygame1 and 2 (SDL2), but with the new alpha blitter, it's 25 times slower (on other platforms, like my mac, the new blitter is actually faster than the SDL2 one).
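For anyone repeating the comparison, the SDL2 path is selected through the PYGAME_BLEND_ALPHA_SDL2 environment variable mentioned above. A minimal sketch of setting it from Python follows; the helper name is hypothetical, and the pygame lines are left as comments so the snippet runs without pygame installed.

```python
import os


def select_sdl2_alpha_blitter(environ=os.environ):
    """Ask pygame 2 to use SDL2's alpha blitter instead of its own."""
    environ["PYGAME_BLEND_ALPHA_SDL2"] = "1"


select_sdl2_alpha_blitter()

# The variable has to be set before pygame is initialised:
# import pygame
# pygame.init()
```

Setting the variable in the shell before launching the script has the same effect.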
Anyway, this alpha blit optimization is way over my head... but hopefully this makes sense to someone.
Thanks