swesterfeld / spectmorph

SpectMorph: spectral audio morphing
http://www.spectmorph.org
GNU Lesser General Public License v2.1
64 stars 5 forks source link

LIB: add support for ARM NEON #11

Closed onereddogmedia closed 2 years ago

swesterfeld commented 2 years ago

That is an interesting approach to support SIMD optimizations on ARM without changing the existing algorithms. I have a few comments:

As submitted, it doesn't actually optimize anything, because all the existing SSE code checks if __SSE__ is defined. You can see that there are three files that could potentially provide better performance through vectorization:

$ grep -l __SSE__ */*hh */*cc
lib/smnoisedecoder.cc
lib/smpandaresampler.cc
lib/smmath.hh

So at least these __SSE__ checks need to be changed (maybe to something like #if defined (__SSE__) || defined (SM_ARM_SSE) or similar. All in all we have three vectorized algorithms:

As for any performance optimization, it is probably a good idea to check if the performance is actually improved by the vectorized code. I cannot do this, because I don't have ARM hardware.

For the NoiseDecoder, running tests/testnoiseperf would be interesting. For fast_vector_sin, running tests/testfastsin perf with and without your ARM vectorization code would be interesting.

PandaResampler is special, because its code is in its own repository (https://github.com/swesterfeld/pandaresampler). So since this is a header-only library you'd have to copy-paste the necessary _mm_mul_ps and _mm_add_ps definitions directly into the lib/pandaresampler.cc file, and then run tests/testresampler perf --fpu and compare it with tests/testresampler perf. Later, the SpectMorph PandaResampler would have to be updated by the new pandaresampler.cc.

As for licensing, I've recently changed SpectMorph to LGPL v2.1 or later (https://github.com/swesterfeld/spectmorph/commit/c59dd653da6ea22c24367b99808e786b54773f12), so this would be needed for vectorizing NoiseDecoder and fast_vector_sin. PandaResampler is MPL2.0.

If you've implemented all code yourself it would be great if you'd allow it to be used under both licenses. I also see the possibility that your code could be used in Anklang (https://anklang.testbit.eu/) which is also MPL2.0. If you copypasted your code from elsewhere, the original license would also need to be checked.

onereddogmedia commented 2 years ago

Thanks for replying. This approach to supporting SIMD on ARM NEON is similar to that of the SIMDe library. The SIMD intrinsics are almost the same between NEON and SSE. Interestingly there's not a huge difference in performance on the Apple M1. I have a subset of the SpectMorph running in Xcode, I'm curious to see the timings on the slower Apple ARM processors as used in iPad, I'll update you.

Sorry, I haven't looked at PandaResampler.

License. I marked it MPL, but it might be MIT. The macros are basically from SIMDe without all their preprocessor flags.

swesterfeld commented 2 years ago

I've tried to merge various bits and pieces from your PR into master. It is not a 1:1 merge, I did some changes while merging. Still, I hope you should be able to build from master now, either without changes or with minimal changes. I'd be interested in your feedback on this.

Sorry, I haven't looked at PandaResampler.

Ok, finally I found some time to add NEON support based on your code to PandaResampler. You'll find it in the master branch, with a new performance test. On my devel system (Ryzen 7) SSE is about 3 times faster.

$ tests/testpandaperf 
SSE: 5.57 ns/sample
FPU: 16.61 ns/sample

SSE/FPU speedup: 2.98

I'd be interested in seeing the results of this test using NEON on your M1 system.

onereddogmedia commented 2 years ago

Apologies for the late reply. I get the following results MacBook Air (M1, 2020):

$ ./testpandaperf
SSE: 3.76 ns/sample
FPU: 13.55 ns/sample

SSE/FPU speedup: 3.61
onereddogmedia commented 2 years ago

Macbook Pro (14-inch, 2021, M1 Pro)

$ ./tests/testpandaperf 
SSE: 3.65 ns/sample
FPU: 13.42 ns/sample

SSE/FPU speedup: 3.68
swesterfeld commented 2 years ago

Ok, the results for PandaResampler on M1 look really good. So after all, your code / suggestion to SIMDify stuff using NEON on M1 really improves performance for the resampling code.

As far as I see everything is merged into master and tested now, so I'll close the PR. Thanks!