Since we are doing the exact same thing to 4 different values right now, we should be able to parallelize the computations.
I have already made a SIMD program in Rust that does this using SSE, AVX, and NEON, but it has some problems. It only gives a performance boost when one of the following conditions is met:
1) platform = x86_64 and -C target-cpu=native is active
This means that to benefit from SIMD, the library has to be built specifically for the x86 CPU it will run on (and even then only if that CPU supports instructions that give a boost).
2) platform = aarch64 and has NEON
This is good to know for when porting to a DSP chip, but the performance boost is nearly negligible. Processing 4 envelopes takes about 36 nanoseconds with the software implementation and 27 nanoseconds with NEON. That is 25% less time, but it might not be worth it. It would only really be worth it if there were:
a) many envelopes being computed per process block
b) or a low clock-rate CPU running the program, possibly with a high sample rate
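For reference, here is a rough sketch of the kind of 4-lane computation being discussed. This is not the code from the PR: the one-pole smoothing update and the names step4_scalar / step4_simd are hypothetical stand-ins for the real envelope math. The x86_64 path only uses SSE intrinsics, which are part of the x86_64 baseline, so it compiles without -C target-cpu=native. Other platforms fall back to the scalar loop.

```rust
/// Scalar reference: one smoothing step for each of four envelopes.
/// The one-pole update is a hypothetical stand-in for the real envelope math.
pub fn step4_scalar(env: &mut [f32; 4], target: &[f32; 4], coeff: f32) {
    for i in 0..4 {
        env[i] = target[i] + coeff * (env[i] - target[i]);
    }
}

/// Same step, processing all four lanes at once with SSE intrinsics.
#[cfg(target_arch = "x86_64")]
pub fn step4_simd(env: &mut [f32; 4], target: &[f32; 4], coeff: f32) {
    use std::arch::x86_64::*;
    // SAFETY: both arrays hold exactly four f32 values, and the unaligned
    // load/store intrinsics have no alignment requirement.
    unsafe {
        let e = _mm_loadu_ps(env.as_ptr());
        let t = _mm_loadu_ps(target.as_ptr());
        let c = _mm_set1_ps(coeff);
        // target + coeff * (env - target), lane by lane
        let out = _mm_add_ps(t, _mm_mul_ps(c, _mm_sub_ps(e, t)));
        _mm_storeu_ps(env.as_mut_ptr(), out);
    }
}

/// Fallback for non-x86_64 targets: reuse the scalar loop.
#[cfg(not(target_arch = "x86_64"))]
pub fn step4_simd(env: &mut [f32; 4], target: &[f32; 4], coeff: f32) {
    step4_scalar(env, target, coeff);
}
```

Both functions perform the same sequence of f32 operations per lane, so they should produce identical results. Whether the SIMD path is actually faster has to be measured per platform.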
I have opened a PR containing the Rust code in case it becomes viable at a later date, perhaps with some extra modifications to make it accessible from the C++ code. Alternatively, I could port the SIMD code to C++, although Rust makes it pretty easy to benchmark the code.
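As an illustration of how lightweight a benchmark can be in Rust, here is a minimal timing sketch using only the standard library. It is not the benchmark from the PR: step4 and bench_ns_per_call are hypothetical names, and the one-pole update is only a placeholder for the real envelope code.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Hypothetical stand-in for the 4-envelope step being measured.
fn step4(env: &mut [f32; 4], target: &[f32; 4], coeff: f32) {
    for i in 0..4 {
        env[i] = target[i] + coeff * (env[i] - target[i]);
    }
}

/// Returns the average wall-clock time per call in nanoseconds.
fn bench_ns_per_call(iters: u32) -> f64 {
    let mut env = [0.0f32; 4];
    let target = [1.0f32, 0.5, 0.25, 0.0];
    let start = Instant::now();
    for _ in 0..iters {
        // black_box keeps the optimizer from removing the work being timed.
        step4(black_box(&mut env), black_box(&target), black_box(0.999));
    }
    black_box(env);
    start.elapsed().as_nanos() as f64 / iters as f64
}
```

Calling bench_ns_per_call(1_000_000) from a release build gives a rough per-call figure. For numbers worth comparing across platforms, a proper benchmarking harness would be a better choice than a hand-rolled loop.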