jurihock / voyx

Standalone real time dynamic vocal harmonizer
GNU General Public License v3.0

Experiments on accelerating algorithms #1

Open jurihock opened 2 years ago

jurihock commented 2 years ago

vDSP

The vDSP library looks interesting, since it can be mixed with native C++ code. However, its complex number arithmetic appears to require the "split complex" memory layout instead of the regular interleaved std::complex representation.
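To illustrate the layout difference, here is a minimal, portable sketch of converting between the interleaved std::complex layout and a split layout with separate real and imaginary arrays. The `SplitComplex` struct and both helper functions are hypothetical names made up for this example; on Apple platforms vDSP has its own `DSPSplitComplex` type and the `vDSP_ctoz` / `vDSP_ztoc` conversion routines, which are not used here so that the sketch compiles anywhere.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not part of vDSP or voyx):
// one array for the real parts, one for the imaginary parts.
struct SplitComplex
{
  std::vector<float> re;
  std::vector<float> im;
};

// Copy interleaved (re, im, re, im, ...) values into two separate arrays.
SplitComplex to_split(const std::vector<std::complex<float>>& interleaved)
{
  SplitComplex split;
  split.re.reserve(interleaved.size());
  split.im.reserve(interleaved.size());

  for (const auto& z : interleaved)
  {
    split.re.push_back(z.real());
    split.im.push_back(z.imag());
  }

  return split;
}

// Reverse direction: merge the two arrays back into std::complex values.
std::vector<std::complex<float>> to_interleaved(const SplitComplex& split)
{
  std::vector<std::complex<float>> interleaved(split.re.size());

  for (std::size_t i = 0; i < interleaved.size(); ++i)
  {
    interleaved[i] = std::complex<float>(split.re[i], split.im[i]);
  }

  return interleaved;
}
```

The extra copy in each direction is part of the cost to weigh against any speedup from the split-complex routines.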

The forward SDFT can be implemented by the following sequence of calls:

vDSP_vsadd
vDSP_zvmul
vDSP_zvadd
vDSP_zvsub
vDSP_zvsub
vDSP_zvzsml
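For comparison, a plain C++ sliding DFT can be sketched with the textbook recurrence X_k[n] = (X_k[n-1] + x[n] - x[n-N]) * exp(j*2*pi*k/N). This is only an assumption about the general shape of the algorithm: it does not reproduce the six vDSP calls above one-to-one, it applies no window, and the class name `SlidingDFT` is hypothetical rather than taken from voyx.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Textbook sliding DFT over the last `size` samples (hypothetical class name).
class SlidingDFT
{
public:
  explicit SlidingDFT(std::size_t size)
    : history(size, 0.0),
      twiddles(size),
      bins(size, std::complex<double>(0.0, 0.0)),
      cursor(0)
  {
    const double pi = std::acos(-1.0);

    // One rotation factor exp(j*2*pi*k/N) per bin.
    for (std::size_t k = 0; k < size; ++k)
    {
      twiddles[k] = std::polar(1.0, 2.0 * pi * double(k) / double(size));
    }
  }

  // Push one sample and update all bins in place.
  const std::vector<std::complex<double>>& push(double sample)
  {
    // x[n] - x[n-N]: the newest sample minus the one leaving the window.
    const double delta = sample - history[cursor];

    history[cursor] = sample;
    cursor = (cursor + 1) % history.size();

    // The bins are independent of each other, so this loop is what
    // the compiler (or vDSP, or a GPU kernel) can run in parallel.
    for (std::size_t k = 0; k < bins.size(); ++k)
    {
      bins[k] = (bins[k] + delta) * twiddles[k];
    }

    return bins;
  }

private:
  std::vector<double> history;                 // circular buffer of the last N samples
  std::vector<std::complex<double>> twiddles;  // per-bin rotation factors
  std::vector<std::complex<double>> bins;      // current spectrum
  std::size_t cursor;                          // position of the oldest sample
};
```

After at least N pushes, the bins equal the ordinary DFT of the last N samples (oldest sample first), up to rounding error that accumulates slowly over time.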

Compared to the vanilla C++ implementation, I can't see any significant performance difference: the time measurements and the CPU usage are just the same, so 👎.

The allocated memory is aligned by default as required and the clang compiler seems to be doing its job very well.

According to the LLVM docs, auto-vectorization is on by default. When it is explicitly switched off via the compiler flags -fno-vectorize -fno-slp-vectorize, the difference is noticeable.

Metal

The first Metal experiment shows a typical "command queue" overhead problem. Although the SDFT can be computed in parallel for a single sample, the same computation needs to be repeated sequentially for every sample of the frame buffer. Maybe indirect command encoding can help to deal with that.

OpenCL

Same story as with Metal... The OpenCL 2.0 spec describes a mechanism for enqueuing kernels from kernels. Still not sure whether, and for how long, OpenCL 2.0 will be supported by Apple. It's still OpenCL 1.2 in 2022...

Limiting signal bandwidth

Probably the fastest way of computing the SDFT is not computing it at all... One main feature of the SDFT is its arbitrary spectral resolution and thus the possibility of limiting the signal bandwidth to save CPU cycles.

As long as the source signal bandwidth is known in advance, there is no need to compute all spectral bands at the analysis step. At the synthesis step, the destination signal bandwidth can also be adjusted according to the applied pitch shifting factor.
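As a rough sketch of the saving, the bins that actually need to be computed for a known band can be derived from the band edges. The example assumes uniformly spaced bins at samplerate / dftsize Hz, which is only one possible layout; `BinRange` and `bins_for_band` are hypothetical names made up for this illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper type: inclusive range of bin indices.
struct BinRange
{
  std::size_t first;
  std::size_t last;

  std::size_t count() const
  {
    return (last >= first) ? (last - first + 1) : 0;
  }
};

// Bins whose center frequency lies inside [fmin, fmax] Hz,
// assuming uniformly spaced bins (an assumption of this sketch).
BinRange bins_for_band(double fmin, double fmax, double samplerate, std::size_t dftsize)
{
  const double resolution = samplerate / double(dftsize); // Hz per bin
  const double nyquist = samplerate / 2.0;

  // Clamp the requested band to the representable range.
  fmin = std::max(fmin, 0.0);
  fmax = std::min(fmax, nyquist);

  BinRange range;
  range.first = std::size_t(std::ceil(fmin / resolution));
  range.last = std::size_t(std::floor(fmax / resolution));

  return range;
}
```

For example, with a sample rate of 48000 Hz and a DFT size of 1024, a band of 80 Hz to 8000 Hz covers bins 2 to 170, that is 169 bins instead of the 513 bins from 0 Hz up to Nyquist.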

Utilize both CPU and GPU simultaneously

If delayed by one frame, the computation task can be spread between CPU and GPU. For example, in case of the SDFT the frame size can be reduced to something like 64 or 32 samples, which will result in a latency of about 1 ms at 44.1 kHz (roughly 1.5 ms and 0.7 ms, respectively) and is still an order of magnitude better than STFT.
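The latency figure follows directly from frame size divided by sample rate. A tiny sketch of that arithmetic, where `frame_latency_ms` is a hypothetical helper name and only the buffering delay of a single frame is counted:

```cpp
// Duration of one frame in milliseconds (buffering delay only;
// any additional processing or driver latency is not included).
double frame_latency_ms(double framesize, double samplerate)
{
  return 1000.0 * framesize / samplerate;
}
```

At 44100 Hz this gives about 1.45 ms for 64 samples and about 0.73 ms for 32 samples.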

Reduce sample rate

This is currently the most useful hack, which is actually another way of limiting the bandwidth.

E.g. sample rate conversion 48000 (adc) => 16000 (dsp) => 48000 (dac) works just fine on the CPU with headroom for spectral processing.
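The 48000 => 16000 => 48000 round trip corresponds to a rate change by a factor of 3 in each direction. The sketch below only illustrates that structure with deliberately naive filters (block averaging on the way down, linear interpolation on the way up); it is an assumption made for illustration, not the resampler used in voyx, and a real implementation would use a proper anti-aliasing low-pass filter. The function names `decimate` and `interpolate` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Naive decimation: average each group of `factor` samples
// (a crude low-pass) and keep one value per group.
std::vector<float> decimate(const std::vector<float>& input, std::size_t factor)
{
  std::vector<float> output(input.size() / factor);

  for (std::size_t i = 0; i < output.size(); ++i)
  {
    float sum = 0.0f;

    for (std::size_t j = 0; j < factor; ++j)
    {
      sum += input[i * factor + j];
    }

    output[i] = sum / float(factor);
  }

  return output;
}

// Naive interpolation: linear ramp between neighbouring low-rate samples,
// holding the last sample at the end of the buffer.
std::vector<float> interpolate(const std::vector<float>& input, std::size_t factor)
{
  std::vector<float> output(input.size() * factor);

  for (std::size_t i = 0; i < input.size(); ++i)
  {
    const float a = input[i];
    const float b = (i + 1 < input.size()) ? input[i + 1] : input[i];

    for (std::size_t j = 0; j < factor; ++j)
    {
      output[i * factor + j] = a + (b - a) * float(j) / float(factor);
    }
  }

  return output;
}
```

With a factor of 3, a 480-sample block at 48000 Hz becomes 160 samples at 16000 Hz for the spectral processing and 480 samples again on the way out, so the DSP stage only has to handle a third of the samples.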