Closed jurihock closed 2 years ago
Radix-2 [1] and Split-Radix [1,2] looks interesting, but the FORTRAN example in [1] is just unreadable for me. According to [3] the Split-Radix seems to be the fastest option. Sequential output memory layout in each case is currently the drawback, since no direct mapping to std::complex
possible.
The implementation idea in [5] is a combination of [4] with additional "lightweight" post-processing and is easy to understand. But it requires a data buffer due to out-of-place computation, which is not a huge problem anyway.
Implementing Fast Fourier Transform Algorithms of Real-Valued Sequences With the TMS320 DSP Platform provides just another explanation for the (un-)packing the 2N point FFT results. In contrast to [5] the reverse calculation is described too. Equations 12 and 14 seem to work fine. The alternative equations 15 and 16 seem to be buggy...
fakufaku/esp32-fft shows an additional reduction of the (un-)packing loop by N/2...
Performance comparison
Total C++ program execution time to process the voice.wav
file with old pocketfft and new RFFT enabled on M1 Mac:
options | pocketfft | RFFT |
---|---|---|
-p 1 |
0.220s | 0.270s |
-p 1 -q 1 |
0.310s | 0.400s |
Quality comparison
Output files generated by Python script, which still uses pocketfft, and C++ with RFFT enabled are compared.
With -p 1
or -p 1 -q 1
options (e.g. no spectral processing, just STFT with liftering if enabled) the absolute amplitude difference of both output files is max. 0.055.
The RFFT produces no high frequency noise etc. The spectrum plot in Audacity is about the same.
With options -p 0.5
, -p 1
, -p 2
or -p 1,1.25,1.5
no audible difference.
The option combinations -p >=1
and -q 1
are fine.
The only problem is finally fixed in 428932a seems to be the option combinations -p <1
and -q 1
, which is possible the same kind of problem as mentioned in #4. A further comparison with FFTW #6 is necessary. Also check the double precision #8.
Since there are no more obvious issues, the pocketfft will be finally replaced with RFFT and removed from sources.
The factorization trick assumes a complex-valued time domain signal representation. To additionally halve the computational effort and to possibly avoid data caches, a further workaround for real-valued time domain signals is therefore needed: