Unexpected Issues with Real FFT

For educational purposes, I have implemented an Overlap-Save FFT using KFR, however, I have experienced very weird issues.

Take the code for Overlap-Add based on complex to complex FFT:

void overlapSaveConv(
    float const * data,
    float* out,
    float const * filter,
    size_t nSamples,
    size_t filterLength,
    size_t FFTSize
) {

    if(FFTSize < (filterLength + 1)) throw std::invalid_argument("FFT Size must be at least one sample larger than filter length!");
    if(FFTSize > nSamples) throw std::invalid_argument("FFTSize must not be greater than signal length! (currently giving wrong results in that case!)");

    //N = L + M - 1
    //N = L + filterLength - 1
    //=> L = N - M + 1

    const size_t N = FFTSize;
    const size_t M = filterLength;
    const size_t step_size = N - (M - 1);

    //We need to zero pad the filter
    std::vector<kfr::complex<float>> paddedFilter; paddedFilter.resize(FFTSize);
    for(size_t index = 0; index < filterLength; ++index) paddedFilter[index] = kfr::complex<float>{filter[index], 0.0f};
    //Compute Fourier Transform of filter, that will be used a lot of times in the loop
    auto H = kfr::dft(kfr::make_univector(paddedFilter));

    //As far as I understand, it is not (easily) possible to implement overlap save in place
    //This is due to the fact that we prepend M-1 / overlap zeros and then go stepsize further
    //However, if we go stepsize further, the beginning of the data we need has already been overwritten by the previous block
    //results (due to the offset from the zeros)

    //Two possibilities:
    // - Copy the input array
    // - Separate buffer for the output array

    //As copying the input array allows us to get rid of logic inside the loop
    //(recognise whether we need to prepend or append zeros), this is our choice.

    //We need to prepend and append zeros to our data to account for transients (prepend) and the block size (append)
    std::vector<kfr::complex<float>> paddedData;
    paddedData.resize(
        M - 1 + //Prepend M-1 (filterLength - 1) zeros at the beginning
        nSamples + //Space for the actual samples
        FFTSize //Append zeros at the end so we don't need to worry about in the loop
    );
    for(size_t index = 0; index < nSamples; ++index) paddedData[index + M - 1] = kfr::complex<float>{data[index], 0.0f};

    size_t i = 0;
    while(i < nSamples) {

        //Gather the data of our current block and perform an FFT
        auto xSamplesUnivector = kfr::make_univector(paddedData.data() + i, FFTSize);
        auto fftx = kfr::dft(xSamplesUnivector);

        //Apply the filter to our current block
        fftx = fftx * H;

        //Transform back to the time domain and apply proper scaling!
        auto yt = kfr::dft(fftx);
        yt /= double(yt.size());

        //Copy our result to the output
        for(size_t index = 0; index < std::min((N - (M - 1)), nSamples - i + 1); ++index) {
            out[index + i] = yt[index + (M - 1)].real();
        }

        i += step_size;
    }
}

which works perfectly fine!

But now, if I convert this code to real FFT (minimally invasive):

void overlapSaveConv(
    float const * data,
    float* out,
    float const * filter,
    size_t nSamples,
    size_t filterLength,
    size_t FFTSize
) {

    if(FFTSize < (filterLength + 1)) throw std::invalid_argument("FFT Size must be at least one sample larger than filter length!");
    if(FFTSize > nSamples) throw std::invalid_argument("FFTSize must not be greater than signal length! (currently giving wrong results in that case!)");

    //N = L + M - 1
    //N = L + filterLength - 1
    //=> L = N - M + 1

    const size_t N = FFTSize;
    const size_t M = filterLength;
    const size_t step_size = N - (M - 1);

    //We need to zero pad the filter
    std::vector</*kfr::complex<*/float/*>*/> paddedFilter; paddedFilter.resize(FFTSize);
    for(size_t index = 0; index < filterLength; ++index) paddedFilter[index] = /*kfr::complex<float>{*/filter[index]/*, 0.0f}*/;
    //Compute Fourier Transform of filter, that will be used a lot of times in the loop
    auto H = kfr::realdft(kfr::make_univector(paddedFilter));

    //As far as I understand, it is not (easily) possible to implement overlap save in place
    //This is due to the fact that we prepend M-1 / overlap zeros and then go stepsize further
    //However, if we go stepsize further, the beginning of the data we need has already been overwritten by the previous block
    //results (due to the offset from the zeros)

    //Two possibilities:
    // - Copy the input array
    // - Separate buffer for the output array

    //As copying the input array allows us to get rid of logic inside the loop
    //(recognise whether we need to prepend or append zeros), this is our choice.

    //We need to prepend and append zeros to our data to account for transients (prepend) and the block size (append)
    std::vector</*kfr::complex<*/float/*>*/> paddedData;
    paddedData.resize(
        M - 1 + //Prepend M-1 (filterLength - 1) zeros at the beginning
        nSamples + //Space for the actual samples
        FFTSize //Append zeros at the end so we don't need to worry about in the loop
    );
    for(size_t index = 0; index < nSamples; ++index) paddedData[index + M - 1] = /*kfr::complex<float>{*/data[index]/*, 0.0f}*/;

    size_t i = 0;
    while(i < nSamples) {

        //Gather the data of our current block and perform an FFT
        auto xSamplesUnivector = kfr::make_univector(paddedData.data() + i, FFTSize);
        auto fftx = kfr::realdft(xSamplesUnivector);

        //Apply the filter to our current block
        fftx = fftx * H;

        //Transform back to the time domain and apply proper scaling!
        auto yt = kfr::irealdft(fftx);
        yt /= double(yt.size());

        //Copy our result to the output
        for(size_t index = 0; index < std::min((N - (M - 1)), nSamples - i + 1); ++index) {
            out[index + i] = yt[index + (M - 1)]/*.real()*/;
        }

        i += step_size;
    }
}

Results are, well, weird. (blue is the reference curve calculated by KFR convolution implementation)

After rereading the docs about a bazillion times, I'm still not sure what is happening. The real fft should work as a drop-in replacement for the complex-to-complex fft in this case, right?

Version used is 4.2.1 (9fc73247f43b303617329294ae264613df4dce71)

(Please note that I have eliminated all memcpy optimisation etc. for more clarity and to make sure it's really just related to complex-to-complex vs real-to-complex fft)

kfrlib / kfr

Unexpected Issues with Real FFT #134