interp: Specialized, optimized implementations for common interpolations

rawler commented 4 years ago

This is step1 of 2 in optimizing true-peak analysis, optimizing the interpolator itself. The key of the optimization is the realization that for every single new frame (sample N channels) of input, there is 4 (or 2) new outputs generated, and that the Polyphase FIR filter can be seen as a 124 -tap matrix instead of a 49-tap vector (first and last tap is 0 anyways). All the 4 new outputs can then be calculated at once as; sum(input_frame * coefficient_row for [input_frame; coefficient_row] in input_buffer.zip(coefficient_rows)). This calculation is easily auto-vectorized by the compiler, and for stereo 4x upsampling, at least 8 multiplications and additions will be done at a time.

Open question for the PR; the code uses a filter of f32 coefficients. This causes 0.000002 precision-diff with the 64-bit previous implementation, for a small but noticeable speedup, and much simpler generic code. The PR CAN be rewritten keeping the 64-bit filter and accumulator, but the code will be slightly more messy, and some performance lost. As a comparison, ffmpeg:s ebur128 meter only shows TP in dB with a single decimal.

My setup is not perfect for benchmarking (Laptop with speedstep and throttling, but roughly)

Before interp process: 49 taps 2 factors 2ch C [16.324 ms 16.355 ms 16.392 ms] Rust [14.568 ms 14.589 ms 14.611 ms] ~ 89% time elapsed

interp process: 49 taps 4 factors 2ch C [24.234 ms 24.255 ms 24.276 ms] Rust [23.349 ms 23.460 ms 23.574 ms] ~ 96% time elapsed

After interp process: 49 taps 2 factors 2ch C [15.343 ms 15.368 ms 15.395 ms] Rust [6.1902 ms 6.2016 ms 6.2144 ms] ~ 40% time elapsed

interp process: 49 taps 4 factors 2ch C [24.153 ms 24.172 ms 24.191 ms] Rust [4.9936 ms 4.9994 ms 5.0053 ms] ~ 20% time elapsed

note: I do not know why the 4-factor upscaling in the new code is faster than the 2-factor. It's complete counter-intuitive, but the quickcheck-test against the C-implementation seem to pass (albeit with the degraded precision) so I'm writing it off as some quirk, possibly related to natural memory-alignment for the SIMD-operations

sdroege commented 4 years ago

Thanks! I'll review this in the next days. Switching the coefficients to f32 seems ok, we already store everything else as f32 anyway so the improved precision would only be very transient. I'll check how much effect that has in detail though.

If you have any other questions before I do a proper review, please just ask :)

sdroege commented 4 years ago

Can you add some numbers in the commit message (and PR description) from the benchmarks you did?

rawler commented 4 years ago

Can you add some numbers in the commit message (and PR description) from the benchmarks you did?

Sure. I chose not to do it initially, since I have a little bit of a problem with benchmarks on my laptop (throttling causes some uncertainty). But here's a best effort attached.

sdroege commented 4 years ago

No problem, I can run it here later and update it with my numbers if they're more stable :)

sdroege commented 3 years ago

Thanks for the update. I'll go through the code another time later as it changed quite a bit since last time

rawler commented 3 years ago

There. I think I've resolved all open questions, and from my side I'm ready to merge?

A showed this PR to a colleague who came up with yet another great optimization to this code, but I'd say that can be a PR on it's own, this PR is already doing more than I want, really :)

sdroege commented 3 years ago

Apart from the tests failing, yes :) And can you make the FUP commit message a bit more descriptive?

rawler commented 3 years ago

Sorry, "FUP" is my own cue that I plan to rebase with a "fixup" ;) But I left it as it's separate commit with a proper message instead.

rawler commented 3 years ago

I'm all happy now. Please feel free to merge unless you find something else that should be improved :)

sdroege commented 3 years ago

Thanks!

sdroege / ebur128

interp: Specialized, optimized implementations for common interpolations #34