Closed rawler closed 3 years ago
Thanks! I'll review this in the next days. Switching the coefficients to f32
seems ok, we already store everything else as f32
anyway so the improved precision would only be very transient. I'll check how much effect that has in detail though.
If you have any other questions before I do a proper review, please just ask :)
Can you add some numbers in the commit message (and PR description) from the benchmarks you did?
Can you add some numbers in the commit message (and PR description) from the benchmarks you did?
Sure. I chose not to do it initially, since I have a little bit of a problem with benchmarks on my laptop (throttling causes some uncertainty). But here's a best effort attached.
No problem, I can run it here later and update it with my numbers if they're more stable :)
Thanks for the update. I'll go through the code another time later as it changed quite a bit since last time
There. I think I've resolved all open questions, and from my side I'm ready to merge?
A showed this PR to a colleague who came up with yet another great optimization to this code, but I'd say that can be a PR on it's own, this PR is already doing more than I want, really :)
Apart from the tests failing, yes :) And can you make the FUP
commit message a bit more descriptive?
Sorry, "FUP" is my own cue that I plan to rebase with a "fixup" ;) But I left it as it's separate commit with a proper message instead.
I'm all happy now. Please feel free to merge unless you find something else that should be improved :)
Thanks!
This is step1 of 2 in optimizing true-peak analysis, optimizing the interpolator itself. The key of the optimization is the realization that for every single new frame (sample N channels) of input, there is 4 (or 2) new outputs generated, and that the Polyphase FIR filter can be seen as a 124 -tap matrix instead of a 49-tap vector (first and last tap is 0 anyways). All the 4 new outputs can then be calculated at once as;
sum(input_frame * coefficient_row for [input_frame; coefficient_row] in input_buffer.zip(coefficient_rows))
. This calculation is easily auto-vectorized by the compiler, and for stereo 4x upsampling, at least 8 multiplications and additions will be done at a time.Open question for the PR; the code uses a filter of f32 coefficients. This causes 0.000002 precision-diff with the 64-bit previous implementation, for a small but noticeable speedup, and much simpler generic code. The PR CAN be rewritten keeping the 64-bit filter and accumulator, but the code will be slightly more messy, and some performance lost. As a comparison, ffmpeg:s ebur128 meter only shows TP in dB with a single decimal.
My setup is not perfect for benchmarking (Laptop with speedstep and throttling, but roughly)
Before interp process: 49 taps 2 factors 2ch C [16.324 ms 16.355 ms 16.392 ms] Rust [14.568 ms 14.589 ms 14.611 ms] ~ 89% time elapsed
interp process: 49 taps 4 factors 2ch C [24.234 ms 24.255 ms 24.276 ms] Rust [23.349 ms 23.460 ms 23.574 ms] ~ 96% time elapsed
After interp process: 49 taps 2 factors 2ch C [15.343 ms 15.368 ms 15.395 ms] Rust [6.1902 ms 6.2016 ms 6.2144 ms] ~ 40% time elapsed
interp process: 49 taps 4 factors 2ch C [24.153 ms 24.172 ms 24.191 ms] Rust [4.9936 ms 4.9994 ms 5.0053 ms] ~ 20% time elapsed
note: I do not know why the 4-factor upscaling in the new code is faster than the 2-factor. It's complete counter-intuitive, but the quickcheck-test against the C-implementation seem to pass (albeit with the degraded precision) so I'm writing it off as some quirk, possibly related to natural memory-alignment for the SIMD-operations