luigi-rosso / daisy-bitstream-correlation


Performance optimization ideas #1

Open recursinging opened 3 years ago

recursinging commented 3 years ago

Hi,

I got the code compiled and running on my Daisy-based Eurorack module. Indeed, the algorithm needs too many cycles unless the frequency window is unusably small. I was able to narrow it down to the autocorrelation function/lambda, which makes sense, as its cost scales with the buffer size.

I think it would be a good start to unroll this part of the algo and see how to speed it up. I'll give it a try when I find a little more time for it.

luigi-rosso commented 3 years ago

Yeah, unrolling the lambda seems like a good first test, but it's definitely some form of n*n, so I think the window size is going to be the big factor with regard to performance.
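To make the n*n point concrete, here is a minimal sketch of a bitstream autocorrelation. It is not the code from this repo: it assumes the correlation is done by XOR-ing the packed sign bits against a shifted copy of themselves and counting the mismatching bits, and the names `pack_signs`, `mismatches_at_lag` and `best_lag` are made up for illustration. `__builtin_popcount` is the GCC/Clang builtin.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one bit per sample (1 = positive), LSB first.
std::vector<std::uint32_t> pack_signs(const std::vector<float>& samples) {
    std::vector<std::uint32_t> bits((samples.size() + 31) / 32, 0u);
    for (std::size_t i = 0; i < samples.size(); ++i) {
        if (samples[i] > 0.0f) bits[i / 32] |= (1u << (i % 32));
    }
    return bits;
}

// Number of bits that differ between the first `window_words` words and the
// same stream shifted by `lag` bits. Lower means better correlation.
std::size_t mismatches_at_lag(const std::vector<std::uint32_t>& bits,
                              std::size_t window_words, std::size_t lag) {
    const std::size_t index = lag / 32;  // whole words to skip
    const unsigned shift = lag % 32;     // remaining bit offset
    std::size_t count = 0;
    for (std::size_t w = 0; w < window_words; ++w) {
        std::uint32_t shifted = bits[w + index] >> shift;
        if (shift != 0) {
            // Pull the missing high bits in from the next word.
            shifted |= bits[w + index + 1] << (32 - shift);
        }
        count += static_cast<std::size_t>(__builtin_popcount(bits[w] ^ shifted));
    }
    return count;
}

// Try every lag in [min_lag, max_lag] and keep the first one with the fewest
// mismatches. Work is window_words * number_of_lags, which is the n*n part.
// The caller must make sure `bits` holds at least
// window_words + max_lag / 32 + 1 words.
std::size_t best_lag(const std::vector<std::uint32_t>& bits,
                     std::size_t window_words,
                     std::size_t min_lag, std::size_t max_lag) {
    std::size_t best = min_lag;
    std::size_t best_count = mismatches_at_lag(bits, window_words, min_lag);
    for (std::size_t lag = min_lag + 1; lag <= max_lag; ++lag) {
        const std::size_t c = mismatches_at_lag(bits, window_words, lag);
        if (c < best_count) {
            best_count = c;
            best = lag;
        }
    }
    return best;
}
```

Doubling the window doubles both the words per lag and (usually) the number of lags worth trying, so the cost roughly quadruples. That is why shrinking the window helps so much more than unrolling is likely to.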

Another quick fix to at least test the algorithm better might be to increase the callback buffer's size (seed.SetAudioBlockSize I think). That should give it more samples in the buffer and we still process the correlation at the same window size, so it'll avoid the skipping and let you do better testing of the correlation itself (to determine if it's even worthwhile).

recursinging commented 3 years ago

So I did a little rudimentary profiling and looked into this a little further. It appears a large share of the cycles is spent in GCC's __builtin_popcount function.

The STM32H7 is ARMv7 which has a VCNT instruction. It appears gcc is not using it for this case, but a lookup table method instead. There is a StackOverflow discussion with a lot of insight regarding this topic.
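One caveat worth checking before writing assembly: as far as I know, VCNT belongs to the NEON (Advanced SIMD) extension, which Cortex-M cores such as the Cortex-M7 in the STM32H7 do not implement. If that holds for the Daisy, a branch-free bit-twiddling popcount in plain C++ may be the more realistic thing to try against the lookup table. The sketch below is the well-known SWAR (parallel bit count) technique; `popcount_swar` is a name made up here, and whether it actually beats the builtin on this target would need to be measured.

```cpp
#include <cstdint>

// Branch-free population count for a 32-bit word (SWAR technique).
// Hypothetical drop-in for __builtin_popcount; not taken from the repo.
inline std::uint32_t popcount_swar(std::uint32_t x) {
    // Each 2-bit field now holds the number of set bits it had (0..2).
    x = x - ((x >> 1) & 0x55555555u);
    // Sum neighbouring 2-bit fields into 4-bit fields (0..4).
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    // Sum neighbouring 4-bit fields into bytes (0..8).
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    // The multiply adds all four bytes into the top byte.
    return (x * 0x01010101u) >> 24;
}
```

It uses only shifts, masks, adds and one multiply, with no memory access, so it avoids the table loads entirely.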

I might need to learn how to write some assembly!

luigi-rosso commented 3 years ago

I'm surprised that function isn't crazy optimized as is. It does get used a lot since the entire thing works off of a bitstream. I haven't looked at the logic closely enough to determine if that's for memory optimization or some algorithmic technique. If it's for memory optimization, doing away with the bitstream entirely may provide faster results (sacrificing 8x more memory usage)....

EDIT: Just saw the specifics you posted about how that function works, interesting...yeah maybe writing some Daisy optimized assembly will help! I'd probably have gcc spit out a .S file (I'm assuming it can do that like clang) to analyze the assembly it's generating to start with.

recursinging commented 3 years ago

I talked to Joel De Guzman, the guy who wrote this algo. He pointed me to the optimizations described in this blog post.

The implementation is not trivial, so I went ahead and hacked his Q library implementation into DaisySP.

The result is impressive, both in performance and accuracy. It works well on the Daisy for signals with an F0 up to around 1200 Hz. Beyond that point the optimization mentioned above becomes irrelevant and it needs too many cycles again.

I think the ARM VCNT instruction might eke a little more bandwidth out of it.

BTW, if you are a guitar player, have a look through Joel's blog. He came up with this algo in order to drive his infinite sustain guitar. Not entirely unlike what you're aiming to achieve.

luigi-rosso commented 3 years ago

Oh cool, going to check that out! Yeah, 1200 Hz is fine for me! Really curious to try out the optimizations. The zero-crossing optimization is super smart (I thought I saw some of it in the code I was hacking at the other day too).

Woah that blog is a gold mine! Really amazing stuff.

luigi-rosso commented 3 years ago

Wow that Q library is crazy comprehensive. I just got it compiling on Daisy (only hack I had to do was comment out exceptions, could send him a PR with those ifdef-ed out). Is that what you had to do? I'll try the pitch detector tomorrow!

recursinging commented 3 years ago

I actually copied just the pitch detection and support classes into DaisySP. I had to massage it a bit, but I'd like to eventually get an implementation merged into DaisySP, because Q, while comprehensive, lacks the microcontroller focus and has a lot of overlap with DaisySP. And they are both MIT licensed.

I'm out on vacation with the family at the moment, so I can't share what I've done right now, Wednesday perhaps.

I did take the frequency estimate from this algo and fed it into an SVF with the Q/resonance around 1, so it's on the brink of self-oscillation. Then I passed the Peak output of the filter to the outputs. The result was pretty cool for vocals; I didn't try it with a guitar. The pitch tracking is really impressive though. It struggled a little with some octaves, but was otherwise spot on and super low latency.
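For anyone who wants to try the same idea off the hardware, here is a rough, self-contained sketch of it. It is not the DaisySP `Svf` class: `SimpleSvf` is a made-up Chamberlin-style state variable filter, the "peak" output is assumed to be low-pass minus high-pass, and a small damping value (0.05) stands in for "resonance around 1". The detected pitch is simulated by just passing the frequency in.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical minimal Chamberlin state variable filter (not DaisySP's Svf).
struct SimpleSvf {
    float low = 0.0f;
    float band = 0.0f;
    float f = 0.0f;       // frequency coefficient
    float damp = 0.05f;   // assumed: small damping = close to self-oscillation

    void set_freq(float hz, float sample_rate) {
        f = 2.0f * std::sin(3.14159265f * hz / sample_rate);
    }

    // Process one sample and return the peak output (low - high).
    float process_peak(float in) {
        low += f * band;
        const float high = in - low - damp * band;
        band += f * high;
        return low - high;
    }
};

// Feed one second of a unit sine at `input_hz` through a filter tuned to
// `filter_hz` and return the largest output magnitude in the second half,
// after the start-up transient has died down.
float peak_amplitude(float filter_hz, float input_hz,
                     float sample_rate = 48000.0f) {
    SimpleSvf svf;
    svf.set_freq(filter_hz, sample_rate);
    const int n = static_cast<int>(sample_rate);
    float peak = 0.0f;
    for (int i = 0; i < n; ++i) {
        const double phase = 2.0 * 3.14159265358979 * input_hz * i / sample_rate;
        const float out = svf.process_peak(static_cast<float>(std::sin(phase)));
        if (i >= n / 2) peak = std::max(peak, std::fabs(out));
    }
    return peak;
}
```

In a real patch `set_freq` would be called with the pitch detector's latest estimate each block. When the filter is tuned to the input's fundamental the peak output rings strongly, and when it is tuned elsewhere the output stays close to the input level, which is roughly the effect described above.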

There are a lot of interesting directions to take this.