very high cpu usage

falkTX commented 2 years ago

opening a ticket to generate a discussion around this. currently the plugin is quite heavy, ideas for optimizing its cpu usage would be quite welcome.

we could try a few compiler optimization flags and see what works best. also reducing the gui-oriented calls on the dsp side, as mentioned in other tickets.

this can be a blocker for some people doing live-streams, as capturing + recording already takes a significant amount of cpu. if audio processing does too, the system might not be very responsive when all parts are running.

trummerschlunk commented 2 years ago

yes, the lighter on cpu, the better. this is quite out of my expertise, so your ideas are very welcome.

magnetophon commented 2 years ago

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

falkTX commented 2 years ago

Put some benchmarks in place, here are the results. I will post the data points for -scal (which is often considered the default/normal optimization) and whichever ends up being best among all the faust-provided options. All of these use the default build flags from DPF (-O3 -ffast-math -mtune=generic etc) with LTO enabled. Tests were run on a Mac mini M1.

Test 0: default flags, nothing extra added (cold)

-scal : 9.63834 MBytes/sec (DSP CPU % : 7.08506 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1962 MBytes/sec (DSP CPU % : 6.66458 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 0: default flags, nothing extra added (warm)

-scal : 9.63295 MBytes/sec (DSP CPU % : 7.10624 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1784 MBytes/sec (DSP CPU % : 6.69842 at 44100 Hz) with -vec -lv 0 -vs 8

Test 1: using -Ofast

-scal : 9.63711 MBytes/sec (DSP CPU % : 7.10111 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1819 MBytes/sec (DSP CPU % : 6.70025 at 44100 Hz) with -vec -lv 0 -vs 8

Test 2: using -fprefetch-loop-arrays

-scal : 9.64519 MBytes/sec (DSP CPU % : 7.0923 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1949 MBytes/sec (DSP CPU % : 6.68685 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 3: using -fsingle-precision-constant

-scal : 9.64692 MBytes/sec (DSP CPU % : 7.08903 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1922 MBytes/sec (DSP CPU % : 6.6952 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 4: using -ftree-vectorize

-scal : 9.63839 MBytes/sec (DSP CPU % : 7.0859 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1845 MBytes/sec (DSP CPU % : 6.70669 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 5: using -funroll-loops

-scal : 9.64829 MBytes/sec (DSP CPU % : 7.08453 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1973 MBytes/sec (DSP CPU % : 6.67201 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 6: using -fprefetch-loop-arrays -funroll-loops -funsafe-loop-optimizations combo

-scal : 9.64283 MBytes/sec (DSP CPU % : 7.09134 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1936 MBytes/sec (DSP CPU % : 6.67486 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Final test: enabling ALL the flags, that is, -Ofast -fomit-frame-pointer -fprefetch-loop-arrays -fsingle-precision-constant -ftree-vectorize -funroll-loops -funsafe-loop-optimizations

-scal : 9.6323 MBytes/sec (DSP CPU % : 7.07991 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1931 MBytes/sec (DSP CPU % : 6.78578 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Hopefully now we can see some patterns.

falkTX commented 2 years ago

Best seems to usually be -vec -fun -lv 0 -vs 8 for faust options.

Sadly some of these tests were invalid. clang does not support -fprefetch-loop-arrays or -fsingle-precision-constant, so I will have to run these tests with a different compiler or system.

falkTX commented 2 years ago

Doing the same tests now on an x64 cpu, "Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz" as reported by /proc/cpuinfo. Running each 5 times, to get an average.

because this laptop takes a seriously long time to run these, I will do 1 post per type, so I don't accidentally lose precious data

falkTX commented 2 years ago

none/default

-scal : 4.2968 MBytes/sec (DSP CPU % : 16.3113 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.29377 MBytes/sec (DSP CPU % : 16.4532 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27541 MBytes/sec (DSP CPU % : 16.3647 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.24756 MBytes/sec (DSP CPU % : 16.3387 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27696 MBytes/sec (DSP CPU % : 16.3153 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.34468 MBytes/sec (DSP CPU % : 13.2309 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.34045 MBytes/sec (DSP CPU % : 12.9906 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.33833 MBytes/sec (DSP CPU % : 13.0843 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.33762 MBytes/sec (DSP CPU % : 13.1468 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35286 MBytes/sec (DSP CPU % : 12.9168 at 44100 Hz) with -vec -lv 0 -g -vs 8

falkTX commented 2 years ago

using -Ofast

-scal : 4.27438 MBytes/sec (DSP CPU % : 16.8819 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28457 MBytes/sec (DSP CPU % : 16.3201 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.29002 MBytes/sec (DSP CPU % : 16.3388 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30806 MBytes/sec (DSP CPU % : 16.159 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30075 MBytes/sec (DSP CPU % : 16.1666 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.31578 MBytes/sec (DSP CPU % : 13.2509 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35012 MBytes/sec (DSP CPU % : 13.0298 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32876 MBytes/sec (DSP CPU % : 13.9076 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35508 MBytes/sec (DSP CPU % : 12.9712 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.34027 MBytes/sec (DSP CPU % : 13.09 at 44100 Hz) with -vec -lv 0 -g -vs 8

So using -Ofast vs -O3 (default) doesn't really lead to any gain here, everything is within the margin of error.
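As a rough way to put a number on "within the margin of error", the five -scal readings posted above can be averaged and compared against their run-to-run spread. This is only a sketch: the helper names `mean` and `sample_stddev` are made up for illustration, and five runs is a small sample. With these readings the two means differ by about 0.3%, which is less than the sample standard deviation of the default runs.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a set of benchmark readings.
double mean(const std::vector<double>& xs) {
    assert(!xs.empty());
    double sum = 0.0;
    for (double x : xs) sum += x;
    return sum / static_cast<double>(xs.size());
}

// Sample standard deviation (n - 1 in the denominator).
double sample_stddev(const std::vector<double>& xs) {
    assert(xs.size() > 1);
    const double m = mean(xs);
    double acc = 0.0;
    for (double x : xs) acc += (x - m) * (x - m);
    return std::sqrt(acc / static_cast<double>(xs.size() - 1));
}

// -scal throughput in MBytes/sec, copied from the x64 runs posted above.
const std::vector<double> kScalDefault = {4.2968, 4.29377, 4.27541, 4.24756, 4.27696};
const std::vector<double> kScalOfast   = {4.27438, 4.28457, 4.29002, 4.30806, 4.30075};
```

Comparing `mean(kScalOfast) - mean(kScalDefault)` with `sample_stddev(kScalDefault)` gives a quick sanity check before reading anything into a single flag.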

falkTX commented 2 years ago

using -fsingle-precision-constant (this one is actually reliable, as gcc has that option while clang does not)

-scal : 4.2803 MBytes/sec (DSP CPU % : 16.9046 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28916 MBytes/sec (DSP CPU % : 16.1392 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.26237 MBytes/sec (DSP CPU % : 16.5876 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.26309 MBytes/sec (DSP CPU % : 16.3797 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27941 MBytes/sec (DSP CPU % : 16.3108 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.33902 MBytes/sec (DSP CPU % : 13.5098 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.33681 MBytes/sec (DSP CPU % : 13 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.31593 MBytes/sec (DSP CPU % : 13.019 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.32571 MBytes/sec (DSP CPU % : 13.0862 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.33375 MBytes/sec (DSP CPU % : 13.1037 at 44100 Hz) with -vec -lv 1 -vs 8

not much difference here. likely faust is declaring the float vs double variables properly and thus this specific optimization is not needed.

falkTX commented 2 years ago

To make sure I am not testing things that have no benefit, I jumped to run the last test with all the flags. same deal as before. these are the results:

-scal : 4.34399 MBytes/sec (DSP CPU % : 15.9678 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.34678 MBytes/sec (DSP CPU % : 15.9518 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.37782 MBytes/sec (DSP CPU % : 15.9254 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36811 MBytes/sec (DSP CPU % : 15.9093 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36748 MBytes/sec (DSP CPU % : 16.1839 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.48767 MBytes/sec (DSP CPU % : 12.6423 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.49236 MBytes/sec (DSP CPU % : 12.8233 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48057 MBytes/sec (DSP CPU % : 12.6647 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.44528 MBytes/sec (DSP CPU % : 13.0698 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.4862 MBytes/sec (DSP CPU % : 12.7427 at 44100 Hz) with -vec -lv 1 -vs 8

this shows small but definite improvements on average. so some of the flags are not just placebo but are doing something. we just need to find which ones now

falkTX commented 2 years ago

Using prefetch:

-scal : 4.30744 MBytes/sec (DSP CPU % : 16.3677 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36694 MBytes/sec (DSP CPU % : 15.9433 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.35577 MBytes/sec (DSP CPU % : 16.2389 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36447 MBytes/sec (DSP CPU % : 18.2705 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.34609 MBytes/sec (DSP CPU % : 16.1036 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.43295 MBytes/sec (DSP CPU % : 13.4085 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45406 MBytes/sec (DSP CPU % : 12.7562 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45127 MBytes/sec (DSP CPU % : 17.2963 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.44205 MBytes/sec (DSP CPU % : 12.7868 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45081 MBytes/sec (DSP CPU % : 12.6932 at 44100 Hz) with -vec -lv 0 -g -vs 8

Using tree-vectorize:

-scal : 4.26637 MBytes/sec (DSP CPU % : 16.5673 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28826 MBytes/sec (DSP CPU % : 16.2644 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3046 MBytes/sec (DSP CPU % : 16.1537 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3165 MBytes/sec (DSP CPU % : 16.1184 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28343 MBytes/sec (DSP CPU % : 16.6418 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.35331 MBytes/sec (DSP CPU % : 12.9901 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35234 MBytes/sec (DSP CPU % : 14.1635 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32448 MBytes/sec (DSP CPU % : 13.0306 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32559 MBytes/sec (DSP CPU % : 13.0401 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.34056 MBytes/sec (DSP CPU % : 13.2805 at 44100 Hz) with -vec -lv 0 -g -vs 8

sletz commented 2 years ago

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

faustbench-llvm explores a bit more: https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm

falkTX commented 2 years ago

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

faustbench-llvm explores a bit more: https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm

thanks, I saw it but didn't think it was too relevant here. I am not looking just for what faust flags can do, but compiler flags too, some of them unsupported by llvm/clang.

falkTX commented 2 years ago

Using -funroll-loops

-scal : 4.2748 MBytes/sec (DSP CPU % : 16.1979 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3259 MBytes/sec (DSP CPU % : 16.1251 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30736 MBytes/sec (DSP CPU % : 16.1814 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.33969 MBytes/sec (DSP CPU % : 16.0493 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.31645 MBytes/sec (DSP CPU % : 16.0979 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.37112 MBytes/sec (DSP CPU % : 12.9637 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.38857 MBytes/sec (DSP CPU % : 13.0337 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.37493 MBytes/sec (DSP CPU % : 12.9628 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.37612 MBytes/sec (DSP CPU % : 12.9511 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.37365 MBytes/sec (DSP CPU % : 12.8953 at 44100 Hz) with -vec -lv 1 -vs 8

this seems to be one of the good flags :)

falkTX commented 2 years ago

Using -fprefetch-loop-arrays -funroll-loops -funsafe-loop-optimizations combo

-scal : 4.36762 MBytes/sec (DSP CPU % : 16.2816 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3862 MBytes/sec (DSP CPU % : 16.0541 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38827 MBytes/sec (DSP CPU % : 15.8699 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38965 MBytes/sec (DSP CPU % : 15.872 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38993 MBytes/sec (DSP CPU % : 15.7911 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.46754 MBytes/sec (DSP CPU % : 12.6809 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.48121 MBytes/sec (DSP CPU % : 12.8436 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.49011 MBytes/sec (DSP CPU % : 12.6356 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.48365 MBytes/sec (DSP CPU % : 12.7177 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48365 MBytes/sec (DSP CPU % : 12.7177 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48838 MBytes/sec (DSP CPU % : 12.6292 at 44100 Hz) with -vec -lv 0 -g -vs 8

Seems a good combo, but I am not too confident about using -funsafe-loop-optimizations.

Still, with this it seems obvious the best choice is between -vec -lv 0 -g -vs 8 and -vec -lv 1 -vs 8. So we can optimize the benchmarks to pick only these 2, and run more specific ones for the compiler flags. The issue running these before was that I had all the faust modes enabled, so it took a while to build and run.

falkTX commented 2 years ago

I tried the fastmath.cpp stuff, and while performance is better with it, sound is also affected. We end up with a sorta high-pass eq filter on at all times, and random noise bursts. Really unusable.

So after a few more tests here, I decided to go with the -vec -lv 1 -vs 8 faust flags. The -exp10 option seemed promising at first glance so I tried it too, but it gave the best result only about half the time compared to not having it, so it seems to do little or nothing at all.

faust crashes when using those flags on the macOS CI machine though :( so I pregenerated the plugin C++ files and added them to the repo. There are no details to debug, but build issue can be seen in https://github.com/trummerschlunk/master_me/runs/8129699558?check_suite_focus=true

sletz commented 2 years ago

Not clear at all what happens...

falkTX commented 2 years ago

just a crash/segfault. I can't reproduce it locally, so it is hard to investigate. could be related to limited RAM on the github VMs too, as macOS will kill processes that hog ram/cpu too much.

x42 commented 2 years ago

@sletz is there a way to separate interpolation of coefficients into a dedicated function and only call it every e.g. 48 samples?

e.g. gain si.smoo is perfectly fine to do small jumps at ~ 20 Hz intervals. Likewise many other coefficients that use pow() could only be updated lazily.
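To make the idea concrete, here is a minimal hand-written sketch (not Faust output, and not an existing Faust option) of a gain stage whose smoothed coefficient is refreshed only every `interval` samples. The struct `LazyGain`, its field names and the smoothing factor are all hypothetical; the one-pole update is only in the spirit of si.smoo. The block is processed in chunks so the inner audio loop stays free of conditionals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical gain stage: the smoothed gain is recomputed at control rate,
// once every `interval` samples, instead of once per sample.
struct LazyGain {
    float target = 1.0f;         // value requested by the control input
    float current = 1.0f;        // smoothed value applied to the audio
    float pole = 0.9f;           // one-pole factor per update (assumed value)
    std::size_t interval = 48;   // samples between coefficient updates
    std::size_t countdown = 0;   // samples left until the next update
    std::size_t updates = 0;     // update counter, for inspection only

    void process(const float* in, float* out, std::size_t n) {
        assert(interval > 0);
        std::size_t i = 0;
        while (i < n) {
            if (countdown == 0) {
                // Control-rate work: one smoothing step per interval.
                current = pole * current + (1.0f - pole) * target;
                countdown = interval;
                ++updates;
            }
            // Audio-rate work: a tight loop with a constant coefficient.
            const std::size_t chunk = std::min(countdown, n - i);
            for (std::size_t k = 0; k < chunk; ++k) {
                out[i + k] = in[i + k] * current;
            }
            i += chunk;
            countdown -= chunk;
        }
    }
};
```

The countdown is carried across calls, so the update interval does not depend on the host block size.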

Is it possible to separate metering from the main DSP? Input meters can run before doing any processing, and output meters after. This is often done to avoid conditionals in the inner loop and make use of L1/L2 caches.
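A small sketch of what such a split could look like when written by hand. The names `peak_level`, `process_dsp`, `Meters` and `run_block` are hypothetical, and the fixed gain is only a stand-in for the real processing; the point is that each meter gets its own pass over the buffer and the DSP loop carries no metering code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Peak level of a buffer, computed in its own tight loop.
float peak_level(const float* buf, std::size_t n) {
    float peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float a = std::fabs(buf[i]);
        if (a > peak) peak = a;
    }
    return peak;
}

// Stand-in for the real DSP: a fixed gain, purely for illustration.
void process_dsp(const float* in, float* out, std::size_t n, float gain) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;
    }
}

struct Meters {
    float in_peak;
    float out_peak;
};

// Input meter runs before processing, output meter after it.
Meters run_block(const float* in, float* out, std::size_t n, float gain) {
    assert(in != nullptr && out != nullptr);
    Meters m{};
    m.in_peak = peak_level(in, n);
    process_dsp(in, out, n, gain);
    m.out_peak = peak_level(out, n);
    return m;
}
```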

Then there are lines like

          fZec582[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec581[i]));
          fZec583[i] = std::sqrt(fZec582[i]);

The sqrt can be avoided by making a single call to pow (10, 1/2 ...). Then you can even use `exp (a * log(b)) = pow (b, a)`; b is constant here and can be evaluated at compile-time. `exp()` is significantly faster than `pow()`.
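A minimal sketch of the three equivalent forms, using the constants from the quoted line. The function names are hypothetical, and folding the sqrt away only pays off when the intermediate pow() result is not needed anywhere else.

```cpp
#include <cassert>
#include <cmath>

// Shape of the generated code quoted above: pow() followed by sqrt().
float gain_pow_sqrt(float x) {
    const float p = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * x));
    return std::sqrt(p);
}

// Square root folded into the exponent: a single pow() call.
float gain_single_pow(float x) {
    return std::pow(10.0f, 0.5f * 0.0250000004f * (0.0f - 0.333333343f * x));
}

// Same value via exp(): 10^y == exp(y * ln(10)), with ln(10) as a constant.
float gain_exp(float x) {
    const float ln10 = 2.302585093f;  // natural logarithm of 10
    return std::exp(ln10 * 0.5f * 0.0250000004f * (0.0f - 0.333333343f * x));
}
```

The results agree up to float rounding, so any comparison between the variants should use a small relative tolerance rather than exact equality.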

sletz commented 2 years ago

Read this for general info on optimisations

x42 commented 2 years ago

Read this for general info on optimisations

Yes, but can control-rate updates be done at a given interval in samples (or time unit) rather than being considered constant during the block?

Is there a way to merge maths operations like sqrt (pow()) into a single call to exp() or is that left to the user?

sletz commented 2 years ago

Read this for general info on optimisations

Yes, but can control-rate updates be done at a given interval in samples (or time unit) rather than being considered constant during the block?

Not easily for now. But we could add an option to separate the control code (done in the compute function before the actual DSP loop) in a separated "control" function, to be called when needed by the architecture file.

Is there a way to merge maths operations like sqrt (pow()) into a single call to exp() or is that left to the user?

The compiler currently implements some simplifications like exp(log(x)) = x. We could add some more. Can you list which one would be the more useful?

x42 commented 2 years ago

Can you list which one would be the more useful?

sqrt(pow(a, b)) -- move the sqrt into the exponent: pow (a, b * 0.5f)
exp (a * log(b)) = pow (b, a) -- if b is constant (e.g. 10), use `constexpr float l10 = log (10); return exp (a * l10);`

Oddly enough, in the generated C++ code there are many of those constructs, while in the .dsp file there are only 2 calls to sqrt, for the correlation meter. Where does the 10^x come from? A partial list (from faust -vec -lv 0 -g -vs 8 master_me.dsp using c03ce7d1e6cbff58):

          fZec275[i] = std::pow(10.0f, 0.00833333377f * fZec254[i]);
          fZec276[i] = std::sqrt(fZec275[i]);
...
          fZec308[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec307[i]));
          fZec309[i] = std::sqrt(fZec308[i]);
...
          fZec328[i] = std::pow(10.0f, 0.00833333377f * fZec307[i]);
          fZec329[i] = std::sqrt(fZec328[i]);
...
          fZec361[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec360[i]));
          fZec362[i] = std::sqrt(fZec361[i]);
...
          fZec381[i] = std::pow(10.0f, 0.00833333377f * fZec360[i]);
          fZec382[i] = std::sqrt(fZec381[i]);

Another example is

          fZec711[i] = fSlow99 / fZec709[i];
          fZec712[i] = fSlow99 * (fZec711[i] + 2.0f) / fZec709[i] + 1.0f;

In this case (fSlow99 / fZec709[i]) would only need to be computed once, so one division can be saved. There are many such instances in the compute loop (I stopped counting at 20). Saving 20+ divisions per sample is already a lot.

Then there's code from e.g. "Vectorizable loop 123"

fZec41[i] = fSlow35 * (fSlow35 / fZec39[i] + 1.42857146f) / fZec39[i] + 1.0f;

I doubt that this can be vectorized: two divisions, two sums and a multiplication. What intrinsic function provides for this? This can be expanded to:

 float const tmp = fSlow35 / fZec39[i];
 fZec41[i] = tmp * tmp  + 1.42857146f * tmp  + 1.f;

All in all the generated code is however pretty impressive. Writing a complex project like master_me directly in (hand optimized) C would be a significantly more complex task, and it would certainly not be as easy to have it evolve.

sletz commented 2 years ago

fZec275[i] = std::pow(10.0f, 0.00833333377f * fZec254[i]);
fZec276[i] = std::sqrt(fZec275[i]);

Well, here fZec275[i] is used several times in the code later on (which is the default behaviour: shared sub-expressions are computed once in a variable, then reused), so not sure it will help...

fZec711[i] = fSlow99 / fZec709[i];

I don't see the pattern here, where is the saved division?

"Vectorizable loop 123"

When the compiler writes this, it means no recursive dependency exists in the loop. Then we assume the auto-vectoriser can do something efficient here. Have you checked what the compiler produces here? Is your version better?

x42 commented 2 years ago

Well here fZec275[i] is used several times in the code later on

Indeed. I missed that. It is used to update the SVF coefficients every sample. That's a different story.

I don't see the pattern here, where is the saved division?

fZec711[i] = fSlow99 / fZec709[i];
fZec712[i] = fSlow99 * (fZec711[i] + 2.0f) / fZec709[i] + 1.0f;

can be written as

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (fZec711[i] + 2.0f) + 1.0f;

"Vectorizable loop 123"

Is your version better?

Apparently so: https://godbolt.org/z/r88dszdzP It needs only 2 registers (not 3), and performs only one division.

On most CPUs divss is 4-5 times slower than mulss (which usually takes only 1 or 2 cycles). Apple's M1 (fdiv) is a notable exception.

sletz commented 2 years ago

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (fZec711[i] + 2.0f) + 1.0f;

OK it seems some optimisations are missed here. Possibly in the way we compute the signals "normal form" where + and * operations are supposed to be sorted in an optimal way. @orlarey any idea here?

Is your version better?

With auto-vectorisation we could expect SIMD operations to be used, so we should probably compare complete loops.

x42 commented 2 years ago

I missed one substitution:

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (tmp + 2.0f) + 1.0f;

With auto-vectorisation we could expect SIMD operations to be used

Perhaps with AVX or FMA, but there's no SSE intrinsic that would work here; but even then multiplication is faster.

x42 commented 2 years ago

FMA (Fused Multiply-Add) can be used with the refactored code: https://godbolt.org/z/hraMPEY9f but it's still not SIMD.

Edit: gcc can vectorize it https://godbolt.org/z/x8466WPMY

orlarey commented 2 years ago

In principle, one can safely use pow() and let the compiler do the optimization (with -ffast-math) as in these examples: https://godbolt.org/z/sdvxor3r3.

Concerning the division not factorized, we will have to see if we can solve the problem by improving the normal form.

Concerning the expressions in bargraphs, we need to improve the type system to take into account the case of a control rate expression built on top of a sample rate expression.

galileo-pkm commented 1 year ago

Not sure if this is the proper place but just running master_me disabled, and no in/out connected, the DSP load in Carla goes from 3% to around 50%.

x42 commented 1 year ago

Sounds like a denormal issue (https://en.wikipedia.org/wiki/Subnormal_number#Performance_issues)

Could be avoided either via compiler options, or by adding a tiny number to the input (of each stage).
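A minimal sketch of the "tiny number" variant, assuming a plain one-pole feedback stage fed with silence (the function names and the 1e-20 offset are illustrative choices, not values taken from master_me). Without the offset the state decays into the subnormal range; with it, the state settles at a tiny but normal value, around -400 dB and therefore far below anything audible. This assumes the build does not already enable flush-to-zero, in which case the first value would read as zero instead.

```cpp
#include <cassert>
#include <cmath>

// One-pole feedback y = coeff * y + input, where the input is silence
// plus an optional tiny constant offset.
float run_decay(float start, float coeff, int steps, float anti_denormal) {
    assert(steps >= 0);
    float y = start;
    for (int i = 0; i < steps; ++i) {
        y = coeff * y + anti_denormal;
    }
    return y;
}

// True when the value is a subnormal (denormal) float.
bool is_subnormal(float v) {
    return std::fpclassify(v) == FP_SUBNORMAL;
}
```

For example, `run_decay(1.0f, 0.5f, 130, 0.0f)` ends up subnormal under default IEEE behaviour, while `run_decay(1.0f, 0.5f, 130, 1e-20f)` stays normal.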

sletz commented 1 year ago

This code can be used: https://github.com/grame-cncm/faust/blob/master-dev/architecture/faust/dsp/dsp.h#L236, so adding the AVOIDDENORMALS macro before the call to compute.

falkTX commented 1 year ago

issue should be pushed to carla. I intentionally set up the audio threads so that denormals are not a thing. if they appear, something is wrong..

galileo-pkm commented 1 year ago

I don't see how that would be related to Carla as master_me is running standalone.

falkTX commented 1 year ago

ah, you mentioned carla in your post, so it led me to think it was running there.

falkTX commented 1 year ago

Denormal issue likely fixed in f9992cd4616479020de28522a1369dee6545f63d, as I just pushed https://github.com/DISTRHO/DPF/commit/48eb45016b67547b02d2ac644cd2a147da7cf7b9 to DPF side.

trummerschlunk / master_me

very high cpu usage #66