biquad filter appears to be slow

xyzzy42 commented 4 years ago

I ported some code to use kfr::biquad_filter and it ended up slower. To find out whether something else was causing this, I wrote a simple test program that benchmarks kfr::biquad_filter against the simplest plain C implementation I could think of. Benchmarking with both gcc 8.3.1 and gcc 9.2.1 on a Ryzen Zen 2 and an Intel Kaby Lake, the KFR code runs at about 50% to 75% of the speed of the plain C.

Test program attached: kfrtest.cpp.txt
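
The attachment is not reproduced in this thread. For readers without it, here is a minimal sketch of what a "simplest plain C" biquad loop could look like, assuming a single second-order section in transposed Direct Form II with coefficients normalised so that a0 = 1. The names BiquadCoeffs, BiquadState and biquad_process are hypothetical and are not taken from kfrtest.cpp or from KFR.

```cpp
#include <cstddef>

// Hypothetical coefficient layout for one second-order section (a0 == 1 assumed).
struct BiquadCoeffs {
    double b0, b1, b2; // feed-forward (numerator) coefficients
    double a1, a2;     // feedback (denominator) coefficients
};

// Two delay elements carried between calls.
struct BiquadState {
    double s1 = 0.0;
    double s2 = 0.0;
};

// Transposed Direct Form II:
//   y[n] = b0*x[n] + s1
//   s1   = b1*x[n] - a1*y[n] + s2
//   s2   = b2*x[n] - a2*y[n]
inline void biquad_process(const BiquadCoeffs& c, BiquadState& st,
                           const double* in, double* out, std::size_t n)
{
    double s1 = st.s1;
    double s2 = st.s2;
    for (std::size_t i = 0; i < n; ++i) {
        const double x = in[i];
        const double y = c.b0 * x + s1;
        s1 = c.b1 * x - c.a1 * y + s2;
        s2 = c.b2 * x - c.a2 * y;
        out[i] = y;
    }
    // Save the state so consecutive buffers are filtered continuously.
    st.s1 = s1;
    st.s2 = s2;
}
```

Each output sample depends on the previous one through s1 and s2, so a loop like this cannot be vectorised across samples in a straightforward way. That may be part of why compiler and flag choices have such a visible effect on the timings.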

dancazarin commented 4 years ago

Could you check performance with the latest clang? It's the recommended compiler for KFR, and all performance measurements published here were made using clang. GCC/MSVC performance is improving but can be worse than clang's.

Could you also publish your results? What sizes did you check?

slarew commented 4 years ago

I can confirm GCC is showing poor performance for the posted kfrtest.cpp test.

$ g++-8 -Wall -O3 -fomit-frame-pointer -march=native -I kfr/include -std=gnu++17 kfrtest.cpp
$ ./a.out 0 100000
Algorithm 0, 100000 samples, 3201 iterations
Total time 1212.814 ms
$ ./a.out 1 100000
Algorithm 1, 100000 samples, 3201 iterations
Total time 1493.064 ms
$ clang++-8 -Wall -O3 -fomit-frame-pointer -march=native -I kfr/include -std=gnu++17 kfrtest.cpp
$ ./a.out 0 100000
Algorithm 0, 100000 samples, 3201 iterations
Total time 1157.509 ms
$ ./a.out 1 100000
Algorithm 1, 100000 samples, 3201 iterations
Total time 906.989 ms

$ g++-8 --version
g++ (Ubuntu 8.3.0-6ubuntu1~18.04.1) 8.3.0

$ clang++-8 --version
clang version 8.0.1- (branches/release_80)
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

$ lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               158
Model name:          Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
Stepping:            10
CPU MHz:             800.009
CPU max MHz:         5000.0000
CPU min MHz:         800.0000
BogoMIPS:            7399.70
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            12288K
NUMA node0 CPU(s):   0-11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d

xyzzy42 commented 4 years ago

I've tried buffer sizes of about 96,000 samples, but once the buffer is over a few thousand samples it makes little difference in run time.

I compared clang 9.0.0 and gcc 9.2.1 on an i7-7700K and a Ryzen 3700X, using -O3 and -march=skylake/znver2, with and without -ffast-math. The architecture compiled for, skylake vs znver2, made no significant difference. Clang was much faster with fast-math, while gcc showed only a small effect. With gcc, the plain C code was significantly faster than KFR. With clang and fast-math enabled, the C code was still about 15% faster than KFR, and the fastest overall. With clang and without fast-math, KFR is faster than C.

compiler  arch     flags     C   KFR
clang900  skylake         1233   925
clang900  skylake  fm      598   690
gcc921    skylake         1238  1532
gcc921    skylake  fm     1151  1531

clang900  znver2           880   747
clang900  znver2   fm      695   872
gcc921    znver2          1180  1951
gcc921    znver2   fm     1180  1935

(fm = -ffast-math)
