tim-janik / beast

Beast - Music Synthesizer and Composer
GNU Lesser General Public License v2.1
TESTS: testresampler: fix resampler tests for clang9 #140

Closed: swesterfeld closed this 4 years ago

swesterfeld commented 4 years ago

Relax expected accuracy for 24-bit subsampling from 126 to 124.5 dB, to fix testresampler for clang9.

This should fix #139.

tim-janik commented 4 years ago

Did you find out why this is needed? I.e. peeked at the generated assembly to figure if -ffast-math related options possibly allow transformations that could become problematic for us in the long term?

swesterfeld commented 4 years ago

Short answer: I spent some time debugging this, and -mfma is the cause. Removing the flag restores the old behaviour; with it, clang uses fused multiply-add instructions in the inner loop of the resampler. I don't believe this optimization has a negative impact for us, so adjusting the threshold is the sane thing to do here.

Long answer:

Let's first look at what exactly is failing here. We have (in explicit form):

$ out/tests/suite1 --resampler accuracy --fpu --precision=24 --subsample --freq-scan=90,9000,983 --freq-scan-verbose --verbose

############## clang9 without fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.45954766716391759
1073.00000000000000000 -130.16677118621706200
2056.00000000000000000 -129.70751203525446726
3039.00000000000000000 -132.36528831269109219
4022.00000000000000000 -128.39849621726884266
5005.00000000000000000 -128.95512230052295877
5988.00000000000000000 -131.49506213647262598
6971.00000000000000000 -131.66641240173382243
7954.00000000000000000 -131.26927901299603718
8937.00000000000000000 -134.06944494072371299
#   max difference between correct and computed output: 0.000000 = -128.398496 dB

############## clang9 with fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.87588683579858184
1073.00000000000000000 -126.11881934685534645
2056.00000000000000000 -128.02398873831205606
3039.00000000000000000 -128.56669380739046460
4022.00000000000000000 -124.92935085933235939
5005.00000000000000000 -128.10076816664792432
5988.00000000000000000 -128.49709000891502342
6971.00000000000000000 -127.39047422816582866
7954.00000000000000000 -127.65962334132801459
8937.00000000000000000 -130.40537251311835121
#   max difference between correct and computed output: 0.000001 = -124.929351 dB
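As a reading aid for the listings above: the printed dB values are consistent with the usual amplitude convention, 20 * log10 of the maximum absolute difference relative to a full scale of 1.0. That convention is an assumption here, since the test source is not shown in this thread, and the helper names below are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Assumed convention: dB = 20 * log10 (amplitude), with full scale = 1.0.
// These helpers are illustrative and not part of the BEAST test suite.
double amplitude_to_db (double amplitude)
{
  assert (amplitude > 0.0);           // log10 is only meaningful for positive input
  return 20.0 * std::log10 (amplitude);
}

// Inverse mapping: turn a dB figure back into a linear amplitude.
double db_to_amplitude (double db)
{
  return std::pow (10.0, db / 20.0);
}
```

Under this convention -124.929351 dB is roughly 5.7e-7, which rounds to the printed 0.000001, while -128.398496 dB is roughly 3.8e-7, which rounds to the printed 0.000000.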

So we're testing 24-bit downsampling followed by upsampling with FPU instructions here. The 4022 Hz sine wave performs worse with fma. Note that 24-bit is the most problematic test we have, since the accuracy of the floating point computations is not really good enough to reliably evaluate the convolution with the large FIR filter. Anyway, if we look at the source and assembly of the FPU code, we'll see the difference:

Source Code:

template<class Accumulator> static inline Accumulator
fir_process_one_sample (const float *input,
                        const float *taps, /* [0..order-1] */
                        const uint   order)
{
  Accumulator out = 0;
  for (uint i = 0; i < order; i++)
    out += input[i] * taps[i];
  return out;
}

Both assembly dumps use loop unrolling, so I'm truncating the assembly code. Also I only show the upsampling step here, but downsampling looks the same.

Code generated with clang9 & -mfma

0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
   0:   48 8b 47 08             mov    0x8(%rdi),%rax
   4:   c5 fa 10 06             vmovss (%rsi),%xmm0
   8:   c5 fa 10 4e 04          vmovss 0x4(%rsi),%xmm1
   d:   c5 f2 59 48 04          vmulss 0x4(%rax),%xmm1,%xmm1
  12:   c4 e2 79 b9 08          vfmadd231ss (%rax),%xmm0,%xmm1
  17:   c5 fa 10 46 08          vmovss 0x8(%rsi),%xmm0
  1c:   c4 e2 71 99 40 08       vfmadd132ss 0x8(%rax),%xmm1,%xmm0
  22:   c5 fa 10 4e 0c          vmovss 0xc(%rsi),%xmm1
  27:   c4 e2 79 99 48 0c       vfmadd132ss 0xc(%rax),%xmm0,%xmm1
  2d:   c5 fa 10 46 10          vmovss 0x10(%rsi),%xmm0
  32:   c4 e2 71 99 40 10       vfmadd132ss 0x10(%rax),%xmm1,%xmm0
  38:   c5 fa 10 4e 14          vmovss 0x14(%rsi),%xmm1
  3d:   c4 e2 79 99 48 14       vfmadd132ss 0x14(%rax),%xmm0,%xmm1
  43:   c5 fa 10 46 18          vmovss 0x18(%rsi),%xmm0
  48:   c4 e2 71 99 40 18       vfmadd132ss 0x18(%rax),%xmm1,%xmm0
...

Code generated with clang9 without -mfma

0000000000000000 <Bse::Resampler2::Upsampler2<52u, true>::process_sample_unaligned(float const*, float*)>:
   0:   48 8b 47 08             mov    0x8(%rdi),%rax
   4:   c5 fa 10 06             vmovss (%rsi),%xmm0
   8:   c5 fa 10 4e 04          vmovss 0x4(%rsi),%xmm1
   d:   c5 fa 59 00             vmulss (%rax),%xmm0,%xmm0
  11:   c5 f2 59 48 04          vmulss 0x4(%rax),%xmm1,%xmm1
  16:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  1a:   c5 fa 10 4e 08          vmovss 0x8(%rsi),%xmm1
  1f:   c5 f2 59 48 08          vmulss 0x8(%rax),%xmm1,%xmm1
  24:   c5 fa 10 56 0c          vmovss 0xc(%rsi),%xmm2
  29:   c5 ea 59 50 0c          vmulss 0xc(%rax),%xmm2,%xmm2
  2e:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  32:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  36:   c5 fa 10 4e 10          vmovss 0x10(%rsi),%xmm1
  3b:   c5 f2 59 48 10          vmulss 0x10(%rax),%xmm1,%xmm1
  40:   c5 fa 10 56 14          vmovss 0x14(%rsi),%xmm2
  45:   c5 ea 59 50 14          vmulss 0x14(%rax),%xmm2,%xmm2
  4a:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  4e:   c5 fa 10 56 18          vmovss 0x18(%rsi),%xmm2
  53:   c5 ea 59 50 18          vmulss 0x18(%rax),%xmm2,%xmm2
  58:   c5 f2 58 ca             vaddss %xmm2,%xmm1,%xmm1
  5c:   c5 fa 58 c1             vaddss %xmm1,%xmm0,%xmm0
  60:   c5 fa 10 4e 1c          vmovss 0x1c(%rsi),%xmm1
  65:   c5 f2 59 48 1c          vmulss 0x1c(%rax),%xmm1,%xmm1
...

So the difference here is that without -mfma we multiply and add in two steps, rounding to float precision after each step.

With -mfma we multiply and add in one step (with "infinite resolution") and round to float precision only once, at the end.

This means that we get a different result. In this particular case the -mfma code performs worse than the version using individual multiply/add instructions, but that is probably chance: a "different result" can be better or worse, depending on how the somewhat random errors caused by the limited precision of floating point math add up. Both appear to be valid resamplers, and both appear to be permitted translations of C++ to assembly code.
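The rounding difference can be reproduced in isolation. The sketch below mirrors the loop of fir_process_one_sample in two variants; the names fir_separate and fir_fused are hypothetical and this is not the BEAST code. The volatile temporary is only there to keep the compiler from contracting the multiply and add into an FMA on its own, and std::fma requests the single-rounding behaviour explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Two-step variant: the product is rounded to float, then the sum is rounded again.
float fir_separate (const float *input, const float *taps, std::size_t order)
{
  assert (input && taps);
  float out = 0;
  for (std::size_t i = 0; i < order; i++)
    {
      volatile float product = input[i] * taps[i];  // forces rounding of the product
      out = out + product;
    }
  return out;
}

// Fused variant: product and sum are computed exactly, then rounded once.
float fir_fused (const float *input, const float *taps, std::size_t order)
{
  assert (input && taps);
  float out = 0;
  for (std::size_t i = 0; i < order; i++)
    out = std::fma (input[i], taps[i], out);
  return out;
}
```

For input { -(1 + 2^-11), 1 + 2^-12 } and taps { 1, 1 + 2^-12 }, the two-step variant returns 0 while the fused variant returns 2^-24: the exact product 1 + 2^-11 + 2^-24 loses its last bit when it is rounded to float before the add.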

How does it perform?

Finally, just for fun, let's benchmark things.

clang9 & -mfma:

$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
  total samples processed = 64000000
  processing_time = 1.333207
  samples / second = 48004552.319213
  which means the resampler can process 1088.54 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.091866 % CPU usage

clang9 without -mfma:

$ out/tests/suite1 --resampler perf --fpu --precision=24 --subsample
performance test for factor 2 subsampling using FPU instructions
  total samples processed = 64000000
  processing_time = 0.792904
  samples / second = 80715936.375136
  which means the resampler can process 1830.29 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.054636 % CPU usage

clang9 with -mfma with SSE:

$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
  total samples processed = 64000000
  processing_time = 0.449175
  samples / second = 142483403.990601
  which means the resampler can process 3230.92 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.030951 % CPU usage

clang9 without -mfma with SSE:

$ out/tests/suite1 --resampler perf --precision=24 --subsample
performance test for factor 2 subsampling using SSE instructions
  total samples processed = 64000000
  processing_time = 0.344894
  samples / second = 185564296.725403
  which means the resampler can process 4207.81 44100 Hz streams simultaneusly
  or one 44100 Hz stream takes 0.023765 % CPU usage

Remarks:

SSE is always faster than the FPU implementation. The best throughput is provided by the SSE version without -mfma.

On my Ryzen-7 machine, the "optimizations" that clang9 does with -mfma make the code slower. I'd have assumed that reducing the instruction count would have a positive effect here. However, maybe mulss and addss are faster because they don't require "infinite precision", so they are cheaper to implement on the CPU.

-mfma also affects the SSE numbers. A quick investigation with perf showed that this is because the test code intentionally resamples non-SSE-aligned memory, so the unaligned parts need to be computed using the FPU (where -mfma has an effect).
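One other possible reading of the FPU numbers, not verified by measurement in this thread: in the -mfma listing every vfmadd takes the previous instruction's result as its addend, so the loop forms a single dependency chain, while the listing without -mfma computes the products independently and adds them in pairs. If that chain is the limiting factor, splitting the sum across independent accumulators would shorten it. The sketch below shows the idea with two accumulators; fir_two_chains is a hypothetical name, and whether it actually helps on a given CPU would have to be benchmarked.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// FIR dot product with two independent FMA chains (even and odd taps).
// Illustrative only; the result can differ in the last bits from a single chain.
float fir_two_chains (const float *input, const float *taps, std::size_t order)
{
  assert (input && taps);
  float even = 0, odd = 0;
  std::size_t i = 0;
  for (; i + 1 < order; i += 2)
    {
      even = std::fma (input[i], taps[i], even);         // chain 1
      odd = std::fma (input[i + 1], taps[i + 1], odd);   // chain 2, independent of chain 1
    }
  if (i < order)                                         // leftover tap for odd orders
    even = std::fma (input[i], taps[i], even);
  return even + odd;
}
```

Changing the accumulation order changes the rounding as well, so the accuracy test would need to be rerun for any such variant.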

swesterfeld commented 4 years ago

As additional information, g++-8 also generates different code if you use -mfma.

$ out/tests/suite1 --resampler accuracy --fpu --precision=24 --subsample --freq-scan=90,9000,983 --freq-scan-verbose --verbose

############## g++ 8.3.0 without fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.40792803150048940
1073.00000000000000000 -130.14152510098830362
2056.00000000000000000 -131.64395956575623359
3039.00000000000000000 -131.48377752074048885
4022.00000000000000000 -127.52623479658792860
5005.00000000000000000 -131.74602243860934436
5988.00000000000000000 -131.71784952867952256
6971.00000000000000000 -133.09696883539237433
7954.00000000000000000 -131.53986659892970579
8937.00000000000000000 -131.78909542900839824
#   max difference between correct and computed output: 0.000000 = -127.526235 dB

############## g++ 8.3.0 with fma

# accuracy test for factor 2 subsampling using FPU instructions
#   input frequency range used [ 90.00 Hz, 9000.00 Hz ] (SR = 44100.0 Hz, freq increment = 983.00)
90.00000000000000000 -129.82104550124302023
1073.00000000000000000 -131.03707434810391419
2056.00000000000000000 -132.40996614732904391
3039.00000000000000000 -133.12443453736065635
4022.00000000000000000 -129.32954166513573568
5005.00000000000000000 -131.90536134591866357
5988.00000000000000000 -130.93045250711062977
6971.00000000000000000 -132.01135805265508338
7954.00000000000000000 -132.97021680274394839
8937.00000000000000000 -133.11067142912733630
#   max difference between correct and computed output: 0.000000 = -129.329542 dB

So, unlike with clang9, the g++ fma version gives slightly more accurate results for our 4022 Hz sine wave.

My mental model is this: we start from the error the filter gives us under perfect conditions (infinite precision coefficients and arithmetic). Then we add some random error (with some distribution) that depends on the type and order of the instructions generated by the compiler, due to finite precision. From everything I saw, both clang9 and g++ 8.3.0 provide valid translations with and without -mfma.
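The "order of the instructions" part of this model can be shown with a minimal sketch: summing the same three floats in two different orders gives two different results. The function names are hypothetical, and the volatile accumulator is only there to force rounding to float after every addition, regardless of compiler flags.

```cpp
#include <cassert>
#include <cstddef>

// Sum values front to back, rounding to float after each addition.
float sum_forward (const float *values, std::size_t n)
{
  assert (values || n == 0);
  volatile float acc = 0;
  for (std::size_t i = 0; i < n; i++)
    acc = acc + values[i];
  return acc;
}

// Sum the same values back to front.
float sum_reverse (const float *values, std::size_t n)
{
  assert (values || n == 0);
  volatile float acc = 0;
  for (std::size_t i = n; i > 0; i--)
    acc = acc + values[i - 1];
  return acc;
}
```

For the values { 1e8f, -1e8f, 1.0f } the forward sum is 1 and the reverse sum is 0, because 1 - 1e8 is not representable as a float and rounds back to -1e8. Neither order is wrong as a translation of the arithmetic; they simply accumulate different rounding errors.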

tim-janik commented 4 years ago

Great! Thanks a lot for the detailed analysis.