giaf / blasfeo

Basic linear algebra subroutines for embedded optimization

Unpredictable execution time of BLASFEO functions caused by uninitialized padding elements of the matrix #103

Closed mkatliar closed 5 years ago

mkatliar commented 5 years ago

Attached is the source code of a benchmark which measures the execution time of the gemm BLAS routine, as implemented in BLASFEO and in a BLAS library of choice, for different matrix sizes.

gemm-benchmark.zip

The benchmark uses the Google Benchmark library.

Running the benchmark with the following command line

./bench --benchmark_repetitions=5

gives the following output:

2019-09-18 13:08:50
Running ./bench
Run on (4 X 3200 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 6144K (x1)
Load Average: 1.86, 1.50, 1.18
--------------------------------------------------------------------------
Benchmark                                Time             CPU   Iterations
--------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2                  277 ns          277 ns      2521364
BM_gemm_blasfeo/2/2/2                  277 ns          277 ns      2521364
BM_gemm_blasfeo/2/2/2                  277 ns          277 ns      2521364
BM_gemm_blasfeo/2/2/2                  320 ns          320 ns      2521364
BM_gemm_blasfeo/2/2/2                  320 ns          320 ns      2521364
BM_gemm_blasfeo/2/2/2_mean             294 ns          294 ns            5
BM_gemm_blasfeo/2/2/2_median           277 ns          277 ns            5
BM_gemm_blasfeo/2/2/2_stddev          23.5 ns         23.5 ns            5
BM_gemm_blasfeo/3/3/3                 2143 ns         2143 ns       319712
BM_gemm_blasfeo/3/3/3                 2143 ns         2143 ns       319712
BM_gemm_blasfeo/3/3/3                 2142 ns         2142 ns       319712
BM_gemm_blasfeo/3/3/3                 2228 ns         2228 ns       319712
BM_gemm_blasfeo/3/3/3                 2228 ns         2228 ns       319712
BM_gemm_blasfeo/3/3/3_mean            2177 ns         2177 ns            5
BM_gemm_blasfeo/3/3/3_median          2143 ns         2143 ns            5
BM_gemm_blasfeo/3/3/3_stddev          46.6 ns         46.6 ns            5
BM_gemm_blasfeo/5/5/5                11403 ns        11403 ns        61176
BM_gemm_blasfeo/5/5/5                11402 ns        11402 ns        61176
BM_gemm_blasfeo/5/5/5                11402 ns        11402 ns        61176
BM_gemm_blasfeo/5/5/5                 2673 ns         2672 ns        61176
BM_gemm_blasfeo/5/5/5                 2673 ns         2673 ns        61176
BM_gemm_blasfeo/5/5/5_mean            7911 ns         7910 ns            5
BM_gemm_blasfeo/5/5/5_median         11402 ns        11402 ns            5
BM_gemm_blasfeo/5/5/5_stddev          4781 ns         4781 ns            5
BM_gemm_blasfeo/10/10/10             10092 ns        10092 ns        68876
BM_gemm_blasfeo/10/10/10             10093 ns        10093 ns        68876
BM_gemm_blasfeo/10/10/10             10092 ns        10092 ns        68876
BM_gemm_blasfeo/10/10/10              9707 ns         9707 ns        68876
BM_gemm_blasfeo/10/10/10              9707 ns         9706 ns        68876
BM_gemm_blasfeo/10/10/10_mean         9938 ns         9938 ns            5
BM_gemm_blasfeo/10/10/10_median      10092 ns        10092 ns            5
BM_gemm_blasfeo/10/10/10_stddev        211 ns          211 ns            5
BM_gemm_blasfeo/20/20/20              1078 ns         1078 ns       639117
BM_gemm_blasfeo/20/20/20              1078 ns         1078 ns       639117
BM_gemm_blasfeo/20/20/20              1078 ns         1078 ns       639117
BM_gemm_blasfeo/20/20/20              1066 ns         1066 ns       639117
BM_gemm_blasfeo/20/20/20              1067 ns         1067 ns       639117
BM_gemm_blasfeo/20/20/20_mean         1074 ns         1074 ns            5
BM_gemm_blasfeo/20/20/20_median       1078 ns         1078 ns            5
BM_gemm_blasfeo/20/20/20_stddev       6.34 ns         6.34 ns            5
BM_gemm_blasfeo/30/30/30              2594 ns         2594 ns       268109
BM_gemm_blasfeo/30/30/30              2595 ns         2595 ns       268109
BM_gemm_blasfeo/30/30/30              2594 ns         2594 ns       268109
BM_gemm_blasfeo/30/30/30              2595 ns         2595 ns       268109
BM_gemm_blasfeo/30/30/30              2595 ns         2595 ns       268109
BM_gemm_blasfeo/30/30/30_mean         2594 ns         2594 ns            5
BM_gemm_blasfeo/30/30/30_median       2595 ns         2595 ns            5
BM_gemm_blasfeo/30/30/30_stddev      0.340 ns        0.372 ns            5
BM_gemm_cblas/2/2/2                    235 ns          235 ns      2972773
BM_gemm_cblas/2/2/2                    235 ns          235 ns      2972773
BM_gemm_cblas/2/2/2                    235 ns          235 ns      2972773
BM_gemm_cblas/2/2/2                    235 ns          235 ns      2972773
BM_gemm_cblas/2/2/2                    235 ns          235 ns      2972773
BM_gemm_cblas/2/2/2_mean               235 ns          235 ns            5
BM_gemm_cblas/2/2/2_median             235 ns          235 ns            5
BM_gemm_cblas/2/2/2_stddev           0.011 ns        0.011 ns            5
BM_gemm_cblas/3/3/3                    392 ns          392 ns      1786397
BM_gemm_cblas/3/3/3                    392 ns          392 ns      1786397
BM_gemm_cblas/3/3/3                    392 ns          392 ns      1786397
BM_gemm_cblas/3/3/3                    392 ns          392 ns      1786397
BM_gemm_cblas/3/3/3                    392 ns          392 ns      1786397
BM_gemm_cblas/3/3/3_mean               392 ns          392 ns            5
BM_gemm_cblas/3/3/3_median             392 ns          392 ns            5
BM_gemm_cblas/3/3/3_stddev           0.021 ns        0.021 ns            5
BM_gemm_cblas/5/5/5                    472 ns          472 ns      1483886
BM_gemm_cblas/5/5/5                    472 ns          472 ns      1483886
BM_gemm_cblas/5/5/5                    472 ns          472 ns      1483886
BM_gemm_cblas/5/5/5                    472 ns          472 ns      1483886
BM_gemm_cblas/5/5/5                    472 ns          472 ns      1483886
BM_gemm_cblas/5/5/5_mean               472 ns          472 ns            5
BM_gemm_cblas/5/5/5_median             472 ns          472 ns            5
BM_gemm_cblas/5/5/5_stddev           0.380 ns        0.380 ns            5
BM_gemm_cblas/10/10/10                 841 ns          841 ns       817796
BM_gemm_cblas/10/10/10                 841 ns          841 ns       817796
BM_gemm_cblas/10/10/10                 841 ns          841 ns       817796
BM_gemm_cblas/10/10/10                 841 ns          841 ns       817796
BM_gemm_cblas/10/10/10                 841 ns          841 ns       817796
BM_gemm_cblas/10/10/10_mean            841 ns          841 ns            5
BM_gemm_cblas/10/10/10_median          841 ns          841 ns            5
BM_gemm_cblas/10/10/10_stddev        0.249 ns        0.249 ns            5
BM_gemm_cblas/20/20/20                1905 ns         1905 ns       364260
BM_gemm_cblas/20/20/20                1905 ns         1905 ns       364260
BM_gemm_cblas/20/20/20                1904 ns         1904 ns       364260
BM_gemm_cblas/20/20/20                1921 ns         1921 ns       364260
BM_gemm_cblas/20/20/20                1922 ns         1921 ns       364260
BM_gemm_cblas/20/20/20_mean           1911 ns         1911 ns            5
BM_gemm_cblas/20/20/20_median         1905 ns         1905 ns            5
BM_gemm_cblas/20/20/20_stddev         9.13 ns         9.14 ns            5
BM_gemm_cblas/30/30/30                4240 ns         4240 ns       165046
BM_gemm_cblas/30/30/30                4240 ns         4239 ns       165046
BM_gemm_cblas/30/30/30                4239 ns         4239 ns       165046
BM_gemm_cblas/30/30/30                4239 ns         4239 ns       165046
BM_gemm_cblas/30/30/30                4238 ns         4238 ns       165046
BM_gemm_cblas/30/30/30_mean           4239 ns         4239 ns            5
BM_gemm_cblas/30/30/30_median         4239 ns         4239 ns            5
BM_gemm_cblas/30/30/30_stddev        0.559 ns        0.554 ns            5

One can see that, according to the benchmark, the execution time of BLASFEO dgemm() varies a lot, whereas the execution time of the OpenBLAS implementation is very stable.

mkatliar commented 5 years ago

I am not sure whether the issue is related to the details of how time is measured in the benchmark, but the difference is surprising, and I have not seen such strange results when benchmarking non-BLASFEO code.

mkatliar commented 5 years ago

With the --v=2 benchmark option I can get some more information.

Command:

./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --v=2

Output

[...]
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1
-- LOG(2): Ran in 1.1109e-05/1.1244e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10
-- LOG(2): Ran in 9.1714e-05/9.1772e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 100
-- LOG(2): Ran in 0.000906435/0.000906326
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1000
-- LOG(2): Ran in 0.00914121/0.00914061
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10000
-- LOG(2): Ran in 0.0923856/0.0924196
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687447/0.687481
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687436/0.687472
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687106/0.687139
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252108/0.25214
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252166/0.252203
BM_gemm_blasfeo/10/10/10              9073 ns         9073 ns        75769
BM_gemm_blasfeo/10/10/10              9073 ns         9073 ns        75769
BM_gemm_blasfeo/10/10/10              9069 ns         9068 ns        75769
BM_gemm_blasfeo/10/10/10              3328 ns         3327 ns        75769
BM_gemm_blasfeo/10/10/10              3329 ns         3328 ns        75769
BM_gemm_blasfeo/10/10/10_mean         6774 ns         6774 ns            5
BM_gemm_blasfeo/10/10/10_median       9069 ns         9068 ns            5
BM_gemm_blasfeo/10/10/10_stddev       3146 ns         3146 ns            5

You can see that the benchmark was first run with an increasing iteration count to determine the required number of iterations (75769). It then ran 5 batches of 75769 calls to gemm each, where the first three batches executed in about 0.69 s and the last two in about 0.25 s.

mkatliar commented 5 years ago

The issue does not reproduce if BLASFEO is compiled with LA = REFERENCE instead of LA = HIGH_PERFORMANCE.

mkatliar commented 5 years ago

The prefetch instructions are my prime suspect:

ROUTINE ======================== inner_kernel_dgemm_nn_4x4_lib4
16669307243 16669307243 (flat, cum) 26.66% of Total
         .          .      56e30: cmp    $0x0,%r10d                       ;inner_kernel_dgemm_nn_4x4_lib4 
         .          .      56e34: jle    570a0 <inner_kernel_dgemm_nn_4x4_lib4+0x270>
         .          .      56e3a: vxorpd %ymm4,%ymm4,%ymm4
         .          .      56e3e: vmovapd %ymm4,%ymm5
  14330999   14330999      56e42: vmovapd %ymm4,%ymm6                     ;inner_kernel_dgemm_nn_4x4_lib4 ??
         .          .      56e46: vmovapd %ymm4,%ymm7                     ;inner_kernel_dgemm_nn_4x4_lib4 
         .          .      56e4a: vmovapd (%r11),%ymm13
         .          .      56e4f: cmp    $0x4,%r10d
         .          .      56e53: jle    56f59 <inner_kernel_dgemm_nn_4x4_lib4+0x129>
  14337096   14337096      56e59: nopl   0x0(%rax)                        ;inner_kernel_dgemm_nn_4x4_lib4 ??
  34239958   34239958      56e60: prefetcht0 (%r12,%r13,2)
  86003288   86003288      56e65: prefetcht0 0x40(%r12,%r13,2)
  39820550   39820550      56e6b: vbroadcastsd (%r12),%ymm12
1414449993 1414449993      56e71: vfmadd231pd %ymm13,%ymm12,%ymm0
 273971468  273971468      56e76: vmovapd 0x20(%r11),%ymm14
  78039046   78039046      56e7c: vbroadcastsd 0x20(%r12),%ymm12
mkatliar commented 5 years ago

I commented out all prefetcht0 instructions in the BLASFEO code and rebuilt BLASFEO, but the issue still reproduces.

giaf commented 5 years ago

Is the issue happening for the HIGH_PERFORMANCE targets INTEL_SANDY_BRIDGE, INTEL_CORE, and AMD_BULLDOZER?

mkatliar commented 5 years ago

The important difference between the benchmark runs is where the memory is allocated. I found out that the alignment of the array data has a dramatic effect on the timing.

Below are the results from benchmark runs with different alignment of the allocated memory (last parameter). You can see that when the alignment is 0x20 the time varies a lot, whereas when it is 0x1000 it does not.

--------------------------------------------------------------------------------
Benchmark                                      Time             CPU   Iterations
--------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32                     123 ns          123 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    33.5 ns         33.5 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    80.4 ns         80.4 ns     10000000
BM_gemm_blasfeo/2/2/2/32                    80.4 ns         80.4 ns     10000000
BM_gemm_blasfeo/2/2/2/32                     210 ns          210 ns     10000000
BM_gemm_blasfeo/2/2/2/32_mean                105 ns          105 ns            5
BM_gemm_blasfeo/2/2/2/32_median             80.4 ns         80.4 ns            5
BM_gemm_blasfeo/2/2/2/32_stddev             66.4 ns         66.4 ns            5
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096                  33.6 ns         33.6 ns     20821821
BM_gemm_blasfeo/2/2/2/4096_mean             33.6 ns         33.6 ns            5
BM_gemm_blasfeo/2/2/2/4096_median           33.6 ns         33.6 ns            5
BM_gemm_blasfeo/2/2/2/4096_stddev          0.015 ns        0.015 ns            5
giaf commented 5 years ago

For the BLASFEO API, 32 bytes is the minimum alignment to avoid segfaults, but 64 bytes is the alignment assumed for performance. In any case, performance deviations should not be that large.

On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation. In your script the memory allocation happens differently for the BLASFEO case (alignedAlloc) and for the CBLAS case (std::make_unique). Additionally, depending on when all these small memory chunks are deallocated, this may considerably pollute the memory managed by the allocator.

In summary, it is not clear to me whether the difference in computational time comes from the execution of the BLASFEO routine or from the memory allocation routine.

mkatliar commented 5 years ago

Here is the updated benchmark, where the alignment is added as an extra parameter: gemm-benchmark.zip

The results from the run

./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --benchmark_report_aggregates_only=true

are below:

-------------------------------------------------------------------------------
Benchmark                                     Time             CPU   Iterations
-------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32_mean              80.2 ns         80.2 ns            5
BM_gemm_blasfeo/2/2/2/32_median            80.8 ns         80.8 ns            5
BM_gemm_blasfeo/2/2/2/32_stddev            31.4 ns         31.4 ns            5
BM_gemm_blasfeo/3/3/3/32_mean               194 ns          194 ns            5
BM_gemm_blasfeo/3/3/3/32_median             214 ns          214 ns            5
BM_gemm_blasfeo/3/3/3/32_stddev             164 ns          164 ns            5
BM_gemm_blasfeo/5/5/5/32_mean               460 ns          460 ns            5
BM_gemm_blasfeo/5/5/5/32_median             376 ns          376 ns            5
BM_gemm_blasfeo/5/5/5/32_stddev             239 ns          239 ns            5
BM_gemm_blasfeo/10/10/10/32_mean           3584 ns         3584 ns            5
BM_gemm_blasfeo/10/10/10/32_median         3775 ns         3775 ns            5
BM_gemm_blasfeo/10/10/10/32_stddev          285 ns          285 ns            5
BM_gemm_blasfeo/20/20/20/32_mean            514 ns          514 ns            5
BM_gemm_blasfeo/20/20/20/32_median          515 ns          515 ns            5
BM_gemm_blasfeo/20/20/20/32_stddev         4.86 ns         4.86 ns            5
BM_gemm_blasfeo/30/30/30/32_mean          10297 ns        10297 ns            5
BM_gemm_blasfeo/30/30/30/32_median         6358 ns         6358 ns            5
BM_gemm_blasfeo/30/30/30/32_stddev         9572 ns         9572 ns            5
BM_gemm_blasfeo/2/2/2/4096_mean             726 ns          726 ns            5
BM_gemm_blasfeo/2/2/2/4096_median           726 ns          726 ns            5
BM_gemm_blasfeo/2/2/2/4096_stddev         0.432 ns        0.431 ns            5
BM_gemm_blasfeo/3/3/3/4096_mean             729 ns          729 ns            5
BM_gemm_blasfeo/3/3/3/4096_median           729 ns          729 ns            5
BM_gemm_blasfeo/3/3/3/4096_stddev         0.096 ns        0.093 ns            5
BM_gemm_blasfeo/5/5/5/4096_mean            2475 ns         2475 ns            5
BM_gemm_blasfeo/5/5/5/4096_median          2475 ns         2475 ns            5
BM_gemm_blasfeo/5/5/5/4096_stddev         0.321 ns        0.321 ns            5
BM_gemm_blasfeo/10/10/10/4096_mean         3774 ns         3774 ns            5
BM_gemm_blasfeo/10/10/10/4096_median       3774 ns         3774 ns            5
BM_gemm_blasfeo/10/10/10/4096_stddev       2.18 ns         2.18 ns            5
BM_gemm_blasfeo/20/20/20/4096_mean          507 ns          507 ns            5
BM_gemm_blasfeo/20/20/20/4096_median        507 ns          507 ns            5
BM_gemm_blasfeo/20/20/20/4096_stddev      0.024 ns        0.025 ns            5
BM_gemm_blasfeo/30/30/30/4096_mean         4270 ns         4270 ns            5
BM_gemm_blasfeo/30/30/30/4096_median       4270 ns         4270 ns            5
BM_gemm_blasfeo/30/30/30/4096_stddev      0.447 ns        0.467 ns            5

You can see that when the alignment is 32 the standard deviation of the time is huge, while with an alignment of 4096 it is very small.

However, surprisingly, with the large alignment the performance is often many times worse than with the small one; compare:

BM_gemm_blasfeo/2/2/2/32_mean              80.2 ns         80.2 ns            5
BM_gemm_blasfeo/2/2/2/4096_mean             726 ns          726 ns            5
...
BM_gemm_blasfeo/5/5/5/32_mean               460 ns          460 ns            5
BM_gemm_blasfeo/5/5/5/4096_mean            2475 ns         2475 ns            5

This is very puzzling.

mkatliar commented 5 years ago

On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation.

No, it is not. It benchmarks only the call to the routine. If you look in the source, only what is inside the for (auto _ : state) loop is benchmarked. The rest is just preparation for the benchmark (which happens once per batch but is not timed).
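
For readers unfamiliar with Google Benchmark, here is a minimal sketch of this structure (not the actual benchmark code; gemm_under_test is a hypothetical stand-in for the routine being measured): everything before the for (auto _ : state) loop is per-batch setup and is not timed, only the loop body is measured.

#include <benchmark/benchmark.h>
#include <vector>

// Hypothetical stand-in for the routine being measured
// (the real benchmark calls the BLASFEO and CBLAS gemm routines instead).
static void gemm_under_test(int m, int n, int k,
                            double const * A, double const * B, double * C)
{
    // Naive column-major C = A * B, just so the sketch is self-contained.
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
        {
            double s = 0.;
            for (int l = 0; l < k; ++l)
                s += A[i + l * m] * B[l + j * k];
            C[i + j * m] = s;
        }
}

static void BM_gemm_example(benchmark::State& state)
{
    int const m = state.range(0), n = state.range(1), k = state.range(2);

    // Untimed preparation: executed once per batch, outside the measured loop.
    std::vector<double> A(m * k, 1.), B(k * n, 1.), C(m * n, 0.);

    for (auto _ : state)
    {
        // Only the body of this loop contributes to the reported time.
        gemm_under_test(m, n, k, A.data(), B.data(), C.data());
        benchmark::DoNotOptimize(C.data());
    }
}
BENCHMARK(BM_gemm_example)->Args({10, 10, 10});
BENCHMARK_MAIN();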

giaf commented 5 years ago

OK, sorry, I didn't get how that smart auto-benchmarking loop works. Alright then: memory allocation is not benchmarked, only the linear algebra routine call.

mkatliar commented 5 years ago

It seems that it is not the alignment itself that matters, but the actual address of the data arrays. This is confirmed by the following test, where the alignment is kept at 0x20 bytes but the arrays are reused between different runs of the benchmark with the same sizes:

-----------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean             124 ns          124 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median           124 ns          124 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev         0.061 ns        0.061 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean             557 ns          557 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median           557 ns          557 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev         0.658 ns        0.661 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean            2506 ns         2505 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median          2506 ns         2506 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev         0.373 ns        0.379 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean         4280 ns         4280 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median       4280 ns         4280 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev       1.19 ns         1.19 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean          511 ns          511 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median        511 ns          511 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev      0.149 ns        0.147 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean         1641 ns         1641 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median       1641 ns         1641 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev      0.540 ns        0.538 ns            5

You can see that keeping the data location fixed removes the variability of the timings.

However, one would expect the same performance regardless of the location of the data, which is not the case.

Interestingly, you can see from the results that the time does not necessarily increase with the size of the problem. This is probably caused by the large variation of the time due to different data locations.

mkatliar commented 5 years ago

Although re-running the benchmarks gives very similar results, running them in reverse order gives very different results, probably due to a different placement of the memory chunks:

------------------------------------------------------------------------------------------
Benchmark                                                Time             CPU   Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean         9420 ns         9420 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median       9427 ns         9427 ns            5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev       36.4 ns         36.5 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean          508 ns          508 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median        507 ns          507 ns            5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev      0.355 ns        0.354 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean         4781 ns         4781 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median       4781 ns         4780 ns            5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev       2.00 ns         1.97 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean            2297 ns         2297 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median          2297 ns         2297 ns            5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev         0.048 ns        0.043 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean             341 ns          341 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median           341 ns          341 ns            5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev         0.026 ns        0.025 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean             423 ns          423 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median           423 ns          423 ns            5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev         0.037 ns        0.037 ns            5
mkatliar commented 5 years ago

The current version of the benchmark sources: gemm-benchmark.zip

mkatliar commented 5 years ago

The same behavior is observed on a machine with an AMD Bulldozer CPU and BLASFEO built with TARGET=X64_AMD_BULLDOZER.

mkatliar commented 5 years ago

For syscop members, the benchmark code is accessible in the tmpc repo:

git clone git@gitlab.syscop.de:mikhail.katliar/tmpc.git -b 44-mpipm
cd tmpc/bench/blasfeo
make
./bench

You need to install Google Benchmark first; see https://github.com/google/benchmark#installation

mkatliar commented 5 years ago

An explanation of the difference between BM_gemm_blasfeo and BM_gemm_blasfeo_reuse_memory.

The benchmark is run as follows. For every combination of parameters (m, n, k, alignment), a loop in which the function is called (let's call it a batch) is run several times, and the mean, median, and standard deviation across batches are computed. In BM_gemm_blasfeo the matrices are allocated anew for every batch, whereas in BM_gemm_blasfeo_reuse_memory the same arrays (and therefore the same addresses) are reused across batches with the same sizes.

giaf commented 5 years ago

This makes it clear that the performance penalty comes from the specific memory address chosen.

I would like to clarify that your chosen alignment, 32 bytes, is the minimum for correct behavior (i.e. to avoid segmentation faults), but it is not the one recommended by the BLASFEO API. As you can see e.g. here https://github.com/giaf/blasfeo/blob/master/auxiliary/blasfeo_stdlib.c#L36, if you use the BLASFEO functionality to allocate aligned memory, the address is aligned to the typical cache line size, i.e. 64 bytes.

Having a memory address aligned to 32 bytes but not to 64 bytes is expected to slightly increase the computational cost: the matrix then occupies more cache lines in memory, and therefore more data has to be moved through the cache hierarchy.

But the effect seen in some of your tests is much larger than expected. A possible explanation is that sometimes you also hit memory page boundaries, or something along those lines.
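
A minimal sketch of allocating a blasfeo_dmat on memory aligned to the recommended 64-byte cache-line boundary (assuming the blasfeo_create_dmat API discussed later in this thread, its companion size function blasfeo_memsize_dmat, and POSIX posix_memalign; the umbrella header name may differ between BLASFEO versions):

#include <cstdlib>
#include <blasfeo.h>   // assumed umbrella header; may differ between BLASFEO versions

// Create an m-by-n blasfeo_dmat on memory aligned to a 64-byte cache line,
// as recommended above (32 bytes is only the minimum to avoid segfaults).
blasfeo_dmat create_dmat_cacheline_aligned(int m, int n)
{
    void * mem = nullptr;
    if (posix_memalign(&mem, 0x40, blasfeo_memsize_dmat(m, n)) != 0)   // POSIX aligned allocation
        std::abort();

    blasfeo_dmat sA;
    blasfeo_create_dmat(m, n, &sA, mem);
    return sA;
}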

mkatliar commented 5 years ago

The same behavior is observed with an alignment of 0x40 bytes.

mkatliar commented 5 years ago

"Sometimes you also hit memory page boundaries" would also occur in the BM_gemm_cblas, but we don't see such effect.

giaf commented 5 years ago

would also occur in the BM_gemm_cblas, but we don't see such effect.

That depends on the implementation. E.g. OpenBLAS typically packs all data into internally allocated buffers (likely properly aligned), so unaligned data is accessed only once.

Conversely, BLASFEO (and especially the BLASFEO API) operates natively on some or all of the matrix arguments, so unaligned data is accessed multiple times.

mkatliar commented 5 years ago

But do you agree that the described behavior of BLASFEO should be fixed?

giaf commented 5 years ago

You could try out MKL instead of OpenBLAS, for the 'NN' and 'NT' variants of the Fortran BLAS API (so check which the corresponding variants are through the CBLAS API).

Depending on the outcome, we can speak about "fixes".
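
For reference, a sketch of the correspondence between the two APIs (standard BLAS/CBLAS signatures, not code from this benchmark; with MKL the declarations typically come from mkl.h rather than cblas.h, and the trailing-underscore Fortran symbol name is only the common convention): the Fortran 'N','N' variant maps to CblasNoTrans/CblasNoTrans, and 'N','T' maps to CblasNoTrans/CblasTrans.

#include <cblas.h>   // with MKL one would typically include mkl.h instead

// Fortran BLAS API: column-major, all arguments passed by pointer.
extern "C" void dgemm_(char const * transa, char const * transb,
                       int const * m, int const * n, int const * k,
                       double const * alpha, double const * A, int const * lda,
                       double const * B, int const * ldb,
                       double const * beta, double * C, int const * ldc);

void gemm_nn(int m, int n, int k, double const * A, double const * B, double * C)
{
    double const alpha = 1., beta = 0.;

    // 'NN' variant through the Fortran BLAS API ...
    dgemm_("N", "N", &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    // ... and the equivalent call through the CBLAS API
    // (the 'NT' variant would use CblasTrans for B instead).
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, m, B, k, beta, C, m);
}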

mkatliar commented 5 years ago

Well, now that we know about this issue, one can no longer say anything meaningful about BLASFEO performance. Instead of saying "this function executes in 30 ns" one should say "this function executes in between 30 and 300, or maybe 500 ns, depending on where you allocate the memory, and we don't know how it depends on that".

giaf commented 5 years ago

Before undertaking a complete reimplementation of BLASFEO (since the choices about internal memory formats are at its very foundations), I will investigate the issue personally and with independent benchmarks.

If there is some "bug" to be fixes (i.e. something specific and non-fundamental which triggers the behavior) it will be fixed, but if this only comes form the implementation choices, I guess a note about providing a minimum alignment for maximum performance will make it.

In some benchmarks I did in the past, I encountered some performance degradation in MKL if the memory is not aligned, so if MKL is good with that, I'm good too ;)

Thanks for the deep investigation of the issue, and I'll keep you updated on the next steps!

mkatliar commented 5 years ago

Results with MKL BLAS (Fortran API, although I don't see why the Fortran API vs. the CBLAS API would make a difference):

-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
BM_gemm_blas/2/2/2_mean             155 ns          155 ns            5
BM_gemm_blas/2/2/2_median           155 ns          155 ns            5
BM_gemm_blas/2/2/2_stddev         0.037 ns        0.036 ns            5
BM_gemm_blas/3/3/3_mean             162 ns          162 ns            5
BM_gemm_blas/3/3/3_median           162 ns          162 ns            5
BM_gemm_blas/3/3/3_stddev         0.023 ns        0.020 ns            5
BM_gemm_blas/5/5/5_mean             186 ns          186 ns            5
BM_gemm_blas/5/5/5_median           186 ns          186 ns            5
BM_gemm_blas/5/5/5_stddev         0.018 ns        0.019 ns            5
BM_gemm_blas/10/10/10_mean          265 ns          265 ns            5
BM_gemm_blas/10/10/10_median        265 ns          265 ns            5
BM_gemm_blas/10/10/10_stddev      0.018 ns        0.018 ns            5
BM_gemm_blas/20/20/20_mean          727 ns          727 ns            5
BM_gemm_blas/20/20/20_median        729 ns          729 ns            5
BM_gemm_blas/20/20/20_stddev       2.33 ns         2.33 ns            5
BM_gemm_blas/30/30/30_mean         1738 ns         1738 ns            5
BM_gemm_blas/30/30/30_median       1738 ns         1738 ns            5
BM_gemm_blas/30/30/30_stddev       1.58 ns         1.59 ns            5
mkatliar commented 5 years ago

@giaf I would appreciate it if you did independent benchmarking. I have created a separate repo with my benchmark code here: https://github.com/mkatliar/blasfeo_benchmark. Maybe it is a good place to start from and to add your independent benchmarks.

giaf commented 5 years ago

As investigated here https://github.com/mkatliar/blasfeo_benchmark/issues/4#issuecomment-534905197, the issue is due to the presence of denormals in the memory passed to BLASFEO to create the matrices. For matrices whose size is not a multiple of the minimum kernel size (usually 4x4), the matrix padded up to a multiple of 4x4 is actually used for computation, and the extra computed elements are discarded when the result is written from registers into memory. Therefore, if there are denormals in the "dark" part of the memory of the padded matrix, they do not affect the correctness of the result, but they may severely slow down the computations, as in this case.

As to why there are denormals in the memory at all (and consistently across different test machines and OSs), this may be due to what the Google Benchmark library itself does.
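
A user-side workaround consistent with this explanation (a sketch under the same API assumptions as above, i.e. blasfeo_memsize_dmat and blasfeo_create_dmat; it is not the thread's agreed-upon fix): zeroing the whole buffer before wrapping it in a blasfeo_dmat guarantees that the padding ("dark") elements beyond the 4x4 panels contain no stale denormals.

#include <cstdlib>
#include <cstring>
#include <blasfeo.h>   // assumed umbrella header; may differ between BLASFEO versions

// Allocate a 64-byte-aligned buffer, zero it (padding included), then create the matrix on it.
blasfeo_dmat create_dmat_zeroed(int m, int n)
{
    std::size_t const size = blasfeo_memsize_dmat(m, n);
    void * mem = nullptr;
    if (posix_memalign(&mem, 0x40, size) != 0)
        std::abort();
    std::memset(mem, 0, size);   // also clears the padding elements beyond the 4x4 panels

    blasfeo_dmat sA;
    blasfeo_create_dmat(m, n, &sA, mem);
    return sA;
}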

mkatliar commented 5 years ago

It is BLASFEO's responsibility to ensure that the matrix structure is fully functional after blasfeo_create_dmat(). The client's responsibility is to allocate the memory and to properly set the elements of the matrices.

I don't consider this issue "Closed". To fix it, blasfeo_create_dmat() needs to be changed such that the padding elements are initialized to 0.

giaf commented 5 years ago

We will write a "best practice" in the BLASFEO repo, but what you propose is not the solution.

If you know what you are doing, you can put the data in the memory beforehand and then pass it to blasfeo_create_dmat, which simply wraps a BLASFEO struct around it so that the BLASFEO API routines can be called. We use this trick in some of our projects (and I know what to do, since I wrote it). Setting the "padding" memory to zero would break this.

giaf commented 5 years ago

On top of that, blasfeo_create_dmat is very lightweight and does not touch the data at all, also for performance reasons. E.g. in acados, the workspace is allocated once, but BLASFEO matrices are repeatedly created on that memory. Memory allocation happens "off-line", and therefore it is cheap to zero the memory out there, while the creation of BLASFEO structs happens "on-line" and therefore must be cheap by design.
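
A sketch of the usage pattern described above (the acados code itself is not shown; same blasfeo_dmat API assumptions as in the earlier sketches): the workspace is allocated, and can cheaply be zeroed, once off-line, while blasfeo_create_dmat() is called repeatedly on-line and therefore must not touch the data.

#include <cstdlib>
#include <cstring>
#include <blasfeo.h>   // assumed umbrella header; may differ between BLASFEO versions

// Off-line: allocate (and, if desired, zero) the workspace once.
void * allocate_workspace(int m, int n)
{
    std::size_t const size = blasfeo_memsize_dmat(m, n);
    void * mem = nullptr;
    if (posix_memalign(&mem, 0x40, size) != 0)
        std::abort();
    std::memset(mem, 0, size);   // cheap here, since it happens only once
    return mem;
}

// On-line: wrap a blasfeo_dmat around the existing workspace on every iteration;
// blasfeo_create_dmat() only fills in the struct and never touches the data.
void online_iteration(void * workspace, int m, int n)
{
    blasfeo_dmat sA;
    blasfeo_create_dmat(m, n, &sA, workspace);
    // ... call BLASFEO API routines on sA ...
}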

mkatliar commented 5 years ago

If you know what you are doing, you can put the data in the memory beforehand and then pass it to blasfeo_create_dmat, which simply wraps a BLASFEO struct around it so that the BLASFEO API routines can be called.

"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the blasfeo_dmat struct and the related functions.

The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her time), you create these potential problems for the sake of vague performance considerations.

giaf commented 5 years ago

"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the blasfeo_dmat struct and the related functions.

Sure, it is undocumented exactly because it is unsafe. Nonetheless, since we are the developers of BLASFEO, we can use knowledge about its internals in our own software.

giaf commented 5 years ago

The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her time), you create these potential problems for the sake of vague performance considerations.

Just to be clear, the results computed by the routine are CORRECT. This does not create potential problems for potentially many users in many places.

Computational speed can be lower if some good-practice rules are not followed. We were aware of slight performance drops in the case of unaligned memory, but they are on the order of a few percent and therefore not of interest to the normal user.

We were not aware of the large performance penalty (on the order of 10x or more) that can arise from having denormals in the padding memory. Now that we are aware of this, we will add a note to make sure that other users can avoid this issue.

We thank you for your performance issue report and for the effort and help in finding its cause. The time you invested will not be wasted.