I am not sure whether the issue is related or not to the details of how the time is measured in the benchmark. But the difference is surprising, and I have not seen such strange results when benchmarking non-blasfeo code.
With the --v=2 benchmark option I can get some more info.
Command:
./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --v=2
Output
[...]
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1
-- LOG(2): Ran in 1.1109e-05/1.1244e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10
-- LOG(2): Ran in 9.1714e-05/9.1772e-05
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 100
-- LOG(2): Ran in 0.000906435/0.000906326
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 1000
-- LOG(2): Ran in 0.00914121/0.00914061
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 10000
-- LOG(2): Ran in 0.0923856/0.0924196
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687447/0.687481
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687436/0.687472
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.687106/0.687139
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252108/0.25214
-- LOG(2): Running BM_gemm_blasfeo/10/10/10 for 75769
-- LOG(2): Ran in 0.252166/0.252203
BM_gemm_blasfeo/10/10/10 9073 ns 9073 ns 75769
BM_gemm_blasfeo/10/10/10 9073 ns 9073 ns 75769
BM_gemm_blasfeo/10/10/10 9069 ns 9068 ns 75769
BM_gemm_blasfeo/10/10/10 3328 ns 3327 ns 75769
BM_gemm_blasfeo/10/10/10 3329 ns 3328 ns 75769
BM_gemm_blasfeo/10/10/10_mean 6774 ns 6774 ns 5
BM_gemm_blasfeo/10/10/10_median 9069 ns 9068 ns 5
BM_gemm_blasfeo/10/10/10_stddev 3146 ns 3146 ns 5
You can see that the benchmark was run with an increasing iteration count until the required number of iterations was determined to be 75769. It then ran 5 batches of 75769 calls to gemm, where the first three batches executed in about 0.69 s and the last two in 0.25 s.
The issue is not reproduced if BLASFEO is compiled with LA = REFERENCE instead of LA = HIGH_PERFORMANCE.
My suspicion is the prefetch instructions:
ROUTINE ======================== inner_kernel_dgemm_nn_4x4_lib4
16669307243 16669307243 (flat, cum) 26.66% of Total
. . 56e30: cmp $0x0,%r10d ;inner_kernel_dgemm_nn_4x4_lib4
. . 56e34: jle 570a0 <inner_kernel_dgemm_nn_4x4_lib4+0x270>
. . 56e3a: vxorpd %ymm4,%ymm4,%ymm4
. . 56e3e: vmovapd %ymm4,%ymm5
14330999 14330999 56e42: vmovapd %ymm4,%ymm6 ;inner_kernel_dgemm_nn_4x4_lib4 ??
. . 56e46: vmovapd %ymm4,%ymm7 ;inner_kernel_dgemm_nn_4x4_lib4
. . 56e4a: vmovapd (%r11),%ymm13
. . 56e4f: cmp $0x4,%r10d
. . 56e53: jle 56f59 <inner_kernel_dgemm_nn_4x4_lib4+0x129>
14337096 14337096 56e59: nopl 0x0(%rax) ;inner_kernel_dgemm_nn_4x4_lib4 ??
34239958 34239958 56e60: prefetcht0 (%r12,%r13,2)
86003288 86003288 56e65: prefetcht0 0x40(%r12,%r13,2)
39820550 39820550 56e6b: vbroadcastsd (%r12),%ymm12
1414449993 1414449993 56e71: vfmadd231pd %ymm13,%ymm12,%ymm0
273971468 273971468 56e76: vmovapd 0x20(%r11),%ymm14
78039046 78039046 56e7c: vbroadcastsd 0x20(%r12),%ymm12
I commented out all prefetcht0 instructions in the BLASFEO code and rebuilt BLASFEO, but the issue is still reproduced.
Is the issue happening for the HIGH_PERFORMANCE targets INTEL_SANDY_BRIDGE, INTEL_CORE and AMD_BULLDOZER?
The important difference between the benchmark runs is where the memory is allocated. I found out that the alignment of the array data has a dramatic effect on the time.
Below are the results from the benchmark run with different allocated-memory alignments (the last parameter). You can see that when the alignment is 0x20 the time varies a lot, while when it is 0x1000 it does not.
--------------------------------------------------------------------------------
Benchmark Time CPU Iterations
--------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32 123 ns 123 ns 10000000
BM_gemm_blasfeo/2/2/2/32 33.5 ns 33.5 ns 10000000
BM_gemm_blasfeo/2/2/2/32 80.4 ns 80.4 ns 10000000
BM_gemm_blasfeo/2/2/2/32 80.4 ns 80.4 ns 10000000
BM_gemm_blasfeo/2/2/2/32 210 ns 210 ns 10000000
BM_gemm_blasfeo/2/2/2/32_mean 105 ns 105 ns 5
BM_gemm_blasfeo/2/2/2/32_median 80.4 ns 80.4 ns 5
BM_gemm_blasfeo/2/2/2/32_stddev 66.4 ns 66.4 ns 5
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096 33.6 ns 33.6 ns 20821821
BM_gemm_blasfeo/2/2/2/4096_mean 33.6 ns 33.6 ns 5
BM_gemm_blasfeo/2/2/2/4096_median 33.6 ns 33.6 ns 5
BM_gemm_blasfeo/2/2/2/4096_stddev 0.015 ns 0.015 ns 5
For the BLASFEO API, 32 bytes is the minimum alignment to avoid segfaults, while 64 bytes is the alignment assumed for performance. Still, performance deviations should not be that large.
On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation. And in your script the memory allocation happens differently for the BLASFEO case (alignedAlloc) and for the CBLAS case (std::make_unique). Additionally, depending on when all these small memory chunks are garbage collected, this may significantly pollute the memory managed by the allocator.
In summary, it is not clear to me whether the difference in computational time comes from the execution of the BLASFEO routine or from the memory allocation routine.
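For illustration, the two allocation paths could look roughly like this (the name alignedAlloc comes from the benchmark script mentioned above; the actual implementation there may differ, so treat this only as a sketch of the alignment guarantees involved):

#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical aligned allocator in the spirit of the alignedAlloc helper
// mentioned above: the alignment (e.g. 0x20 or 0x1000) is explicit.
inline double * alignedAllocSketch(std::size_t n, std::size_t alignment)
{
    void * p = nullptr;
    if (posix_memalign(&p, alignment, n * sizeof(double)) != 0)
        return nullptr;
    return static_cast<double *>(p);
}

// std::make_unique<double[]> goes through operator new[], which only guarantees
// alignof(std::max_align_t) (typically 16 bytes); it also value-initializes
// the elements to 0.0.
inline std::unique_ptr<double[]> defaultAllocSketch(std::size_t n)
{
    return std::make_unique<double[]>(n);
}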
Here is the updated benchmark, where alignment is added as an extra parameter. gemm-benchmark.zip
The results from the run
./bench --benchmark_filter=BM_gemm_blasfeo* --benchmark_repetitions=5 --benchmark_report_aggregates_only=true
are below:
-------------------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------------------
BM_gemm_blasfeo/2/2/2/32_mean 80.2 ns 80.2 ns 5
BM_gemm_blasfeo/2/2/2/32_median 80.8 ns 80.8 ns 5
BM_gemm_blasfeo/2/2/2/32_stddev 31.4 ns 31.4 ns 5
BM_gemm_blasfeo/3/3/3/32_mean 194 ns 194 ns 5
BM_gemm_blasfeo/3/3/3/32_median 214 ns 214 ns 5
BM_gemm_blasfeo/3/3/3/32_stddev 164 ns 164 ns 5
BM_gemm_blasfeo/5/5/5/32_mean 460 ns 460 ns 5
BM_gemm_blasfeo/5/5/5/32_median 376 ns 376 ns 5
BM_gemm_blasfeo/5/5/5/32_stddev 239 ns 239 ns 5
BM_gemm_blasfeo/10/10/10/32_mean 3584 ns 3584 ns 5
BM_gemm_blasfeo/10/10/10/32_median 3775 ns 3775 ns 5
BM_gemm_blasfeo/10/10/10/32_stddev 285 ns 285 ns 5
BM_gemm_blasfeo/20/20/20/32_mean 514 ns 514 ns 5
BM_gemm_blasfeo/20/20/20/32_median 515 ns 515 ns 5
BM_gemm_blasfeo/20/20/20/32_stddev 4.86 ns 4.86 ns 5
BM_gemm_blasfeo/30/30/30/32_mean 10297 ns 10297 ns 5
BM_gemm_blasfeo/30/30/30/32_median 6358 ns 6358 ns 5
BM_gemm_blasfeo/30/30/30/32_stddev 9572 ns 9572 ns 5
BM_gemm_blasfeo/2/2/2/4096_mean 726 ns 726 ns 5
BM_gemm_blasfeo/2/2/2/4096_median 726 ns 726 ns 5
BM_gemm_blasfeo/2/2/2/4096_stddev 0.432 ns 0.431 ns 5
BM_gemm_blasfeo/3/3/3/4096_mean 729 ns 729 ns 5
BM_gemm_blasfeo/3/3/3/4096_median 729 ns 729 ns 5
BM_gemm_blasfeo/3/3/3/4096_stddev 0.096 ns 0.093 ns 5
BM_gemm_blasfeo/5/5/5/4096_mean 2475 ns 2475 ns 5
BM_gemm_blasfeo/5/5/5/4096_median 2475 ns 2475 ns 5
BM_gemm_blasfeo/5/5/5/4096_stddev 0.321 ns 0.321 ns 5
BM_gemm_blasfeo/10/10/10/4096_mean 3774 ns 3774 ns 5
BM_gemm_blasfeo/10/10/10/4096_median 3774 ns 3774 ns 5
BM_gemm_blasfeo/10/10/10/4096_stddev 2.18 ns 2.18 ns 5
BM_gemm_blasfeo/20/20/20/4096_mean 507 ns 507 ns 5
BM_gemm_blasfeo/20/20/20/4096_median 507 ns 507 ns 5
BM_gemm_blasfeo/20/20/20/4096_stddev 0.024 ns 0.025 ns 5
BM_gemm_blasfeo/30/30/30/4096_mean 4270 ns 4270 ns 5
BM_gemm_blasfeo/30/30/30/4096_median 4270 ns 4270 ns 5
BM_gemm_blasfeo/30/30/30/4096_stddev 0.447 ns 0.467 ns 5
You can see that when the alignment is 32 the standard deviation of the time is huge, while with the alignment of 4096 it is very small.
However, surprisingly, with the large alignment the performance is often many times worse than with the small one; compare:
BM_gemm_blasfeo/2/2/2/32_mean 80.2 ns 80.2 ns 5
BM_gemm_blasfeo/2/2/2/4096_mean 726 ns 726 ns 5
...
BM_gemm_blasfeo/5/5/5/32_mean 460 ns 460 ns 5
BM_gemm_blasfeo/5/5/5/4096_mean 2475 ns 2475 ns 5
This is very puzzling.
On top of that, what you benchmark is not just the call to the linear algebra routine, but also the memory allocation.
No, it is not. It benchmarks only the call to the routine. If you look in the source, only what is inside the for (auto _ : state) loop is benchmarked. The rest is just preparation of the benchmark (which happens once per batch, but is not timed).
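As an illustration (this is not the actual benchmark source, just a minimal sketch of the structure described above):

#include <benchmark/benchmark.h>
#include <vector>

static void BM_example(benchmark::State & state)
{
    // Untimed per-batch setup: e.g. allocating and filling the matrices.
    std::vector<double> A(100, 1.0), B(100, 1.0), C(100, 0.0);

    for (auto _ : state)
    {
        // Only this body is timed: the call to the routine under test,
        // e.g. blasfeo_dgemm_nn(...) or cblas_dgemm(...) on A, B, C.
        benchmark::DoNotOptimize(A.data());
        benchmark::DoNotOptimize(B.data());
        benchmark::DoNotOptimize(C.data());
    }
}
BENCHMARK(BM_example);
BENCHMARK_MAIN();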
OK, sorry, I didn't get how that smart auto-benchmarking loop works. Alright then: memory allocation is not benchmarked, only the linear algebra routine call is.
It seems that it is not the alignment itself that matters, but the actual addresses of the data arrays. This is confirmed by the following test, where the alignment is kept at 0x20 bytes, but the arrays are reused between different runs of the benchmark with the same sizes:
-----------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean 124 ns 124 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median 124 ns 124 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev 0.061 ns 0.061 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean 557 ns 557 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median 557 ns 557 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev 0.658 ns 0.661 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean 2506 ns 2505 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median 2506 ns 2506 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev 0.373 ns 0.379 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean 4280 ns 4280 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median 4280 ns 4280 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev 1.19 ns 1.19 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean 511 ns 511 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median 511 ns 511 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev 0.149 ns 0.147 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean 1641 ns 1641 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median 1641 ns 1641 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev 0.540 ns 0.538 ns 5
You can see that keeping the location of the data fixed removes the variability of the timings.
However, one would expect the same performance regardless of the location of the data, which is not the case.
Interestingly, you can also see from the results that the time does not necessarily increase with the size of the problem. This is probably caused by the large variation of the time due to different data locations.
Although re-running the benchmarks gives very similar results, running them in reverse order gives very different results, probably due to the changed placement of the memory chunks:
------------------------------------------------------------------------------------------
Benchmark Time CPU Iterations
------------------------------------------------------------------------------------------
BM_gemm_blasfeo_reuse_memory/30/30/30/32_mean 9420 ns 9420 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_median 9427 ns 9427 ns 5
BM_gemm_blasfeo_reuse_memory/30/30/30/32_stddev 36.4 ns 36.5 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_mean 508 ns 508 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_median 507 ns 507 ns 5
BM_gemm_blasfeo_reuse_memory/20/20/20/32_stddev 0.355 ns 0.354 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_mean 4781 ns 4781 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_median 4781 ns 4780 ns 5
BM_gemm_blasfeo_reuse_memory/10/10/10/32_stddev 2.00 ns 1.97 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_mean 2297 ns 2297 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_median 2297 ns 2297 ns 5
BM_gemm_blasfeo_reuse_memory/5/5/5/32_stddev 0.048 ns 0.043 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_mean 341 ns 341 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_median 341 ns 341 ns 5
BM_gemm_blasfeo_reuse_memory/3/3/3/32_stddev 0.026 ns 0.025 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_mean 423 ns 423 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_median 423 ns 423 ns 5
BM_gemm_blasfeo_reuse_memory/2/2/2/32_stddev 0.037 ns 0.037 ns 5
The current version of the benchmark sources: gemm-benchmark.zip
The same behavior is observed on a machine with an AMD Bulldozer CPU and BLASFEO built with TARGET=X64_AMD_BULLDOZER.
For syscop members, the benchmark code is accessible in the tmpc repo:
git clone git@gitlab.syscop.de:mikhail.katliar/tmpc.git -b 44-mpipm
cd tmpc/bench/blasfeo
make
./bench
You need to install Google Benchmark first, see https://github.com/google/benchmark#installation
An explanation of the difference between BM_gemm_blasfeo and BM_gemm_blasfeo_reuse_memory:
The benchmark is run as follows: for every combination of parameters (m, n, k, alignment), a loop in which the function is called (let's call it a batch) is run several times. Then the mean, median, and standard deviation across batches are computed.
BM_gemm_blasfeo: Before running every batch, new memory chunks are allocated.
BM_gemm_blasfeo_reuse_memory: All batches corresponding to the same set of parameters (m, n, k, alignment) use the same memory chunks.
This makes it clear that the performance penalty comes from the specific memory addresses chosen.
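For clarity, a rough sketch of the difference (not the actual sources; the buffer handling is simplified and the gemm call itself is omitted):

#include <benchmark/benchmark.h>
#include <cstdint>
#include <map>
#include <tuple>
#include <vector>

static void BM_gemm_blasfeo_sketch(benchmark::State & state)
{
    // Fresh chunk for every batch, allocated in the untimed setup.
    std::vector<double> mem(1024);
    for (auto _ : state)
        benchmark::DoNotOptimize(mem.data());   // the gemm call would go here
}

static void BM_gemm_blasfeo_reuse_memory_sketch(benchmark::State & state)
{
    using Key = std::tuple<std::int64_t, std::int64_t, std::int64_t, std::int64_t>;
    // One chunk per (m, n, k, alignment), shared by all batches with those parameters.
    static std::map<Key, std::vector<double>> cache;
    auto & mem = cache[Key{state.range(0), state.range(1), state.range(2), state.range(3)}];
    if (mem.empty())
        mem.resize(1024);
    for (auto _ : state)
        benchmark::DoNotOptimize(mem.data());   // the gemm call would go here
}

BENCHMARK(BM_gemm_blasfeo_sketch)->Args({10, 10, 10, 32});
BENCHMARK(BM_gemm_blasfeo_reuse_memory_sketch)->Args({10, 10, 10, 32});
BENCHMARK_MAIN();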
I would like to clarify that your chosen alignment, 32 bytes, is the minimum needed for correct behavior (i.e. to avoid segmentation faults), but it is not the one recommended in the BLASFEO API. As you can see e.g. here https://github.com/giaf/blasfeo/blob/master/auxiliary/blasfeo_stdlib.c#L36, if you use the BLASFEO functionality to allocate aligned memory, the address is aligned to typical cache-line boundaries, i.e. 64 bytes.
Having a memory address aligned to 32 bytes but not to 64 bytes is expected to slightly increase the computational cost: the matrix then occupies more cache lines, and therefore more data has to be moved through the cache hierarchy.
But the effect seen in some of your tests is much larger than the expected one. A possible explanation is that sometimes you also hit memory page boundaries, or something like that.
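As a back-of-the-envelope illustration of the cache-line argument (assuming 64-byte cache lines; the block size is made up and not taken from BLASFEO internals):

#include <cstddef>
#include <cstdio>

int main()
{
    std::size_t const line = 64;     // assumed cache-line size in bytes
    std::size_t const bytes = 320;   // e.g. a 320-byte block of matrix data
    // Start address aligned to 64 bytes: the block spans ceil(320/64) = 5 lines.
    std::size_t const lines_aligned = (bytes + line - 1) / line;
    // Start aligned to 32 but not 64 bytes: the block straddles one extra line.
    std::size_t const lines_misaligned = (bytes + 32 + line - 1) / line;
    std::printf("%zu vs %zu cache lines\n", lines_aligned, lines_misaligned);   // prints "5 vs 6"
}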
The same behavior is observed with the alignment of 0x40 bytes.
"Sometimes you also hit memory page boundaries" would also occur in the BM_gemm_cblas, but we don't see such effect.
That depends on the implementation. E.g., OpenBLAS typically packs all data into internally allocated buffers (likely properly aligned), so unaligned data is accessed only once.
Conversely, BLASFEO (and especially the BLASFEO API) operates natively on some or all of the matrix arguments, so unaligned data is accessed more times.
But do you agree that the described behavior of BLASFEO should be fixed?
You can try out MKL instead of OpenBLAS, for the variants 'NN' and 'NT' of the Fortran BLAS API (so check which the corresponding variants are through the CBLAS API).
Depending on the outcome, we can speak about "fixes".
Well, now that we know about this issue, one cannot say anything meaningful about BLASFEO performance anymore. Instead of saying "this function executes in 30 ns", one should say "this function executes in between 30 and 300, or maybe 500 ns, depending on where you allocate memory, and we don't know how it depends on that".
Before undertaking a complete reimplementation of BLASFEO (since the choices about internal memory formats are at its very foundations), I will investigate the issue personally and with independent benchmarks.
If there is some "bug" to be fixed (i.e. something specific and non-fundamental which triggers the behavior), it will be fixed; but if this only comes from the implementation choices, I guess a note about providing a minimum alignment for maximum performance will have to do.
In some benchmarks I did in the past, I encountered some performance degradation in MKL if memory is not aligned, so if MKL is good with that, I'm good too ;)
Thanks for the deep investigation of the issue, and I'll keep you updated on the next steps!
Results with MKL BLAS (Fortran API, although I don't see why the Fortran API vs the CBLAS API would make a difference):
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
BM_gemm_blas/2/2/2_mean 155 ns 155 ns 5
BM_gemm_blas/2/2/2_median 155 ns 155 ns 5
BM_gemm_blas/2/2/2_stddev 0.037 ns 0.036 ns 5
BM_gemm_blas/3/3/3_mean 162 ns 162 ns 5
BM_gemm_blas/3/3/3_median 162 ns 162 ns 5
BM_gemm_blas/3/3/3_stddev 0.023 ns 0.020 ns 5
BM_gemm_blas/5/5/5_mean 186 ns 186 ns 5
BM_gemm_blas/5/5/5_median 186 ns 186 ns 5
BM_gemm_blas/5/5/5_stddev 0.018 ns 0.019 ns 5
BM_gemm_blas/10/10/10_mean 265 ns 265 ns 5
BM_gemm_blas/10/10/10_median 265 ns 265 ns 5
BM_gemm_blas/10/10/10_stddev 0.018 ns 0.018 ns 5
BM_gemm_blas/20/20/20_mean 727 ns 727 ns 5
BM_gemm_blas/20/20/20_median 729 ns 729 ns 5
BM_gemm_blas/20/20/20_stddev 2.33 ns 2.33 ns 5
BM_gemm_blas/30/30/30_mean 1738 ns 1738 ns 5
BM_gemm_blas/30/30/30_median 1738 ns 1738 ns 5
BM_gemm_blas/30/30/30_stddev 1.58 ns 1.59 ns 5
@giaf I would appreciate it if you did independent benchmarking. I have created a separate repo with my benchmark code here: https://github.com/mkatliar/blasfeo_benchmark. Maybe it is a good place to start from and to add your independent benchmarks.
As investigated in https://github.com/mkatliar/blasfeo_benchmark/issues/4#issuecomment-534905197, the issue is due to the presence of denormals in the memory passed to BLASFEO to create the matrices. For matrices whose size is not a multiple of the minimum kernel size (usually 4x4), the matrix padded up to a multiple of 4x4 is actually used for computation, and the extra computed elements are discarded when writing the result from registers into memory. Therefore, if there are denormals in the "dark" part of the memory of the padded matrix, they do not affect the correctness of the result, but they may severely slow down the computations, as in this case.
As to why there are denormals in that memory at all (and consistently across different test machines and OSes), this may be due to what the Google Benchmark library itself does.
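As a sketch of the user-side mitigation this finding suggests (the BLASFEO calls and their signatures are assumed from the API discussed in this thread; posix_memalign is used only as an example of aligned allocation; check everything against blasfeo.h):

#include <blasfeo.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Zero the chunk before wrapping it in a blasfeo_dmat, so that the padding
// ("dark") elements are 0.0 rather than whatever was left in memory.
blasfeo_dmat create_clean_dmat(int m, int n)
{
    std::size_t const size = blasfeo_memsize_dmat(m, n);
    void * mem = nullptr;
    if (posix_memalign(&mem, 0x40, size) != 0)   // 64-byte alignment as recommended
        std::abort();
    std::memset(mem, 0, size);                   // also clears the padding
    blasfeo_dmat mat;
    blasfeo_create_dmat(m, n, &mat, mem);
    return mat;                                  // the caller keeps ownership of mem
}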
It is BLASFEO's responsibility that the matrix structure is properly functional after blasfeo_create_dmat(). The client's responsibility is to allocate the memory and properly set the elements of the matrices.
I don't consider this issue "Closed". To fix it, blasfeo_create_dmat() needs to be changed such that the padding elements are initialized to 0.
We will write a "best practices" note in the BLASFEO repo, but what you propose is not the solution.
If you know what you are doing, you can already put the data in the memory and then pass it to blasfeo_create_dmat, which simply puts a BLASFEO struct around it so that BLASFEO API routines can be called. We use this trick in some of our projects (where I know what to do, since I wrote it). Setting the "padding" memory to zero would break this.
On top of that, blasfeo_create_dmat is very lightweight and does not touch the data at all, also for performance reasons. E.g., in acados, the workspace is allocated once, but BLASFEO matrices are repeatedly created on that memory. Memory allocation happens "off-line", and therefore it is cheap to zero the memory out there, while the creation of the BLASFEO structs also happens "on-line" and therefore must be cheap by design.
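A rough sketch of this allocate-once / create-repeatedly pattern (the BLASFEO calls are assumed from the API discussed in this thread; the surrounding code is only illustrative):

#include <blasfeo.h>
#include <cstddef>
#include <cstdlib>
#include <cstring>

int main()
{
    int const m = 30, n = 30;

    // "Off-line": allocate a 64-byte-aligned workspace once and zero it
    // (which also clears the padding, so no denormals can linger there).
    std::size_t const size = blasfeo_memsize_dmat(m, n);
    void * workspace = nullptr;
    if (posix_memalign(&workspace, 0x40, size) != 0)
        return 1;
    std::memset(workspace, 0, size);

    // "On-line": creating the struct on the pre-allocated memory is cheap;
    // it only sets up pointers and sizes and does not touch the data.
    for (int iter = 0; iter < 1000; ++iter)
    {
        blasfeo_dmat A;
        blasfeo_create_dmat(m, n, &A, workspace);
        // ... fill A (e.g. with blasfeo_pack_dmat) and call BLASFEO routines ...
    }

    std::free(workspace);
}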
If you know what you are doing, you can already put the data in the memory and then pass it to blasfeo_create_dmat, which simply puts a BLASFEO struct around it so that BLASFEO API routines can be called.
"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the blasfeo_dmat
struct and the related functions.
The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her time), you create these potential problems for the sake of vague performance considerations.
"What to do" depends on the internal storage format of the matrices and the panel size. This can change depending on the BLASFEO target architecture and the internal implementation of BLASFEO, which as we know can undergo changes. The layout of the allocated memory is not documented and is unknown to the library user. The user should be abstracted from this knowledge, and this is the purpose of the
blasfeo_dmat
struct and the related functions.
Sure, it is undocumented exactly because it is unsafe. But nonetheless, since we are the developers of BLASFEO, we can use knowledge about its internals in our own software.
The way you do it creates potential problems for potentially many users in many places. Instead of solving the problem in one place and making the user safe (and saving his or her time), you create these potential problems for the sake of vague performance considerations.
Just to be clear, the results computed by the routine are CORRECT. This does not "create potential problems for potentially many users in many places".
Computational speed can be lower if some good-practice rules are not followed. We were aware of slight performance drops in the case of unaligned memory, but they are on the order of a few percent, and therefore not of interest to the normal user.
We were not aware of the large performance penalty (on the order of 10x or more) that can arise from having denormals in the padding memory. Now that we are aware of it, we will add a note to make sure that other users can avoid this issue.
We thank you for your performance issue report and for the effort and help in finding its cause. The time you invested will not be wasted.
Attached is the source code of a benchmark which measures the execution time of the gemm BLAS routine implemented in BLASFEO and in a BLAS library of choice for different matrix sizes: gemm-benchmark.zip
The benchmark uses the Google Benchmark library.
Running the benchmark with the following command line
gives the following output:
One can see that, according to the benchmark, the execution time of the BLASFEO dgemm() varies a lot, whereas the execution time of the openblas implementation is very stable.
BLASFEO target architecture: INTEL_HASWELL. Compiler: g++ (Ubuntu 8.3.0-6ubuntu1) 8.3.0. CPU: Intel(R) Core(TM) i5-6500 @ 3.20GHz (Skylake architecture). OS: Ubuntu Linux 19.04 with the 5.0.0-29-generic kernel.