Open MigMuc opened 7 years ago
I have llvm version 3.9.1 installed.
Hey @MigMuc,
Is there any overhead calling ACML from D?
No, only the CBLAS function cblas_dgemm is called.
I have never tested GLAS on AMD CPUs. It would be awesome to have benchmarks for AMD. Benchmarks can be posted on the blog https://github.com/libmir/blog.
Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?
Possible factors that may influence performance:
Let's start with the computation kernels to optimize GLAS.
OpenBLAS uses sgemm_kernel_16x2_piledriver. This is strange because this kernel does not use YMM registers, only XMM registers. Maybe YMM registers on Piledriver are emulated on top of XMM?
To see the GLAS DGEMM kernel, compile this gist with the -output-s flag. A command-line example is in the first line. The example is for SGEMM; replace float[8] with double[4] to generate the DGEMM kernel.
Thanks!
Hi @9il,
Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?
Yes, it has a Piledriver core.
So in order to compile gemm_micro_kernel.d I used the -mcpu=bdver2 flag after exchanging
export extern(C)
auto dot_reg_basic_generic(
    const(__vector(float[8])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(float[8])[2][1][6] c,
    )
{
    return dot_reg_basic(a, b, length, c);
}
with
export extern(C)
auto dot_reg_basic_generic(
    const(__vector(double[4])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(double[4])[2][1][6] c,
    )
{
    return dot_reg_basic(a, b, length, c);
}
I got the following result:
Please replace float with double for b
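For readers unfamiliar with register-blocked kernels, here is a rough scalar sketch in C of what a micro-kernel with these argument shapes might compute. It is an assumption based only on the types in the signature above (two double[4] vectors per row block, six columns); the name micro_kernel_ref and the constants MR and NR are hypothetical, and the real dot_reg_basic may differ, for example in whether it accumulates into c or overwrites it.

```c
#include <stddef.h>

/* Hypothetical block sizes read off the signature above:
   MR = 2 vectors of double[4] = 8 rows, NR = 6 columns. */
enum { MR = 8, NR = 6 };

/* Scalar reference: for each of `length` steps, add the rank-1 update
   a[p][i] * b[p][j] to the register block c[j][i].
   This sketch accumulates into c; that is an assumption. */
void micro_kernel_ref(const double (*a)[MR], const double (*b)[NR],
                      size_t length, double c[NR][MR])
{
    for (size_t p = 0; p < length; ++p)
        for (size_t j = 0; j < NR; ++j)
            for (size_t i = 0; i < MR; ++i)
                c[j][i] += a[p][i] * b[p][j];
}
```

Note that a and b share one element type here, which is why b has to change from float to double together with a and c.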
Can one use mir-glas on Windows for C/C++ projects using Visual Studio?
@RoyiAvital, yes. It has C headers. Note that it is single-threaded for now.
@9il, I'm interested in a linear algebra library for small matrices. Hence I'm OK, for now, with a single-threaded implementation.
Is there a guide or are there examples on how to use it from C code under Windows?
Thank You.
@RoyiAvital, see also the examples folder.
I spent some time doing benchmark tests and here they are:
@MigMuc, Could you please add label for the axis? I'm not sure if higher or lower is better.
Thank You.
As you can see, the performance varies quite a bit. In particular, AMD's own ACML is really weak on single-precision complex performance, where GLAS is the best. But there are two cases where GLAS could be substantially improved, namely the real single- and double-precision cases.
Regarding the implementation of gemm in GLAS: as far as I can see, there are a few lines in glas/internal/gemm.d:
auto re = s[0] * reg[n][0][m];
auto im = s[0] * reg[n][1][m];
re -= s[1] * reg[n][1][m];
im += s[1] * reg[n][0][m];
reg[n][0][m] = re;
reg[n][1][m] = im;
Is this the 1m implementation from BLIS for complex arithmetic? I would like to test some blocking parameters, for example the blocking used in https://github.com/xianyi/OpenBLAS/blob/develop/kernel/x86_64/dgemm_kernel_6x4_piledriver.S. Where can I set these parameters? Do you have any suggestions about how to proceed?
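The lines quoted from gemm.d read like an ordinary complex multiplication with real and imaginary parts held in separate registers: (a + bi)(c + di) = (ac - bd) + (ad + bc)i. Below is a minimal C sketch of that arithmetic, assuming s holds the scalar's real and imaginary parts and reg holds the accumulated values. The type cplx and the function cmul are hypothetical names for illustration. Whether this corresponds to the 1m method from BLIS is not settled in this thread.

```c
/* Split-format complex number: real and imaginary parts stored separately. */
typedef struct { double re, im; } cplx;

/* Returns s * x using four real multiplications,
   in the same order as the lines quoted above. */
cplx cmul(cplx s, cplx x)
{
    double re = s.re * x.re;
    double im = s.re * x.im;
    re -= s.im * x.im;
    im += s.im * x.re;
    cplx r = { re, im };
    return r;
}
```

For example, multiplying 1 + 2i by 3 + 4i this way gives -5 + 10i.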
Any chance of having Intel MKL there as well?
Thank You.
This is an AMD CPU so I guess Intel MKL would not be optimized for this case. Probably it would work on this machine but I don't have MKL installed.
@RoyiAvital: BTW, do you have any benchmarks you could provide? It would be great to have some comparisons with Intel CPUs as well.
I have done some Intel MKL vs. OpenBLAS benchmarks using MATLAB and Julia.
Have a look at Benchmark MATLAB & Julia for Matrix Operations.
But now I'm mostly interested in the performance of small matrices (up to ~1000 elements).
Some time ago I did some benchmark testing with gemm. I would like to debug gemm_example.d in the examples folder in order to find the blocking sizes for this particular CPU as calculated by the mir-cpuid package, and compare them with the blocking sizes of OpenBLAS and BLIS. Therefore I changed the build type from --build=target-native to --build=debug in the dub.json file. But then I get linker errors:
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-algorithm 0.6.13: target for configuration "library" is up to date.
mir-cpuid 0.5.2: target for configuration "library" is up to date.
gemm_example ~master: building configuration "application"...
Running pre-build commands...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-glas 0.2.3: building configuration "static"...
Compiling ../source/glas/precompiled/context.d...
Compiling ../source/glas/precompiled/l1d.d...
Compiling ../source/glas/precompiled/l1s.d...
Compiling ../source/glas/precompiled/l1c.d...
Compiling ../source/glas/precompiled/l1z.d...
Compiling ../source/glas/precompiled/l3c.d...
Compiling ../source/glas/precompiled/l3d.d...
Compiling ../source/glas/precompiled/l3s.d...
Compiling ../source/glas/precompiled/l3z.d...
Compiling ../source/glas/precompiled/utility.d...
Linking...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "release-nobounds" build using ldmd2 for x86_64.
mir-cpuid 0.5.2: building configuration "library"...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/amd.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/common.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/unified.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/intel.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/x86_any.d...
Linking...
Linking...
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas//libmir-glas.a(../.dub/build/static-debug-linux.posix-x86_64-ldc_2074-68AAD8DD4EB442FD2FE09072820FEAE2/home.miguel.Dokumente.DLang.mir-glas-0.2.3.mir-glas.source.glas.precompiled.context.d.o): In function '_D4glas11precompiled7context6memoryFNbNimZAv':
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:120: warning: undefined reference to '_D4glas8internal6memory10deallocateFNbNiAvZb'
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:121: warning: undefined reference to '_D4glas8internal6memory15alignedAllocateFNbNiNemkZAv'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1
ldmd2 failed with exit code 1.
What can I do in order to compile the whole package with debug info?
The GLAS build system was created with the assumption that it is always built in release mode. Half of the files are never compiled on their own because all of their functions are marked as always inlined.
I recommend using C's printf to find the required information, or fixing the build configuration so that the required files are compiled and linked.
I successfully compiled the benchmark gemm_report.d provided by mir-glas. I ran it twice, once comparing with OpenBLAS and once comparing against ACML-5.3.1. As you can see from the benchmarks, mir-glas does not reach full performance for large matrices. Peak performance for my machine is about 23 GFLOPs for double precision. But ACML does not achieve full performance either. So I decided to compare with the dgemm.goto and dgemm.acml benchmark programs provided in OpenBLAS/benchmark. Here ACML reaches peak performance too. Is there any overhead calling ACML from D?
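To compare the numbers from different benchmark programs, it can help to compute the rate by hand. The sketch below uses the usual GEMM operation count of about 2·m·n·k floating-point operations for an m×k by k×n product. The function names are hypothetical, and the timing itself is assumed to come from whatever benchmark is being run.

```c
/* Approximate GFLOP/s for one GEMM call: C = alpha*A*B + beta*C,
   with A of size m x k and B of size k x n, costs about 2*m*n*k flops. */
double gemm_gflops(double m, double n, double k, double seconds)
{
    return 2.0 * m * n * k / (seconds * 1e9);
}

/* Fraction of a given peak rate (e.g. the ~23 GFLOPs quoted above). */
double fraction_of_peak(double gflops, double peak_gflops)
{
    return gflops / peak_gflops;
}
```

For instance, a 1000×1000×1000 double-precision product finishing in 0.125 s corresponds to about 16 GFLOP/s, roughly 70% of a 23 GFLOP/s peak.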