libmir / mir-glas

[Experimental] LLVM-accelerated Generic Linear Algebra Subprograms

[AMD] mir-glas is slower than OpenBLAS for DGEMM #20

Open MigMuc opened 7 years ago

MigMuc commented 7 years ago

I successfully compiled the benchmark gemm_report.d provided by mir-glas and ran it twice: once comparing against OpenBLAS and once against ACML-5.3.1. As you can see from the benchmarks, mir-glas does not reach full performance for large matrices. Peak performance for my machine is about 23 GFLOPS in double precision. ACML does not reach full performance in this benchmark either, so I decided to compare with the dgemm.goto and dgemm.acml benchmark programs provided in OpenBLAS/benchmark. There, ACML reaches peak performance too. Is there any overhead calling ACML from D?

[Figure: dgemm_bench print]

MigMuc commented 7 years ago

I have llvm version 3.9.1 installed.

9il commented 7 years ago

Hey @MigMuc,

> Is there any overhead calling ACML from D?

No, only the CBLAS function cblas_dgemm is called.

I have never tested GLAS on AMD CPUs. It would be awesome to have benchmarks for AMD. Benchmarks can be posted on the blog.

Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?

Possible factors that may influence performance:

  1. Computation kernel structure.
  2. CPU cache usage by BLAS and other programs. You may want to close the web browser and other programs to get correct benchmarks.
  3. Matrix transposition.
  4. Strange thermal behaviour.

Let's start with the computation kernels to optimize GLAS.

OpenBLAS uses sgemm_kernel_16x2_piledriver. This is strange because this kernel does not use YMM registers, only XMM registers. Maybe YMM operations on Piledriver are emulated on top of XMM?

To see the GLAS DGEMM kernel, compile this gist with the -output-s flag. A command line example is in the first line. The example is for SGEMM; replace float[8] with double[4] to generate the DGEMM kernel.


MigMuc commented 7 years ago

Hi @9il,

> Is AMD FX(TM)-4300 @ 3.8 GHz your CPU?

Yes, it has a Piledriver core. So in order to compile gemm_micro_kernel.d I used the -mcpu=bdver2 flag after replacing

export extern(C)
auto dot_reg_basic_generic(
    const(__vector(float[8])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(float[8])[2][1][6] c,
)
{
    return dot_reg_basic(a, b, length, c);
}

with

export extern(C)
auto dot_reg_basic_generic(
    const(__vector(double[4])[2][1])* a,
    const(float[1][6])* b,
    size_t length,
    ref __vector(double[4])[2][1][6] c,
)
{
    return dot_reg_basic(a, b, length, c);
}

I got the following result:


9il commented 7 years ago

Please replace float with double for b as well.

MigMuc commented 7 years ago


RoyiAvital commented 7 years ago

Can one use mir-glas on Windows for C/C++ projects using Visual Studio?

9il commented 7 years ago

@RoyiAvital, yes. It has C headers. Note that it is single-threaded for now.

RoyiAvital commented 7 years ago

@9il, I'm interested in a linear algebra library for small matrices, so I'm OK with a single-threaded implementation for now.

Is there a guide, or are there examples, showing how to use it from C code under Windows?

Thank You.

9il commented 7 years ago

@RoyiAvital ,

  1. Build the library using the dub package manager.
  2. Add it to your project as an ordinary C library.
  3. Include the headers in your project.

See also the examples folder.

MigMuc commented 7 years ago

I spent some time doing benchmark tests and here they are:

[Figures: bench_sgemm, bench_dgemm, bench_cgemm, bench_zgemm]

RoyiAvital commented 7 years ago

@MigMuc, could you please add labels to the axes? I'm not sure whether higher or lower is better.

Thank You.

MigMuc commented 7 years ago

As you can see, the performance varies quite a bit. In particular, AMD's own ACML is really weak in single-precision complex performance, where GLAS is the best. But there are two cases where GLAS could be substantially improved, namely the single- and double-precision real cases.

Regarding the implementation of gemm in GLAS, as far as I can see there are a few lines in glas/internal/gemm.d:

auto re = s[0] * reg[n][0][m];
auto im = s[0] * reg[n][1][m];
re -= s[1] * reg[n][1][m];
im += s[1] * reg[n][0][m];
reg[n][0][m] = re;
reg[n][1][m] = im;

Is this the 1m implementation from BLIS for complex arithmetic? I would like to test some blocking parameters, for example testing the blocking like in Where can I set these parameters? Do you have any suggestions about how to proceed?

RoyiAvital commented 7 years ago

Any chance of having Intel MKL there as well?

Thank You.

MigMuc commented 7 years ago

This is an AMD CPU so I guess Intel MKL would not be optimized for this case. Probably it would work on this machine but I don't have MKL installed.

MigMuc commented 7 years ago

@RoyiAvital: BTW, do you have any benchmarks you could provide? It would be great to have some comparisons with Intel CPUs as well.

RoyiAvital commented 7 years ago

I have done some Intel MKL vs. OpenBLAS comparisons using MATLAB and Julia.

Have a look at Benchmark MATLAB & Julia for Matrix Operations.

But now I'm mostly interested in small matrices (Up to ~1000 elements) performance.

MigMuc commented 7 years ago

Some time ago I did some benchmark testing with gemm. I would like to debug gemm_example.d in the examples folder in order to see the blocking sizes for this particular CPU as calculated from the mir-cpuid package, and compare them with the blocking sizes of OpenBLAS and BLIS. Therefore I changed the build type from --build=target-native to --build=debug in the dub.json file. But then I get linker errors:

The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-algorithm 0.6.13: target for configuration "library" is up to date.
mir-cpuid 0.5.2: target for configuration "library" is up to date.
gemm_example ~master: building configuration "application"...
Running pre-build commands...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "debug" build using ldmd2 for x86_64.
mir-glas 0.2.3: building configuration "static"...
Compiling ../source/glas/precompiled/context.d...
Compiling ../source/glas/precompiled/l1d.d...
Compiling ../source/glas/precompiled/l1s.d...
Compiling ../source/glas/precompiled/l1c.d...
Compiling ../source/glas/precompiled/l1z.d...
Compiling ../source/glas/precompiled/l3c.d...
Compiling ../source/glas/precompiled/l3d.d...
Compiling ../source/glas/precompiled/l3s.d...
Compiling ../source/glas/precompiled/l3z.d...
Compiling ../source/glas/precompiled/utility.d...
The determined compiler type "ldc" doesn't match the expected type "dmd". This will probably result in build errors.
Performing "release-nobounds" build using ldmd2 for x86_64.
mir-cpuid 0.5.2: building configuration "library"...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/amd.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/common.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/unified.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/intel.d...
Compiling ../../../../../.dub/packages/mir-cpuid-0.5.2/mir-cpuid/source/cpuid/x86_any.d...
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas//libmir-glas.a(../.dub/build/static-debug-linux.posix-x86_64-ldc_2074-68AAD8DD4EB442FD2FE09072820FEAE2/home.miguel.Dokumente.DLang.mir-glas-0.2.3.mir-glas.source.glas.precompiled.context.d.o): In function `_D4glas11precompiled7context6memoryFNbNimZAv':
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:120: warning: undefined reference to `_D4glas8internal6memory10deallocateFNbNiAvZb'
/home/miguel/Dokumente/DLang/mir-glas-0.2.3/mir-glas/examples/../source/glas/precompiled/context.d:121: warning: undefined reference to `_D4glas8internal6memory15alignedAllocateFNbNiNemkZAv'
collect2: error: ld returned 1 exit status
Error: /usr/bin/gcc failed with status: 1
ldmd2 failed with exit code 1.

What can I do in order to compile the whole package with debug info?

9il commented 7 years ago

The GLAS build system was created with the assumption that it is always built in release mode. Half of the files are never compiled on their own because all of their functions are marked as always inlined.

I recommend using C's printf to find the required information, or fixing the build configuration so that the required files are compiled and linked.