kaleidicassociates / lubeck

High level linear algebra library for Dlang
http://lubeck.libmir.org/
Boost Software License 1.0
66 stars · 14 forks

error in the s/dger case (mx1 vs 1xn) in mtimes #8

Closed · ShigekiKarita closed 6 years ago

ShigekiKarita commented 6 years ago

I ran into the following errors in this special case. I think using ger inside mtimes is the best fix.

unittest
{
    import std.stdio;
    import mir.ndslice : sliced, slice, transposed, uninitSlice;
    import lubeck : mtimes;
    auto a = [1.0, 2.0].sliced(2, 1);
    auto b = [1.0, 2.0].sliced(2, 1);
    auto bt = b.transposed;

    // without slice
    // (linked with cblas) ** On entry to cblas_dgemm parameter number 11 had an illegal value
    // (linked with mkl_rt) Intel MKL ERROR: Parameter 11 was incorrect on entry to cblas_dgemm.
    writeln(mtimes(a, bt)); // [[6.911e-310, 9.85365e-317], [0, 0]]

    // but OK with slice
    assert(mtimes(a, bt.slice) == [[1, 2], [2, 4]]);

    // also I think mir.blas.ger is nice when a.length!1 == 1 && b.length!0 == 1
    import mir.blas : ger;
    auto c = uninitSlice!double(a.length!0, bt.length!1);
    c[] = 0;
    ger(1.0, a[0..$,0], b[0..$,0], c);
    assert(c == [[1, 2], [2, 4]]);
}

I think this is related to the LDB (the 11th argument) specification http://www.netlib.org/lapack/explore-html/d1/d54/group__double__blas__level3_gaeda3cbd99c8fb834a60a6412878226e1.html#gaeda3cbd99c8fb834a60a6412878226e1

       On entry, LDB specifies the first dimension of B as declared
       in the calling (sub) program. When  TRANSB = 'N' or 'n' then
       LDB must be at least  max( 1, k ), otherwise  LDB must be at
       least  max( 1, n ).

where B.shape == [k, n]. For the transposed bt above, k == 1, so the leading dimension derived from the slice's stride is 1, while dgemm requires LDB >= max(1, n) == 2 here; that is presumably what triggers the parameter-11 error.
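Since the failing shape is exactly m×1 times 1×n, i.e. a rank-1 update, the dispatch this issue proposes could look roughly like the sketch below. This is a minimal illustration only: `mtimesSketch` is a hypothetical helper (not lubeck's actual code), and the mir-blas signatures `ger(alpha, x, y, c)` and `gemm(alpha, a, b, beta, c)` are assumed.

```d
import mir.blas : gemm, ger;
import mir.ndslice;

// Hypothetical helper, not lubeck API: route the k == 1 case to ger.
auto mtimesSketch(A, B)(A a, B b)
{
    auto c = uninitSlice!double(a.length!0, b.length!1);
    if (a.length!1 == 1)
    {
        // m x 1 times 1 x n is an outer product: c = a[.., 0] * b[0, ..]^T
        c[] = 0;
        ger(1.0, a[0 .. $, 0], b[0, 0 .. $], c);
    }
    else
    {
        // General case: c = 1 * a * b + 0 * c
        gemm(1.0, a, b, 0.0, c);
    }
    return c;
}
```

The branch avoids ever handing a k == 1 transposed view to gemm, so the LDB constraint never comes into play.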

P.S.

The following transposed B (with 1 < k < n) seems to be OK, so I think only the k == 1 case has the problem.

unittest
{
    import mir.ndslice : sliced, transposed;
    import lubeck : mtimes;
    auto a = [1.0, 2.0,
              3.0, 4.0,
              5.0, 6.0].sliced(3, 2);
    auto b = [1.0, 2.0,
              3.0, 4.0,
              5.0, 6.0].sliced(3, 2);
    auto bt = b.transposed;
    auto result = [[5, 11, 17], [11, 25, 39], [17, 39, 61]];
    // OK
    assert(mtimes(a, bt) == result);
}
ShigekiKarita commented 6 years ago

@9il Thank you for the fast merge! But I still have a question about my ger example for the recurrent neural network backpropagation algorithm.

git clone https://github.com/ShigekiKarita/numir-char-rnn.git
git checkout -b  <branch-name> origin/<branch-name>
git submodule update --init --recursive # lubeck is submoduled
dub clean --all-packages
dub run --compiler=ldc2 -b=release-nobounds
dub run --compiler=ldc2 -b=release-nobounds

using lubeck.mtimes (ger disabled)

https://github.com/ShigekiKarita/numir-char-rnn/blob/mir-blas-gemm/source/app.d#L63-L72 dub run --compiler=ldc2 -b=release-nobounds 59.09s user 0.94s system 395% cpu 15.180 total

using lubeck.mtimes (ger enabled)

https://github.com/ShigekiKarita/numir-char-rnn/blob/mir-blas-ger/source/app.d#L63-L72 dub run --compiler=ldc2 -b=release-nobounds 46.70s user 0.59s system 395% cpu 11.968 total

using direct call of mir.blas.ger

https://github.com/ShigekiKarita/numir-char-rnn/blob/master/source/app.d#L84-L94 dub run --compiler=ldc2 -b=release-nobounds 21.89s user 0.20s system 390% cpu 5.661 total

Do you have any comments on why ger inside mtimes is so much slower than calling ger directly?

9il commented 6 years ago

@ShigekiKarita

  1. gemm vs ger: BLAS operations are not optimized for cases like k == 1, so mtimes performs a full gemm even when it could call ger. Another possible optimization is to call gemv when only one of the matrices is a vector.
  2. Lubeck vs mir-blas/mir-lapack: Lubeck allocates memory for the result. That memory is probably not in the CPU cache. So, if you preallocate all memory and use only mir-blas/mir-lapack (and avoid D's hashmaps), your algorithm will be faster. Both Python and Lubeck use an internal BLAS implementation, so if you want to beat Python you need to preallocate memory, which is not possible in numpy.

  3. Take a look at this line:

auto dh = mtimes(params["Why"].transposed, dy).slice; 

There are two memory allocations here: the first one is mtimes, the second one is slice.
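Following that advice, the allocation-free version of this step could be sketched as below. The names `whyT`, `dy`, and `dh` are illustrative, and mir-blas's `gemm(alpha, a, b, beta, c)` is assumed to compute `c = alpha * a * b + beta * c` in place.

```d
import mir.blas : gemm;
import mir.ndslice;

// Preallocate once, outside the training loop (shapes are illustrative).
auto whyT = params["Why"].transposed;
auto dh = uninitSlice!double(whyT.length!0, dy.length!1);

// Inside the loop: no GC allocation, the result is written in place.
gemm(1.0, whyT, dy, 0.0, dh);
```

Hoisting the buffer out of the loop removes both allocations from the hot path, which is the gap the timings above suggest.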

ShigekiKarita commented 6 years ago

Thanks, I see. Do you have a plan to add an optional argument to store the result in lubeck? As we can see in numpy and torch, it is good for the preallocation strategy.

jmh530 commented 6 years ago

@9il The mtimes documentation could be adjusted to say "General matrix-matrix multiplication. Allocates result to an uninitialized slice using GC."

I don't know how many of the other functions that will be in lubeck will need to think about these sorts of issues. It probably makes sense to create an Issue for further discussion on pre-allocation or the use of alternate allocators.

9il commented 6 years ago

> Thanks, I see. Do you have a plan to add an optional argument to store the result in lubeck? As we can see in numpy and torch, it is good for the preallocation strategy.

The gemm wrapper for ndslice from mir-blas does the same.

> @9il The mtimes documentation could be adjusted to say "General matrix-matrix multiplication. Allocates result to an uninitialized slice using GC."

A PR is welcome. At the same time, this is a general Lubeck concept. Lubeck was originally created to port a commercial Matlab library to D. Similarity, readability, and simplicity were the key features. Speed was too, but it had already been improved more than a hundred times compared with the original Matlab code.

> I don't know how many of the other functions that will be in lubeck will need to think about these sorts of issues. It probably makes sense to create an Issue for further discussion on pre-allocation or the use of alternate allocators.

Currently @EmTee70 is working on matrix classes that can hold different payloads (like symmetric, diagonal, and other matrices) and overload the "*" operator.

I have thought a lot about an RC-based separate Matrix type system with clever expressions that would feel like Julia or Matlab. Something like this:

Assume it can hold different types of matrices:

  1. dense
  2. symmetric
  3. diagonal
  4. sparse
  5. tridiagonal

and is clever enough, for example, to fuse at run time an expression like

Mat C = alpha * J * J.t - beta * B * R;

into two BLAS (openblas) calls:

syrk (symmetric rank-k update) for

`C = alpha * J * J.t`

and then

gemm (general matrix-matrix multiplication) for

`C -= beta * B * R`
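Assuming the standard BLAS semantics for syrk (`C = alpha * A * A^T + beta * C`) and gemm (`C = alpha * A * B + beta * C`), the fused lowering would be two in-place calls. This is a sketch only: the routine names are the classic BLAS ones, not an existing lubeck API.

```d
// Hypothetical lowering of  Mat C = alpha * J * J.t - beta * B * R;
// Step 1, syrk: C = alpha * J * J^T  (beta = 0 overwrites C)
syrk(alpha, J, 0.0, C);
// Step 2, gemm: C = -beta * B * R + 1 * C  (the subtraction is fused in)
gemm(-beta, B, R, 1.0, C);
```

No temporary matrices are materialized; the subtraction rides on gemm's beta/alpha parameters.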

Plus, it would be able to solve linear systems using LAPACK:

B /= A;

@Laeeth, @firoozye, possibly it may be a good concept for math programming in D.

jmh530 commented 6 years ago

@9il Submitted PR.

I'm glad progress is being made on the general matrix classes. Would Jean-Louis Leroy's open multi-methods library be useful for these run-time features? He has Matrix examples, but I don't see ones that implement operator overloading.