Matrix multiplication: Nested parallelism

cc @laurae2

On benchmark on dual Xeon Gold 6154 vs MKL:

Warmup: 0.9943 s, result 224 (displayed to avoid compiler optimizing warmup away)

A matrix shape: (M: 2304, N: 2304)
B matrix shape: (M: 2304, N: 2304)
Output shape: (M: 2304, N: 2304)
Required number of operations: 24461.181 millions
Required bytes:                   42.467 MB
Arithmetic intensity:            576.000 FLOP/byte
Theoretical peak single-core:    118.400 GFLOP/s
Theoretical peak multi:         4262.400 GFLOP/s
Make sure to not bench Apple Accelerate or the default Linux BLAS.

Intel MKL benchmark
Collected 100 samples in 0.658 seconds
Average time: 6.211 ms
Stddev  time: 2.274 ms
Min     time: 5.648 ms
Max     time: 28.398 ms
Perf:         3938.203 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

Laser production implementation
Collected 100 samples in 4.067 seconds
Average time: 40.303 ms
Stddev  time: 12.542 ms
Min     time: 35.367 ms
Max     time: 121.945 ms
Perf:         606.927 GFLOP/s

Display output[0] to make sure it's not optimized away
566.68505859375

PyTorch Glow: libjit matmul implementation
Collected 100 samples in 36.837 seconds
Average time: 368.372 ms
Stddev  time: 3.071 ms
Min     time: 362.655 ms
Max     time: 380.193 ms
Perf:         66.403 GFLOP/s

Display output[0] to make sure it's not optimized away
566.6849975585938

According to the paper

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication Smith et al

http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf

Parallelism should be done around jc (dimension nc)

2018-12-26_12-10-01

Note that nc is often 4096 so we might need another distribution scheme.

mratsim / laser

Matrix multiplication: Nested parallelism #9