Open lesshaste opened 4 years ago
@francoislauger also reports that:
It seems that MKL now has an integer matrix multiplication function: https://software.intel.com/en-us/mkl-developer-reference-c-cblas-gemm-1 Someone here thinks it has been available since MKL 2018: https://in.mathworks.com/matlabcentral/answers/478716-mkl-2018-supposedly-supports-integer-matrix-multiplication-can-this-feature-be-added-to-matlab I don't know whether NumPy can use specific MKL functions when they are available.
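One part of that last question can at least be checked locally. This is a hedged sketch that prints which BLAS/LAPACK libraries NumPy was built against; on MKL builds the library names mention "mkl" (the exact output format varies by NumPy version):

```python
import numpy as np

# Print NumPy's build configuration, including the BLAS/LAPACK
# libraries it was linked against. On an MKL-backed build the
# library entries contain "mkl"; on OpenBLAS builds, "openblas".
np.show_config()
```

Whether NumPy *uses* a given backend for a particular dtype is a separate question: integer matmul currently never dispatches to BLAS regardless of backend.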
I think it is not impossible, but right now our cblas backend is runtime-switchable, and such a change would not be. So I think we would need to develop some infrastructure first. I would not be surprised if Intel/Anaconda are already thinking about shipping a patched version of NumPy. I am inclined to close this and make sure we have a general tracking issue for such "backend" requests.
For reference, today we use BLAS only for single, double, csingle, and cdouble.
Multiplying two matrices is much slower when they have dtype int64 than when they have dtype float64. This holds even if you include the time to first convert the matrices to float64 and then convert the result back.
Reproducing code example:
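The original reproduction code was not preserved in this copy of the issue. A minimal sketch consistent with the timings below, assuming 1000×1000 matrices (the original size is unknown) and small random entries:

```python
import numpy as np

# Assumed setup: two square int64 matrices with small entries.
# The 1000x1000 size is a guess; the original issue's size was lost.
rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(1000, 1000)).astype(np.int64)
b = rng.integers(0, 100, size=(1000, 1000)).astype(np.int64)

# Slow path: int64 matmul does not go through BLAS.
# %timeit a @ b

# Fast path: round-trip through float64 so BLAS is used.
# %timeit (a.astype(np.float64) @ b.astype(np.float64)).astype(np.int64)
```

For entries this small, both paths produce identical results, since every intermediate product and sum fits well within float64's exact integer range.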
The results on my PC are:
%timeit (a.astype(np.float64) @ b.astype(np.float64)).astype(np.int64)
37 ms ± 948 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit a @ b
2.44 s ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

afloat = a.astype(np.float64)
bfloat = b.astype(np.float64)
%timeit afloat @ bfloat
21.6 ms ± 521 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
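One caveat worth noting about the float64 workaround: it is only exact while every intermediate value fits in float64's 53-bit integer range. A minimal sketch of the failure mode:

```python
import numpy as np

# float64 represents integers exactly only up to 2**53; beyond that,
# the round-trip through float64 silently loses precision.
big = np.array([[2**53 + 1]], dtype=np.int64)
one = np.array([[1]], dtype=np.int64)

exact = big @ one  # int64 path keeps full precision
via_float = (big.astype(np.float64) @ one.astype(np.float64)).astype(np.int64)

print(exact[0, 0])      # 9007199254740993
print(via_float[0, 0])  # 9007199254740992 -- off by one
```

So a BLAS-style fast path for integer dtypes would need to be exact, not just a hidden float conversion.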
Error message:
Numpy/Python version information:
NumPy 1.17.2; Python 3.5.2 (default, Jul 10 2019, 11:58:48) [GCC 5.4.0 20160609]