BUG: matmul (@ overload) and dot significant performance differences for non-contiguous arrays (noblas) #23588

Open merny93 opened 1 year ago

merny93 commented 1 year ago

Describe the issue:

The implementation of matmul does some basic checks to verify whether the arrays can be passed to gemm; if it deems this not possible, it falls back to a noblas routine that has no regard for the memory cache, resulting in ~100x slowdowns. Here is the offending bit of source code.

dot, on the other hand, is a lot more flexible: it attempts to make a copy before passing the array to the BLAS routine. For small arrays the difference is not significant, but for large arrays this results in much better performance. This behavior is explicit in the source code (lines 234 and 243 are the bad-stride copies) and can be confirmed by profiling, as shown by hpaulj.

Neither the dot nor the matmul docs page mentions this behavior. On the contrary, the dot page states: "If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred."

There are a handful of similar issues open:

I am not sure what the best fix is. Updating the docs would be a good start. Perhaps matmul should try to copy too, at least for large arrays?
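For what it's worth, forcing a C-contiguous copy before calling matmul already recovers the BLAS path; a minimal sketch (timings are machine-dependent, and np.ascontiguousarray here plays the role of dot's internal copy):

import numpy as np
from timeit import timeit

N = 2600
x = np.random.randn(N, N)[::2, ::2]  # non-contiguous views
y = np.random.randn(N, N)[::2, ::2]

# strided views: matmul takes the noblas fallback path
slow = timeit('np.matmul(x, y)', number=10, globals=globals())

# contiguous copies first: matmul dispatches to gemm again
fast = timeit('np.matmul(np.ascontiguousarray(x), np.ascontiguousarray(y))',
              number=10, globals=globals())

print(slow, fast)  # the gap is roughly the ~100x reported below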

Reproduce the code example:

import numpy as np
from timeit import timeit
N = 2600
xx = np.random.randn(N, N) 
yy = np.random.randn(N, N) 

x = xx[::2, ::2]
y = yy[::2, ::2]
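# every-other-element views are non-contiguous, so matmul's blasable check fails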
assert np.shares_memory(x, xx)
assert np.shares_memory(y, yy)

dot = timeit('np.dot(x, y)', number=10, globals=globals())
matmul = timeit('np.matmul(x, y)', number=10, globals=globals())

print('time for np.matmul: ', matmul)
print('time for np.dot: ', dot)

Error message:

time for np.matmul:  29.04214559996035
time for np.dot:  0.2680714999441989

Runtime information:

> import sys, numpy; print(numpy.__version__); print(sys.version)
1.24.2
3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)]
> print(numpy.show_runtime())
[{'simd_extensions': {'baseline': ['SSE', 'SSE2', 'SSE3'],
                      'found': ['SSSE3',
                                'SSE41',
                                'POPCNT',
                                'SSE42',
                                'AVX',
                                'F16C',
                                'FMA3',
                                'AVX2'],
                      'not_found': ['AVX512F',
                                    'AVX512CD',
                                    'AVX512_SKX',
                                    'AVX512_CLX',
                                    'AVX512_CNL',
                                    'AVX512_ICL']}},
 {'architecture': 'Prescott',
  'filepath': 'C:\\Users\\______\\AppData\\Local\\Programs\\Python\\Python311\\Lib\\site-packages\\numpy\\.libs\\libopenblas64__v0.3.21-gcc_10_3_0.dll',
  'internal_api': 'openblas',
  'num_threads': 24,
  'prefix': 'libopenblas',
  'threading_layer': 'pthreads',
  'user_api': 'blas',
  'version': '0.3.21'}]
None

Context for the issue:

The scientific computing community regularly uses the @ shorthand on large arrays. Developers expect that with NumPy they do not need to think about the CPU cache, and that operations run close to CPU compute limits rather than memory-bandwidth limits.

At the very least, a warning should be issued so that users know why their code is running orders of magnitude slower than expected.

rgommers commented 1 year ago

Thanks for the detailed report @merny93!

I am not sure what the best fix is. Updating the docs would be a good start. Perhaps matmul should try to copy too? at least for large arrays...

The difference is so large that I'd say the matmul implementation should be changed to match dot here. Documenting this is probably not as helpful; it would only be useful if we figure out that we can't fix this on a reasonable timescale and have to ask users to make the copy themselves to avoid the performance penalty.

johnstill commented 1 year ago

This is a large problem for me as well, not because of caching but because of threading. I work with large matrices on multi-core machines, and sometimes np.matmul properly uses all my cores (having delegated to MKL) while sometimes it uses only one core (the noblas fallback is apparently single-threaded); this has seriously bad implications for the performance of my code. I can personally just default to np.dot instead of np.matmul, but I'm seeing the @ operator more and more in the code of my dependencies, where I can't easily defend against this fallback behavior.

EDIT: By "bad implications for the performance of my code" I mean I've observed exactly this happening while trying to run research code, which is what brought me to the bug tracker to search for why matmul seems to be single-threaded even though I have a good multi-core BLAS installed.
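For code I control, a defensive wrapper along these lines works around the fallback (a sketch with an illustrative name, assuming the conservative contiguity check below is a good-enough proxy for matmul's blasable test; it always pays the copy cost for non-contiguous inputs):

import numpy as np

def blas_matmul(a, b):
    # Hypothetical user-side workaround: copy badly-strided 2-D inputs
    # so matmul can dispatch to gemm. Does not help with @ used inside
    # third-party dependencies.
    if a.ndim == 2 and not (a.flags['C_CONTIGUOUS'] or a.flags['F_CONTIGUOUS']):
        a = np.ascontiguousarray(a)
    if b.ndim == 2 and not (b.flags['C_CONTIGUOUS'] or b.flags['F_CONTIGUOUS']):
        b = np.ascontiguousarray(b)
    return np.matmul(a, b)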

ganesh-k13 commented 1 year ago

I was taking a look at this, hoping to fix it. Based on the call graph, we hit this piece of code, which has a comment: https://github.com/numpy/numpy/blob/6f55bbf049db3fd50994a97ec66665f5685ec5be/numpy/core/src/umath/matmul.c.src#L492-L497

@mattip (original commit via gh-12365): if we make the array C-contiguous before checking is_blasable2d, would that fix the issue? That said, I could not find how dot makes its inputs C-contiguous. Does @array_function_from_c_func_and_dispatcher make a copy before the dispatch?

xor2k commented 1 year ago

> I was taking a look at this, hoping to fix it. Based on the call graph, we hit this piece of code, which has a comment:
>
> https://github.com/numpy/numpy/blob/6f55bbf049db3fd50994a97ec66665f5685ec5be/numpy/core/src/umath/matmul.c.src#L492-L497
>
> @mattip (original commit via gh-12365): if we make the array C-contiguous before checking is_blasable2d, would that fix the issue? That said, I could not find how dot makes its inputs C-contiguous. Does @array_function_from_c_func_and_dispatcher make a copy before the dispatch?

Good research so far! dot makes the array C-contiguous by calling PyArray_NewCopy: https://github.com/numpy/numpy/blob/6f55bbf049db3fd50994a97ec66665f5685ec5be/numpy/core/src/common/cblasfuncs.c#L233-L241

I think the most straightforward approach would be to just try to make a temporary copy in matmul.c.src, around line 497, and only if that copy fails (due to not enough memory) resort to the matmul_inner_noblas call. However, due to the structure of ufuncs (as far as I understand them), the references to the original PyArray objects are gone at that point, whereas in the implementation of e.g. cblas_matrixproduct they would still be available. So one would need to either

  1. construct a new PyArray in matmul.c.src, around line 497 (e.g. with PyArray_NewFromDescr) or
  2. resort to a more low-level approach, like a manual malloc plus loops that make the strides C-contiguous, or
  3. move the is_blasable2d check up to a function call where the array pointers are still available or
  4. pass the array pointers down somehow to do it similarly to cblas_matrixproduct

Did I get that right so far?
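(For concreteness, here is what the stride situation looks like from Python; np.ascontiguousarray does essentially what PyArray_NewCopy does in the dot path:)

import numpy as np

xx = np.random.randn(6, 6)    # float64: strides (48, 8)
x = xx[::2, ::2]              # every-other-element view

print(x.strides)                  # (96, 16): inner stride != itemsize
print(x.flags['C_CONTIGUOUS'])    # False, so the blasable check fails

xc = np.ascontiguousarray(x)      # fresh C-contiguous buffer
print(xc.strides)                 # (24, 8): packed rows, unit inner stride
print(xc.flags['C_CONTIGUOUS'])   # True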

seberg commented 1 year ago

You can just return an error if the copy fails due to not enough memory. But you do need to allocate the memory in the inner loop there (for each call of the inner loop).

  1. is the only workable option of those. There should be some copy functions in the iterator code that could be re-used (one would have to search a bit).

The only alternative might be to check whether we can force the contiguity in the iterator. But that would probably need some awkward iterator flags, so I am not sure how worthwhile it is.

mattip commented 1 year ago

The last line in the comment of gh-12365 agrees with @seberg that the inner loop should do the copying:

> We could adopt the strategy from linalg where the inner loops copy data if needed, rather than at the entrance

I think this would be a non-trivial undertaking.