merny93 opened 1 year ago
Thanks for the detailed report, @merny93!
I am not sure what the best fix is. Updating the docs would be a good start. Perhaps `matmul` should try to copy too? At least for large arrays...
The difference is so large that I'd say the `matmul` implementation should be changed to match `dot` here. Documenting this is probably not as helpful; it would only be useful if we figure out that we can't fix this on a reasonable timescale and have to ask users to do the copy themselves to avoid the performance penalty.
This is a large problem for me as well - not from caching concerns, but from threading. I work with large matrices on multi-core machines, and if sometimes `np.matmul` properly uses all my cores (having delegated to MKL) and sometimes it uses only one core - the noblas fallback is apparently single-threaded - that has seriously bad implications for the performance of my code. I can personally just default to always using `np.dot` instead of `np.matmul`, but I'm seeing the `@` operator more and more in the code of my dependencies, where I don't easily have the ability to defend against this fallback behavior of `np.matmul`.
EDIT: By "bad implications for the performance of my code" I mean I've observed exactly this happening while trying to run research code, which is what brought me to the issue tracker to search for why `matmul` seems to be single-threaded even though I have a good multi-core BLAS installed.
I was taking a look at this, hoping to fix it. Based on the call graph we hit this piece of code, which has a comment: https://github.com/numpy/numpy/blob/6f55bbf049db3fd50994a97ec66665f5685ec5be/numpy/core/src/umath/matmul.c.src#L492-L497

@mattip (original commit via gh-12365), if we make it C-contiguous before checking `is_blasable2d`, would that fix the issue? That said, I could not find how `dot` is making it C-contiguous. Does `@array_function_from_c_func_and_dispatcher` make a copy before the dispatch?
Good research so far! `dot` is making it C-contiguous by calling `PyArray_NewCopy`:

https://github.com/numpy/numpy/blob/6f55bbf049db3fd50994a97ec66665f5685ec5be/numpy/core/src/common/cblasfuncs.c#L233-L241
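In Python terms, the strategy `dot` uses could be sketched roughly as follows (the stride check here is my loose analogue of `is_blasable2d`, not the actual C logic):

```python
import numpy as np

def roughly_blasable_2d(a):
    # Loose analogue (an assumption, not the real C check) of the
    # BLAS-compatibility test: gemm wants the last axis packed
    # (inner stride == itemsize) and sane, aligned row strides.
    s0, s1 = a.strides
    return s1 == a.itemsize and s0 > 0 and s0 % a.itemsize == 0

def dot_style_matmul(a, b):
    # dot's approach (cblasfuncs.c): copy bad-stride inputs into a
    # C-contiguous buffer (PyArray_NewCopy), then hand them to BLAS.
    if not roughly_blasable_2d(a):
        a = a.copy(order="C")
    if not roughly_blasable_2d(b):
        b = b.copy(order="C")
    return np.matmul(a, b)
```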
I think the most straightforward approach could be to just try to make a temporary copy in `matmul.c.src`, around line 497, and only if that copy fails (due to not enough memory) resort to the `matmul_inner_noblas` call. However, due to the structure of `ufunc` (as far as I understand it), the references to the original PyArray objects - which would still be available in the implementation of e.g. `cblas_matrixproduct` - are gone. So one would either need to:
- somehow construct a new `PyArray` in `matmul.c.src`, around line 497 (e.g. with `PyArray_NewFromDescr`), or
- `malloc` a buffer and make the strides C-contiguous with loops, or
- move the `is_blasable2d` check up to a function call where the array pointers are still available, or
- use `cblas_matrixproduct`.

Did I get that right so far?
You can just return an error if the copy fails due to not enough memory. But you do need to allocate the memory in the inner loop there (for each call of the inner loop).
The only alternative might be to check whether we can force the contiguity in the iterator. But that would probably need some awkward iterator flags, so I am not sure how worthwhile it is.
The last line in the comment of gh-12365 agrees with @seberg that the inner loop should be doing the copying: "We could adopt the strategy from linalg where the inner loops copy data if needed, rather than at the entrance."
I think this would be a non-trivial undertaking.
Describe the issue:

The implementation of `matmul` does some basic checks to verify whether the array can be passed to `gemm`, and if it deems that this is not possible it falls back on a `noblas` routine that has no regard for memory cache, resulting in ~100x slowdowns. Here is the offending bit of source code. `dot`, on the other hand, is a lot more flexible, attempting to make a copy before passing the array to the BLAS routine. For small arrays the difference is not significant, but for large arrays this results in much better performance. This behavior is explicitly seen in the source code (lines 234 and 243 are the bad-stride copies) and can be confirmed by profiling, as shown by hpaulj.

Neither of the docs pages for `dot` or `matmul` references this behavior. On the contrary, the `dot` page states: "If both a and b are 2-D arrays, it is matrix multiplication, but using matmul or a @ b is preferred."

There are a handful of similar issues open: gh-23123, gh-23260.
I am not sure what the best fix is. Updating the docs would be a good start. Perhaps `matmul` should try to copy too? At least for large arrays...

Reproduce the code example:
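The original reproduction block did not survive extraction; a minimal sketch of the kind of reproduction described above (a non-contiguous operand pushing `matmul` off the BLAS path) could look like this. Timings are machine- and BLAS-dependent:

```python
import timeit

import numpy as np

n = 500
rng = np.random.default_rng(0)
a = rng.random((n, n))
# Slicing every other column gives a non-contiguous view whose inner
# stride is 2 * itemsize, which fails matmul's is_blasable2d check
# and triggers the slow noblas fallback; dot copies b first instead.
b = rng.random((n, 2 * n))[:, ::2]

t_matmul = timeit.timeit(lambda: np.matmul(a, b), number=3)
t_dot = timeit.timeit(lambda: np.dot(a, b), number=3)
print(f"matmul: {t_matmul:.3f}s  dot: {t_dot:.3f}s")
```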
Error message:
Runtime information:
Context for the issue:
The scientific computing community regularly uses the `@` shorthand on large arrays. Developers expect that with NumPy they do not need to think about CPU cache, and expect operations to run close to CPU limits rather than memory-bandwidth limits. At the very least, a warning should be provided to users so that they know why their code is running orders of magnitude slower than expected.