sagemath / sage

Main repository of SageMath
https://www.sagemath.org
Other
1.38k stars 469 forks source link

Special case modn_dense matrix operations to improve performance #15104

Closed nbruin closed 7 years ago

nbruin commented 11 years ago

Presently, taking a transpose of a modn_dense matrix of smallish size is much more expensive than constructing the right kernel, because the method is just a generic implementation. We can make this much faster. Same for submatrix and stack.

CC: @simon-king-jena @nthiery

Component: linear algebra

Keywords: sd86.5

Author: Nils Bruin, Simon King

Branch: 5c878c7

Reviewer: Travis Scrimshaw

Issue created by migration from https://trac.sagemath.org/ticket/15104

nbruin commented 7 years ago

Changed keywords from none to sd86.5

nbruin commented 7 years ago

Changed work issues from regression in right_kernel_matrix to none

tscrim commented 7 years ago

Changed branch from u/nbruin/ticket/15104 to public/linear_algebra/modn_dense_speedup-15104

tscrim commented 7 years ago

Changed commit from ce2cfdb to 10fa935

tscrim commented 7 years ago
comment:46

I made some doc improvements and fixed a typo that caused this to not compile.

Using the same test order as comment:5:

10000 loops, best of 3: 93.2 µs per loop
10000 loops, best of 3: 95.5 µs per loop
100000 loops, best of 3: 3.95 µs per loop
10000 loops, best of 3: 89.3 µs per loop
100000 loops, best of 3: 8.08 µs per loop

vs develop:

10000 loops, best of 3: 132 µs per loop
10000 loops, best of 3: 133 µs per loop
100000 loops, best of 3: 2.73 µs per loop
100000 loops, best of 3: 6.69 µs per loop
100000 loops, best of 3: 8.38 µs per loop

So the right kernel matrices are faster, but not the transpose, stack (this one is really bad), and submatrix.

For the transpose, you are not using self._parent.transposed lazy attribute.

For stack, the meat of the code should be in cdef _stack_impl(self, bottom): (see matrix1.pyx).

Instead of overwriting submatrix, it would be better to define matrix_from_rows_and_columns.

However, I don't have a reason why these direct implementations are slower than their generic counterparts.

Addendum: Actually, a good part of the stack slowdown might be because you are doing the bulk of the work in a def function rather than a cdef.


New commits:

10fa935Cleanup doc and fix typo.
tscrim commented 7 years ago

Reviewer: Travis Scrimshaw

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Changed commit from 10fa935 to 27f9467

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

27f9467faster construction of new parents
nbruin commented 7 years ago
comment:48

Hm, somehow the changes for better construction of parents didn't make it in. Fixed now. At least nothing is worse now. However, without improvement it's hard to argue why to make the changes. Perhaps there are other cases where differences are bigger.

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Changed commit from 27f9467 to 81c7ef4

7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Branch pushed to git repo; I updated commit sha1. New commits:

81c7ef4Use _stack_impl for the work and leave the checks for stack.
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Branch pushed to git repo; I updated commit sha1. This was a forced push. New commits:

5c878c7Use _stack_impl for the work and leave the checks for stack.
7ed8c4ca-6d56-4ae9-953a-41e42b4ed313 commented 7 years ago

Changed commit from 81c7ef4 to 5c878c7

tscrim commented 7 years ago
comment:51

By using _stack_impl, I could get some extra speed out of stack:

100000 loops, best of 3: 9.13 µs per loop
100000 loops, best of 3: 10.5 µs per loop
100000 loops, best of 3: 2.14 µs per loop
100000 loops, best of 3: 5.9 µs per loop
100000 loops, best of 3: 8.34 µs per loop

The most amazing thing is the speedup of the right kernel matrix. There was a lot of overhead there: about a 10x speedup.

If my last changes are good, then positive review.

nbruin commented 7 years ago
comment:52

Thanks! This looks fine. It seems this is a uniform improvement, and further optimizations would likely need to consider some new ideas and areas, so this is definitely worth merging.

vbraun commented 7 years ago

Changed branch from public/linear_algebra/modn_dense_speedup-15104 to 5c878c7

jdemeyer commented 7 years ago
comment:54

For future reference, code like

cdef Py_ssize_t* nonpivots = <Py_ssize_t*>sig_malloc(sizeof(Py_ssize_t)*(ncols-r))

is unsafe for 2 reasons:

  1. sig_malloc might return NULL in the case that no memory is available.

  2. The multiplication sizeof(Py_ssize_t) * (ncols-r) can overflow (this is unlikely to happen in practice for matrices, but it's still a bug)

Luckily, the function check_allocarray from cysignals solves both these problems. The allocation could be written as

cdef Py_ssize_t* nonpivots = <Py_ssize_t*>check_allocarray(ncols - r, sizeof(Py_ssize_t))
jdemeyer commented 7 years ago

Changed commit from 5c878c7 to none