pydata / sparse

Sparse multi-dimensional arrays for the PyData ecosystem
https://sparse.pydata.org
BSD 3-Clause "New" or "Revised" License

Improve performance of GCXS dot ndarray #643

Closed jcapriot closed 6 months ago

jcapriot commented 6 months ago

Improves the memory access pattern for GCXS array dot ndarray operations.

I was running a few benchmarks of the GCXS class against scipy's csr/csc arrays and noticed that GCXS was significantly slower, so I was curious where the biggest difference came from, since their data structures shouldn't be fundamentally different. Looking at the csr_dot_ndarray and csc_dot_ndarray code, I realized that the memory access pattern over the dense arrays was inefficient.
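
To make the access-pattern point concrete: the cache-friendly ordering streams along the contiguous (last) axis of the dense operand. A minimal pure-Python sketch of a CSR-style (compressed axis 0) matrix times a 2D dense array with that ordering might look like the following; the function and variable names are illustrative and are not the PR's actual numba kernels:

```python
import numpy as np

def csr_dot_ndarray_sketch(indptr, indices, data, B):
    # Illustrative only (assumed names, no numba): accumulating whole rows,
    # C[i, :] += data[k] * B[j, :], reads B and writes C along their
    # contiguous axis, which is the cache-friendly pattern for C-ordered arrays.
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=np.result_type(data, B))
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]
            C[i, :] += data[k] * B[j, :]
    return C
```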

Here are the relevant timings, starting with the setup:

import scipy.sparse as sp
import sparse
import numpy as np
state = np.random.default_rng(seed=1337)
A_gc0s = sparse.random((9000, 10000), format='gcxs', random_state=state).change_compressed_axes((0,))
A_gc1s = A_gc0s.change_compressed_axes((1,))
A_csr = A_gc0s.to_scipy_sparse()
A_csc = A_gc1s.to_scipy_sparse()

n_vecs = 20
n_broadcasted = 5

v = state.random((A_gc0s.shape[1], n_vecs))                 # 2D right-hand operand
u = state.random((n_broadcasted, A_gc0s.shape[1], n_vecs))  # broadcasted 3D right-hand operand
x = state.random((n_vecs, A_gc0s.shape[0]))                 # 2D left-hand operand

SciPy timings:

%timeit A_csr @ v
18.1 ms ± 1.32 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit A_csc @ v
19 ms ± 1.03 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit x @ A_csr
21.9 ms ± 3.39 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_csc
21 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Sparse timings:

# trigger numba JIT compilation up front so it is excluded from the timings
A_gc0s @ v
A_gc1s @ v
x @ A_gc0s
x @ A_gc1s
A_gc0s @ u
A_gc1s @ u;

Timing on main:

%timeit A_gc0s @ v
58.4 ms ± 4.67 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
179 ms ± 7.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
336 ms ± 44.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit A_gc1s @ u
1.06 s ± 80 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit x @ A_gc0s
126 ms ± 7.89 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
55.7 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing on PR:

%timeit A_gc0s @ v
19.8 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ v
23.4 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc0s @ u
65 ms ± 5.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit A_gc1s @ u
77 ms ± 5.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc0s
32.8 ms ± 3.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit x @ A_gc1s
57.6 ms ± 3.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The left-ndarray-dot-GCXS case is still a little slower than I'd like, but I couldn't see an obvious place to dig in further, since it already appears to do the right thing for that operation (essentially computing (GCXS.T @ x.T).T).
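
For reference, the identity relied on there can be sanity-checked directly; this is just an illustrative check on small made-up inputs, not code from the PR:

```python
import numpy as np
import sparse

rng = np.random.default_rng(0)
A = sparse.random((50, 60), format="gcxs", random_state=rng)
x = rng.random((7, 50))

# x @ A is mathematically the transpose of A.T @ x.T; the two paths
# should agree to floating-point precision.
np.testing.assert_allclose(x @ A, (A.T @ x.T).T)
```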

One other thing: for 1D ndarrays, numba doesn't seem to be able to optimize these kernels as well as it could (judging from benchmarks against scipy csr/csc times a 1D array). There should probably be a branch for 1D ndarrays that dispatches to a version without the loop over the second dimension, as sketched below.
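
A 1D fast path, in the same spirit as the earlier sketch, would drop that inner loop entirely. Again, this is only an illustration with assumed names, not a proposed kernel:

```python
import numpy as np

def csr_dot_vector_sketch(indptr, indices, data, b):
    # Hypothetical 1D branch: plain scalar accumulation per row, with no
    # loop over a second dimension of the dense operand.
    n_rows = len(indptr) - 1
    out = np.zeros(n_rows, dtype=np.result_type(data, b))
    for i in range(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * b[indices[k]]
        out[i] = acc
    return out
```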

jcapriot commented 6 months ago

> Thank you for the changes! Would you be willing to write benchmarks (examples are in the benchmarks directory; see https://asv.readthedocs.io/en/stable/writing_benchmarks.html)?

Sure!
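
A minimal ASV benchmark along those lines might look roughly like this; the class and method names are placeholders and don't correspond to anything already in the benchmarks directory:

```python
import numpy as np
import sparse

class GCXSDotNdarraySuite:
    """Time GCXS @ dense ndarray for both compressed axes."""

    def setup(self):
        rng = np.random.default_rng(seed=1337)
        self.A_gc0s = sparse.random(
            (9000, 10000), format="gcxs", random_state=rng
        ).change_compressed_axes((0,))
        self.A_gc1s = self.A_gc0s.change_compressed_axes((1,))
        self.v = rng.random((self.A_gc0s.shape[1], 20))
        # run once during setup so numba compilation is excluded from the timings
        self.A_gc0s @ self.v
        self.A_gc1s @ self.v

    def time_gcxs0_dot_ndarray(self):
        self.A_gc0s @ self.v

    def time_gcxs1_dot_ndarray(self):
        self.A_gc1s @ self.v
```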

codecov[bot] commented 6 months ago

Codecov Report

Merging #643 (c06b832) into main (a1d2081) will decrease coverage by 0.02%. The diff coverage is 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #643      +/-   ##
==========================================
- Coverage   90.22%   90.21%   -0.02%
==========================================
  Files          20       20
  Lines        3674     3670       -4
==========================================
- Hits         3315     3311       -4
  Misses        359      359
```