rreusser closed this 9 years ago
Upon thinking about this for a while, I've realized that ndarrays are really just a thin (but effective!) wrapper around index logic, and that BLAS's support for strides and offsets is perfectly compatible with it. In other words, it seems what I'm really working toward is either a simple macro layer to unroll these things (similar to cwise, except designed to be integrated into algorithms like this), or just unrolling by hand and plugging into real BLAS. Or just plugging ndarrays into LAPACK or something. Hmmm…
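To illustrate that compatibility, here's a minimal sketch (plain objects stand in for ndarray views, and the `axpy` signature loosely mirrors BLAS daxpy's `(n, alpha, x, incx, y, incy)` form; none of these names come from the actual code):

```javascript
// A 1-D "view": flat storage plus a stride and an offset,
// which is exactly what an ndarray slice boils down to.
function view(data, stride, offset) {
  return { data: data, stride: stride, offset: offset };
}

// y <- alpha * x + y, walking both views via their own stride/offset.
// This is the same index logic BLAS expresses with inc/offset parameters.
function axpy(n, alpha, x, y) {
  for (var i = 0; i < n; i++) {
    y.data[y.offset + i * y.stride] += alpha * x.data[x.offset + i * x.stride];
  }
}

// Example: operate on every other element, starting at index 1.
var xs = [1, 2, 3, 4, 5, 6];
var ys = [0, 0, 0, 0, 0, 0];
axpy(3, 10, view(xs, 2, 1), view(ys, 2, 1));
// ys is now [0, 20, 0, 40, 0, 60]
```

Since an ndarray already carries `data`, `stride`, and `offset`, handing those straight to a BLAS-style routine costs nothing.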
Okay, so I benchmarked this for Householder QR factorization + solve with a 150x60 matrix. The results are pretty consistent. I reused arrays, threw away the first 200 iterations, and kept the last 100.
Overgeneralized but representative answer: the original rewrite (using an algorithm that skips a couple of vector multiplications and packs the storage a bit tighter) yielded a ballpark 5% speedup, and unrolling the slices as described above yielded about an 8–10% speedup.
$ npm run bench
QR factorization + solution (150 x 60)
┌──────────────────────────────────┬────────────┬──────────────┬──────────────┬───────────────┐
│ Test │ Iterations │ Average (ms) │ Longest (ms) │ Shortest (ms) │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Naively implemented algorithm │ 100 │ 7.632 │ 9.496 │ 6.973 │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Streamlined algorithm │ 100 │ 7.496 │ 10.101 │ 6.886 │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Streamlined + unrolled algorithm │ 100 │ 6.469 │ 7.633 │ 5.959 │
└──────────────────────────────────┴────────────┴──────────────┴──────────────┴───────────────┘
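For reference, the warm-up/measure pattern described above (discard the first iterations, keep the last N) looks roughly like this; a hypothetical sketch, not the actual harness, and `bench` and its parameters are made up for illustration:

```javascript
// Run fn (warmup + kept) times; discard the warm-up timings so JIT
// compilation and GC settling don't pollute the statistics.
function bench(fn, warmup, kept) {
  var times = [];
  for (var i = 0; i < warmup + kept; i++) {
    var t0 = Date.now();
    fn();
    var dt = Date.now() - t0;
    if (i >= warmup) times.push(dt); // keep only the post-warm-up runs
  }
  var avg = times.reduce(function (a, b) { return a + b; }, 0) / times.length;
  return {
    average: avg,
    longest: Math.max.apply(null, times),
    shortest: Math.min.apply(null, times)
  };
}
```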
Been playing with this for a while and am fairly confident in the correctness, so I'm gonna merge.
@mikolalysenko — I was curious how I might optimize these linear algebra routines, so I tried unrolling the slices so that it doesn't instantiate new ndarray slices on every inner loop iteration. Instead, it calculates a single slice at the beginning and increments the offset. (I also just set `.data[…]` directly in some spots, since `.set(0, …)` is trivial — although this certainly doesn't amount to any significant speedup.) I added tests using slices of high-dimensional arrays with weird steps as input, so I'm reasonably confident in the correctness. The result is significantly less maintainable, but it avoids lots of object allocation/garbage collection.

Question: I'll profile to see if it has any effect, but at a glance, do you think this is a remotely reasonable approach once things have been developed and tested? I feel like the original version should be saved so that it's clear where this madness came from. It's basically the sort of thing cwise does, just by hand…
Here's an example:
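A hypothetical sketch of the transformation (not the actual routine; a minimal stand-in object replaces ndarray, and column sums stand in for the real inner-loop work):

```javascript
// Minimal stand-in for a 2-D ndarray view over flat storage.
function view2d(data, rows, cols, rowStride, colStride, offset) {
  return { data: data, rows: rows, cols: cols,
           rowStride: rowStride, colStride: colStride, offset: offset };
}

// Naive: allocate a fresh column-slice object on every outer iteration.
function colSumsNaive(A) {
  var sums = [];
  for (var j = 0; j < A.cols; j++) {
    var col = { data: A.data, stride: A.rowStride,
                offset: A.offset + j * A.colStride }; // new object per column
    var s = 0;
    for (var i = 0; i < A.rows; i++) s += col.data[col.offset + i * col.stride];
    sums.push(s);
  }
  return sums;
}

// Unrolled: compute one offset up front and increment it per column,
// so no per-iteration slice objects are created at all.
function colSumsUnrolled(A) {
  var sums = [];
  var o = A.offset;
  for (var j = 0; j < A.cols; j++, o += A.colStride) {
    var s = 0;
    for (var i = 0, k = o; i < A.rows; i++, k += A.rowStride) s += A.data[k];
    sums.push(s);
  }
  return sums;
}

// 2x3 row-major matrix [[1, 2, 3], [4, 5, 6]]:
var A = view2d([1, 2, 3, 4, 5, 6], 2, 3, 3, 1, 0);
// Both versions compute the column sums [5, 7, 9].
```

The unrolled version is harder to read, but the inner loops touch only flat offsets, which is what avoids the per-iteration allocation and GC pressure.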