scijs / ndarray-householder-qr

Householder triangularization for QR Factorization of ndarrays
10 stars 1 forks source link

Unroll slices to avoid garbage collection #6

Closed rreusser closed 9 years ago

rreusser commented 9 years ago

@mikolalysenko — I was curious how I might optimize these linear algebra routines, so I tried out unrolling the slices so that it doesn't instantiate new ndarray slices on every inner loop iteration. Instead, it just calculates a single slice at the beginning and increments the offset. (I also just set *.data[…] directly in some spots since .set(0,…) is trivial — although this certainly doesn't amount to any significant speedup). I added tests using slices of high-dimensional arrays with weird steps as input, so I'm reasonably confident in the correctness. The result is significantly less maintainable but avoids lots of object allocation/garbage collection.

Question: I'll profile to see if it has any affect, but at a glance, do you think this is a remotely reasonable approach once things have been developed and tested? I feel like the original version should be saved so that it's clear where this madness came from. It's basically the sort of thing cwise does, just by hand…

Here's an example:

  // Instantiate a single slice at the beginning:
  Ajmj = A.pick(null,0)
  Ajmi = A.pick(null,0)

  // Figure out the offsets and increments:
  var AjmjInc = A.stride[0] + A.stride[1]
  var AjmiOff = Ajmi.offset
  var AjmiInc1 = A.stride[0]
  var AjmiInc2 = A.stride[1]
  var dInc = d.stride[0]
  var dOff = d.offset

  for( j=0; j<n; j++, Ajmj.shape[0]--, Ajmj.offset+=AjmjInc, AjmiOff+=AjmiInc1, dOff+=dInc) {
    s = blas.nrm2(Ajmj)
    ajj = Ajmj.data[Ajmj.offset]
    dj = ajj > 0 ? -s : s 
    d.data[dOff] = dj
    s = Math.sqrt( s * (s + Math.abs(ajj)) )
    if( s === 0 ) return false
    Ajmj.data[Ajmj.offset] = ajj - dj
    blas.scal( 1/s, Ajmj)

    for(i=j+1, Ajmi.offset=AjmiOff + AjmiInc2*(j+1); i<n; i++, Ajmi.offset+=AjmiInc2) {
      s = - blas.dot( Ajmj, Ajmi )
      blas.axpy( s, Ajmj, Ajmi )
    }   
  }

  return true
}
rreusser commented 9 years ago

Upon thinking about this for a while, I've realized that in the first place, ndarrays are really only a thin (but effective!) wrapper around index logic, and that BLAS's support for strides and offsets are perfectly compatible. So in other words, it seems what I'm really working toward is either a simple macro layer to unroll these things (similar to cwise except designed to be integrated into algorithms like this) or just unrolling it and plugging it into real BLAS. Or just plugging ndarrays into LAPACK or something. Hmmm…

rreusser commented 9 years ago

Okay, so I benchmarked this for Householder QR factorization + solve with a 150x60 matrix. The results are pretty consistent. I reused arrays, threw away the first 200 iterations, and kept the last 100.

Overgeneralized but representative answer: The original rewrite (using an algorithm that skips a couple vector multiplications and packs the storage a bit tighter) yielded ballpark 5% speedup, and unrolling the slices as described above yielded about 8-10% speedup

$ npm run bench

QR factorization + solution (150 x 60)
┌──────────────────────────────────┬────────────┬──────────────┬──────────────┬───────────────┐
│ Test                             │ Iterations │ Average (ms) │ Longest (ms) │ Shortest (ms) │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Naively implemented algorithm    │ 100        │ 7.632        │ 9.496        │ 6.973         │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Streamlined algorithm            │ 100        │ 7.496        │ 10.101       │ 6.886         │
├──────────────────────────────────┼────────────┼──────────────┼──────────────┼───────────────┤
│ Streamlined + unrolled algorithm │ 100        │ 6.469        │ 7.633        │ 5.959         │
└──────────────────────────────────┴────────────┴──────────────┴──────────────┴───────────────┘
rreusser commented 9 years ago

Been playing with this for a while and am fairly confident in the correctness, so I'm gonna merge.