Matrix Inverse CUDA generation - no kernels loaded

leratojeffrey commented 10 years ago

Hi, I am trying to generate a CUDA code for DenseMatrix.inv statement in OptiML and I realized that in the DenseMatrixOps the DenseMatrixInverse case class inherits/extends a DeliteOpSingleWithManifest, which according to my experience will allow emitting only a sequential/Scala code as it is not a parallel op. I maybe wrong about this as I am still getting to understand some of these things.

If I am right about this, do you think it's possible to try implementing DenseMatrixInverse using DeliteOpIndexedLoop or DeliteOpForEach with the concept of Guass-Jordan elimination algorithm. I have tried this with pure CUDA and it seems to work fine although I have not compared any speed-ups with the sequential pure C version. My plan was to try it first but my advisors suggested I find out first before any attempt. Please advice.

Here is the code I tried on OptiML and my new DSL (OptiSDR), which currently adotpts/inherits most functionality from OptiLA.

        val m1 = DenseMatrix.rand(10000,4250)
        val invm1 = m1.inv

hyouklee commented 10 years ago

Hi Lerato,

You're right that the current implementation of the matrix inverse is sequential, and therefore CUDA kernel will not be generated. I think it's worth trying to implement using parallel ops as you mentioned if it's not too complicated. One thing to note is that there are existing CUDA libraries you can use to calculate the matrix inverse, and I'm not sure if using the Delite parallel ops would perform poorly compared to those implementations.

leratojeffrey commented 10 years ago

Thanks Lee, I will try it using Delite parallel ops and let you Guys know soon what I came up with.

stanford-ppl / Delite

Matrix Inverse CUDA generation - no kernels loaded #39