Delite takes longer time for Staging - Will improving System Specs help

leratojeffrey commented 10 years ago

Hi everyone,

I am familiarizing myself with Delite by implementing some basic signal processing algorithms using OptiML. However, Delite takes hours to stage and execute the following simple FIR filter application:

import ppl.dsl.optiml._ // Include all OptiML - Datastructure, Codegens, etc.
import scala.reflect.SourceContext // We need some Scala Source Code Contexts
import ppl.delite.framework.datastructures._

// Declare a Signal Generator Class
object tFIRFilterRunner extends OptiMLApplicationRunner with tFIRFilter
trait tFIRFilter extends OptiMLApplication {

def ConstFloatingPointFilterCoefficients(): Rep[DenseVector[Double]] =
{
    val VCOEFFS: Rep[DenseVector[Double]] = Vector(-0.0448093,0.0322875,0.0181163,0.0087615,0.0056797,0.0086685,0.0148049,0.0187190,0.0151019,0.0027594,-0.0132676,-0.0232561, -0.0187804,0.0006382,0.0250536,0.0387214,0.0299817,0.0002609,-0.0345546,-0.0525282,-0.0395620,0.0000246,0.0440998,0.0651867,0.0479110,0.0000135,-0.0508558,-0.0736313,-0.0529380,-0.0000709,0.0540186,0.0766746,0.0540186,-0.0000709,-0.0529380,-0.0736313,-0.0508558,0.0000135,0.0479110,0.0651867,0.0440998,0.0000246,-0.0395620,-0.0525282,-0.0345546,0.0002609,0.0299817,0.0387214,0.0250536,0.0006382,-0.0187804,-0.0232561,-0.0132676,0.0027594,0.0151019,0.0187190,0.0148049,0.0086685,0.0056797,0.0087615,0.0181163,0.0322875,-0.0448093)
    VCOEFFS
}

def FIRFilter(COEFFS: Rep[DenseVector[Double]],INPUT: Rep[DenseVector[Double]],LENGTH: Int ): Rep[Double] =
{
    var acc = 0d    // accumulator for MACs

    for(v <- 0 until LENGTH)
    {
        // perform the multiply-accumulate
        acc += COEFFS(v) * INPUT(v)
    }
    acc
}

//Generate Sinusoidal Test Data 
// Let's generate a function that returns a generated signal of type Double
def GenerateSignal(N:Int): Rep[DenseVector[Double]] =
{   
    val idata = DenseVector[Double](N,true)
    for(i <- 0 until N)
    {
        idata(i)=sin(2*3.1415926435897883*500*i) + .5*sin(2*3.1415926435897883*600*i) + 2*sin(2*3.1415926435897883*700*i)
    }
    idata
}

def main() = {
// Get Coefficients of Length = 63
val FCoefficients = ConstFloatingPointFilterCoefficients()

// Prep. Data Length and Tap Order
val SLength = 4096
val FTapOrder = 32
val K = SLength - FTapOrder + 1
// Generate SLength Samples - 4096 Samples
val FInput = GenerateSignal(SLength)
// Declare our FIR Output
val FOutput= DenseVector[Double](K,true)
// Shift and FIR
for(SHIndex <- 0 until K)
{
    var tmpvals = FInput(SHIndex::FTapOrder+SHIndex) // Single Step Shift of Input Values without getting rid of original input values
    FOutput(SHIndex) = FIRFilter(FCoefficients,tmpvals, FTapOrder) // Lets Perform FIR Algorithm
    println(FOutput(SHIndex)) // Lets see the output
}

} }

Is this normal for Delite, or am I doing something wrong? Please look at the source code and let me know if there is something I am doing wrong to piss Delite off. Delite takes a very long time, and I wonder if it's my system specification I should be worried about. My system is a quad-core AMD Phenom II processor with 16GB of memory and 2 NVidia GTX480 GPUs, running Ubuntu 12.04 64-bit. Do I need to upgrade my system?

Please help out, guys. I was not sure whether it is my DSL that's very slow, which is why I tried the same application with OptiML. I know OptiML is designed specifically for ML, but as you can see I just borrowed some of its constructs to implement this simple application.

FYI: I am using the newer version of Delite, the one that is set up with a fork as described in the OptiML getting started guide.

Cheers, Lerato

asujeeth commented 10 years ago

hi Lerato,

We've seen these really long staging times before, but they've been drastically improved in recent versions of Delite and LMS. Are you installing OptiML from source? If so, can you make sure that you're up to date on all of the repositories? If not, try to install the most recent version from source (http://stanford-ppl.github.io/Delite/optiml/getting_started.html#title_from source).

thanks, Arvind

(In reply to the message from Lerato J. Mohapi of 1/17/14, 8:19 AM, https://github.com/stanford-ppl/Delite/issues/36)

TiarkRompf commented 10 years ago

One thing that might be worth checking is whether the for loops are completely unrolled at staging time. The ranges seem to be constants, and humongous generated code would explain very long staging times.
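To make the unrolling concern concrete, here is a toy model in plain Scala. It is not the LMS or Delite API: the `Exp` node types and the names `firUnrolled` and `size` are made up for illustration. It only shows the general effect: when a loop bound is an ordinary `Int`, the loop runs while the program representation is being built, so every iteration leaves its own nodes behind.

```scala
// Toy model of staging; hypothetical types, NOT the LMS/Delite API.
object UnrollDemo {
  sealed trait Exp
  case class Const(v: Double) extends Exp
  case class Read(vec: String, idx: Int) extends Exp
  case class Mul(a: Exp, b: Exp) extends Exp
  case class Add(a: Exp, b: Exp) extends Exp

  // Number of nodes in an expression tree.
  def size(e: Exp): Int = e match {
    case Const(_) | Read(_, _) => 1
    case Mul(a, b)             => 1 + size(a) + size(b)
    case Add(a, b)             => 1 + size(a) + size(b)
  }

  // `taps` is a plain Int, so this loop executes while the tree is built:
  // each iteration appends one Add, one Mul and two Read nodes.
  def firUnrolled(taps: Int, offset: Int): Exp = {
    var acc: Exp = Const(0.0)
    for (v <- 0 until taps)
      acc = Add(acc, Mul(Read("coeffs", v), Read("input", offset + v)))
    acc
  }

  def main(args: Array[String]): Unit = {
    val taps = 32
    val outputs = 4096 - taps + 1 // same sizes as in the posted application
    val perOutput = size(firUnrolled(taps, 0))
    println(s"nodes per output sample: $perOutput")
    println(s"nodes if the outer loop is unrolled too: ${perOutput.toLong * outputs}")
  }
}
```

If both loops in the posted application are unrolled like this, the generated program grows with the product of the two ranges, whereas a loop kept as a single symbolic node would stay the same size regardless of the trip count.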

leratojeffrey commented 10 years ago

I solved this by kernel embedding the FIR-filter-like constructs using DeliteOpIndexedLoop as follows:

    .....
    case class FIRAccMult[A:Manifest:Arith](x: Exp[DenseVector[A]], y: Exp[DenseVector[A]], tpodr: Exp[Int], out: Exp[DenseVector[A]])
      extends DeliteOpIndexedLoop
    {
        val size = copyTransformedOrElse(_.size)(x.length+tpodr)
        def func = i => { if(i< x.length-tpodr+1) out(i) = out(i) + x(i::tpodr+i)*:*y(unit(0)::tpodr) }

        val mA = manifest[A]
        val a = implicitly[Arith[A]]
    }
    ...

This seems to work fine for both the Scala and CUDA generated code, but may need further improvement. Any more suggestions or tips will be appreciated.

I can now do:

    val firout = FIR([inputvector], [coeffs_vector], M) // M = tap order

While it is still under construction, this is now part of OptiSDR, which depends fully on OptiLA+OptiML. I am still using a version of Delite that I downloaded sometime in January. I tried updating to the new version two months ago but was not successful. I will try again, though. Will I need to change my new Ops when I update to the latest Delite source?
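For checking the output of an op like this, a plain Scala reference can be handy. The sketch below assumes the op computes, for each output index, the dot product of a window of the input with the first `taps` coefficients, which is how the `func` body above reads. The object and method names are hypothetical and nothing here uses Delite.

```scala
// Plain-Scala reference for a sliding-window FIR; names are illustrative only.
object FirReference {
  // out(i) = sum over k < taps of coeffs(k) * input(i + k),
  // for i in 0 .. input.length - taps
  def fir(input: Array[Double], coeffs: Array[Double], taps: Int): Array[Double] = {
    require(taps > 0 && taps <= coeffs.length && taps <= input.length,
      "taps must be positive and no larger than either array")
    Array.tabulate(input.length - taps + 1) { i =>
      var acc = 0.0
      var k = 0
      while (k < taps) {
        acc += coeffs(k) * input(i + k) // multiply-accumulate
        k += 1
      }
      acc
    }
  }

  def main(args: Array[String]): Unit = {
    // Two-tap averaging filter over a short ramp.
    val out = fir(Array(1.0, 2.0, 3.0, 4.0), Array(0.5, 0.5), 2)
    println(out.mkString(", "))
  }
}
```

Each output element depends only on the inputs, not on other outputs, which is the property that makes the computation a candidate for an indexed parallel loop.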

leratojeffrey commented 10 years ago

Actually, I am not sure if this is kernel embedding, as I use Delite parallel ops...?

leratojeffrey commented 10 years ago

OK. It seems I don't know what that (kernel embedding) means, but anyway, that's how I tried solving it, guys. Any suggestions?

asujeeth commented 10 years ago

Hi Lerato, I am quite busy this week but can take a look next week. What is the issue you are trying to debug? Is it still staging time? It is certainly worth trying to upgrade to the latest version. Please post any errors that you run into while doing so.

leratojeffrey commented 10 years ago

Hi Arvind, this issue is sorted. Please check the one I posted yesterday about matrix inverse. I am trying to generate CUDA code for the DenseMatrix.inv statement in OptiML, and I realized that in DenseMatrixOps the DenseMatrixInverse case class extends DeliteOpSingleWithManifest, which in my experience allows emitting only sequential/Scala code, as it is not a parallel op. I may be wrong about this, as I am still getting to understand some of these things.

If I am right about this, do you think it's possible to implement DenseMatrixInverse using DeliteOpIndexedLoop or DeliteOpForEach with the Gauss-Jordan elimination algorithm? I have tried this with pure CUDA and it seems to work fine, although I have not compared speed-ups against the sequential pure C version. My plan was to try it first, but my advisors suggested I ask before making any attempt. Please advise.
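As a point of reference, here is a minimal sequential sketch of Gauss-Jordan inversion in plain Scala. It is not Delite code and not how DenseMatrixInverse is implemented; the object name, the partial-pivoting step and the `1e-12` singularity threshold are assumptions made for illustration. Inversion is only defined for square matrices, so the sketch rejects anything else.

```scala
// Sequential Gauss-Jordan inversion sketch; illustrative names and threshold.
object GaussJordan {
  // Returns Some(inverse) for a non-singular square matrix, None if singular.
  def invert(m: Array[Array[Double]]): Option[Array[Array[Double]]] = {
    val n = m.length
    require(n > 0 && m.forall(_.length == n), "matrix must be square")
    // Build the augmented matrix [M | I].
    val a = Array.tabulate(n, 2 * n) { (r, c) =>
      if (c < n) m(r)(c) else if (c - n == r) 1.0 else 0.0
    }
    var col = 0
    var singular = false
    while (col < n && !singular) {
      // Partial pivoting: pick the row with the largest entry in this column.
      val pivot = (col until n).maxBy(r => math.abs(a(r)(col)))
      if (math.abs(a(pivot)(col)) < 1e-12) singular = true
      else {
        val tmp = a(col); a(col) = a(pivot); a(pivot) = tmp
        val p = a(col)(col)
        for (c <- 0 until 2 * n) a(col)(c) /= p // normalize the pivot row
        // Each row below is updated independently of the others for a fixed
        // pivot, so this loop is the natural candidate for a parallel op.
        for (r <- 0 until n if r != col) {
          val f = a(r)(col)
          if (f != 0.0) for (c <- 0 until 2 * n) a(r)(c) -= f * a(col)(c)
        }
      }
      col += 1
    }
    if (singular) None else Some(a.map(_.drop(n)))
  }

  def main(args: Array[String]): Unit = {
    val inv = invert(Array(Array(4.0, 7.0), Array(2.0, 6.0)))
    inv.foreach(rows => rows.foreach(r => println(r.mkString(" "))))
  }
}
```

The outer loop over pivot columns is inherently sequential, while the row elimination inside it is data-parallel. A parallel version would therefore launch one parallel row update per pivot column rather than a single parallel loop over the whole matrix.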

Here is some sample code I tried with OptiML:

    val m1 = DenseMatrix.rand(10000,4250)
    val invm1 = m1.inv

Please look at it when you have time and advise, even if it's some time next week. Thank you.
