stratosphere / incubator-systemml

Mirror of Apache SystemML (Incubating)
Apache License 2.0

Implement Linear Regression for hybrid_flink mode #4

Closed akunft closed 8 years ago

akunft commented 8 years ago

To create the first PR, we should identify the implementation work required for an end-to-end execution of LinregDS / LinregCG and discuss here how to split it.

It would be best if both of you @fschueler & @carabolic have a look at this (I can only assign one :).

fschueler commented 8 years ago

I am looking into the flink and hybrid_flink execution modes, @carabolic is working on the necessary instructions.

fschueler commented 8 years ago

I added a test that allows us to run the Linear Regression DML script (direct solver). In hybrid_flink mode, the LinearRegDS.dml script already works (it only uses Flink reblock instructions).
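For context on what "direct solver" means here compared with the CG variant, the following is a minimal plain-Python sketch. It is not the SystemML DML code and does not use Flink; the function names and the tiny data set are made up for illustration. It assumes both approaches solve the normal equations `(X^T X) w = X^T y`, once by elimination and once by conjugate gradient.

```python
# Illustrative only: plain-Python stand-ins, not the SystemML DML scripts.
# All function names and the data below are hypothetical.

def normal_equations(X, y):
    """Return A = X^T X and b = X^T y for a list-of-rows matrix X."""
    n = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(n)]
    return A, b

def solve_direct(A, b):
    """Direct solve of A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (M[i][n] - s) / M[i][i]
    return w

def solve_cg(A, b, max_iter=100, tol=1e-12):
    """Conjugate gradient on A w = b (A must be symmetric positive definite)."""
    n = len(b)
    w = [0.0] * n
    r = b[:]  # residual b - A w for the starting point w = 0
    p = r[:]
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs < tol:
            break
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        w = [w[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return w

# Made-up data generated from y = 1 + 2*x; the first column is the intercept.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
A, b = normal_equations(X, y)
w_ds = solve_direct(A, b)  # direct-solver style
w_cg = solve_cg(A, b)      # conjugate-gradient style
```

Both routes should agree on the weights up to floating-point error. The direct route needs the full `X^T X` matrix at once, while CG only needs repeated matrix-vector products, which is one reason the two scripts can end up exercising different instructions.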

I think we should also get the GLM-predict.dml script to run for our end-to-end example. This script uses (in hybrid_spark mode) a couple more instructions that are not yet implemented in Flink:

The other instructions are MatrixScalarArithmeticInstructions ("*" and "/") that we should already have. We might need to add the ArithmeticInstruction abstraction similar to Spark.

I think getting this to work in hybrid_flink mode should be the first step. We can then add the instruction for pure flink mode.

fschueler commented 8 years ago

It turns out that the number of Flink instructions increases significantly during recompilation for the GLM-predict.dml script (matrix-indexing, relationalbinary, ...).

Should we add these or make the PR only for the LinearReg*.dml scripts?

fschueler commented 8 years ago

It could actually be a bug that I introduced... I am investigating! :eyeglasses:

fschueler commented 8 years ago

Unfortunately, I think it's not a bug; the same happens for Spark. So we can either implement all missing instructions for GLM-predict.dml or have a PR for only the other scripts...

fschueler commented 8 years ago

I think this is done and we should focus on testing and cleanup now. One thing that we should resolve for the PR is #15 - when running in hybrid_flink mode on a cluster this will probably be needed.