sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Something wrong with vector? #33

Closed hbghhy closed 7 years ago

hbghhy commented 7 years ago

I just tested a toy example in Spark 2.1.1, and it reports:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(vFeatures)' due to data type mismatch: argument 1 requires vector type, however, 'vFeatures' is of vector type.;; 'Project [id#9, features#10, vFeatures#11, clicked#12, UDF(vFeatures#11) AS buckedFeatures#87] +- Project [_1#0 AS id#9, _2#1 AS features#10, _3#2 AS vFeatures#11, _4#3 AS clicked#12] +- LocalRelation [_1#0, _2#1, _3#2, _4#3]

I have looked at the source code; is it because transform in the ml version of MDLP calls this:

val discModel = new feature.mdlp_discretization.DiscretizerModel(splits)
val discOp = udf { discModel.transform _ }
dataset.withColumn($(outputCol), discOp(col($(inputCol))).as($(outputCol), metadata))

And in Spark 2 a vector column in the ml API is org.apache.spark.ml.linalg.Vector, but discModel.transform needs org.apache.spark.mllib.linalg.Vector. Could that mismatch be what causes the above error?
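For illustration, here is a minimal sketch of the kind of bridging the udf would need (just my guess at a fix, not necessarily what the pull request below does), using the built-in converters Vectors.fromML and Vector.asML:

val discOp = udf { v: org.apache.spark.ml.linalg.Vector =>
  // convert the incoming ml vector to the mllib vector the old model expects,
  // then convert the result back to an ml vector for the output column
  discModel.transform(org.apache.spark.mllib.linalg.Vectors.fromML(v)).asML
}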

Here is the toy code:

import org.apache.spark.ml.feature.MDLPDiscretizer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[3]")
      .appName("GroupStringIndexer module test")
      .getOrCreate()

    val data = Seq(
      (7, 1, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, 1, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, 0, Vectors.dense(1.0, 0.0, 15.0, 0.0), 0.0)
    )

    val df = spark.createDataFrame(data).toDF("id", "features", "vFeatures", "clicked")

    df.show()

    val discretizer = new MDLPDiscretizer()
      .setMaxBins(10)
      .setMaxByPart(10000)
      .setInputCol("vFeatures")
      .setLabelCol("clicked")
      .setOutputCol("buckedFeatures")

    val result = discretizer.fit(df).transform(df)

    result.show()
   }
}
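As a side note, both the ml and mllib VectorUDT seem to print as "vector", which would explain why the error message reads "requires vector type, however, ... is of vector type". Assuming the df from the code above, one quick way to check which vector type the column actually carries:

println(df.schema("vFeatures").dataType.getClass.getName)
// prints org.apache.spark.ml.linalg.VectorUDT for a column built with ml Vectors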

I have tried to fix this in https://github.com/sramirez/spark-MDLP-discretization/pull/34. @sramirez Thanks!

sramirez commented 7 years ago

Yes, that's correct. We focused on split generation but forgot the transformation part. Fixed.