sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Something wrong with vector? #33

Closed hbghhy closed 7 years ago

hbghhy commented 7 years ago

I just tested some toy code on Spark 2.1.1, and it reports:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(vFeatures)' due to data type mismatch: argument 1 requires vector type, however, 'vFeatures' is of vector type.;;
'Project [id#9, features#10, vFeatures#11, clicked#12, UDF(vFeatures#11) AS buckedFeatures#87]
+- Project [_1#0 AS id#9, _2#1 AS features#10, _3#2 AS vFeatures#11, _4#3 AS clicked#12]
   +- LocalRelation [_1#0, _2#1, _3#2, _4#3]

I have looked at the source code. Is the issue caused by the transform in the ml version of MDLP?

val discModel = new feature.mdlp_discretization.DiscretizerModel(splits)
val discOp = udf { discModel.transform _ }
dataset.withColumn($(outputCol), discOp(col($(inputCol))).as($(outputCol), metadata))

In Spark 2, the ml version uses its own vector type, but discModel.transform needs org.apache.spark.mllib.linalg.Vector. Is that what causes the error above?
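To illustrate the suspected mismatch, here is a minimal sketch of bridging the two vector types. It is not the project's actual fix. `VectorBridge`, `legacyTransform`, and `bridged` are hypothetical names, and `legacyTransform` is only a stand-in for an mllib-based method such as discModel.transform. The conversions `Vectors.fromML` and `asML` are part of Spark 2.x.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector, Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector, Vectors => OldVectors}

object VectorBridge {
  // Hypothetical stand-in for a transform that only accepts mllib vectors.
  // Here it simply floors every component.
  def legacyTransform(v: OldVector): OldVector =
    OldVectors.dense(v.toArray.map(x => math.floor(x)))

  // Adapter: take an ml vector (what a Spark 2 DataFrame column holds),
  // convert it to mllib, apply the legacy transform, and convert back to ml.
  def bridged(v: MLVector): MLVector =
    legacyTransform(OldVectors.fromML(v)).asML
}
```

Under these assumptions, a UDF built as `udf { VectorBridge.bridged _ }` would take ml vectors as input and return ml vectors, so the column type and the UDF argument type would agree.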

Here is the toy code:

import org.apache.spark.ml.linalg.Vectors // assumed: the original listing omits this import
import org.apache.spark.sql.SparkSession

object test {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("GroupStringIndexer module test")
      .getOrCreate()

    val data = Seq(
      (7, 1, Vectors.dense(0.0, 0.0, 18.0, 1.0), 1.0),
      (8, 1, Vectors.dense(0.0, 1.0, 12.0, 0.0), 0.0),
      (9, 0, Vectors.dense(1.0, 0.0, 15.0, 0.0), 0.0)
    )

    val df = spark.sqlContext.createDataFrame(data).toDF("id", "features", "vFeatures", "clicked")

    val discretizer = new MDLPDiscretizer()

    val result =

I have just tried to fix this. @sramirez, thanks!

sramirez commented 7 years ago

Yes, correct. We focused on splits generation, but forgot the transformation part. Fixed.