sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

maxBins not honored #7

Closed: barrybecker4 closed this issue 8 years ago

barrybecker4 commented 8 years ago

I set maxBins to 2 (a value of 1 gives an error) like this:

val discretizer = new MDLPDiscretizer()
  .setMaxBins(2)
  .setMaxByPart(10000)
  .setInputCol("features")  // this must be a feature vector
  .setLabelCol(labelColumn)
  .setOutputCol("bucketFeatures")

This produced 3 bins with these splits: "16.1, 21.05, 30.95, Infinity". Shouldn't there be at most 2 bins? I get the same result if I set maxBins to 1000.

barrybecker4 commented 8 years ago

I made a change for #6 so that there is an initial -Infinity split, and now the splits are:

"-Infinity, 16.1, 21.05, Infinity"

This is better, but I was really expecting just a single non-infinite cutpoint between the two sentinel infinities.
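To make the expectation concrete, here is a small sketch of how the bin count follows from a splits array that includes both sentinels. The names `SplitCounts`, `numBins`, and `finiteCutPoints` are hypothetical helpers for illustration, not part of the library, and the sketch assumes the splits are sorted and bracketed by -Infinity and Infinity.

```scala
object SplitCounts {
  // Assumption: `splits` is sorted and starts/ends with the infinite sentinels,
  // so each adjacent pair of entries delimits exactly one bin.
  def numBins(splits: Seq[Double]): Int = math.max(splits.length - 1, 0)

  // The finite entries are the actual cut points between bins.
  def finiteCutPoints(splits: Seq[Double]): Seq[Double] =
    splits.filterNot(_.isInfinity)

  def main(args: Array[String]): Unit = {
    // Splits reported above after the -Infinity change.
    val observed = Seq(Double.NegativeInfinity, 16.1, 21.05, Double.PositiveInfinity)
    println(numBins(observed))                 // 3 bins
    println(finiteCutPoints(observed).length)  // 2 finite cut points
  }
}
```

With maxBins set to 2, the same helpers would be expected to report 2 bins and 1 finite cut point.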

sramirez commented 8 years ago

It seems there is a bug in lines 155 and 191:

val maxPoints = maxBins + 1

Since n bins are separated by n - 1 cut points, the number of cut points should be:

val maxPoints = maxBins - 1

I'll fix this bug as soon as possible.
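As a rough illustration of the corrected relationship, the sketch below caps a list of cut points at maxBins - 1 and wraps the result in the two infinite sentinels. `CutPointLimit`, `maxCutPoints`, and `toSplits` are hypothetical names, and simply keeping the smallest cut points is an assumption made for brevity; it is not how the discretizer ranks or selects its cut points.

```scala
object CutPointLimit {
  // n bins are separated by n - 1 cut points.
  def maxCutPoints(maxBins: Int): Int = {
    require(maxBins >= 1, "maxBins must be at least 1")
    maxBins - 1
  }

  // Hypothetical helper: keep at most maxBins - 1 cut points (here just the
  // smallest ones, for illustration only) and add the infinite sentinels.
  def toSplits(cutPoints: Seq[Double], maxBins: Int): Seq[Double] = {
    val kept = cutPoints.sorted.take(maxCutPoints(maxBins))
    Double.NegativeInfinity +: kept :+ Double.PositiveInfinity
  }

  def main(args: Array[String]): Unit = {
    // With maxBins = 2, only one finite cut point survives.
    println(toSplits(Seq(16.1, 21.05, 30.95), 2).length)  // 3 entries, i.e. 2 bins
  }
}
```

With the original `maxBins + 1`, the same input and maxBins = 2 would have allowed up to 3 cut points, which matches the extra bins reported above.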

barrybecker4 commented 8 years ago

Sergio, I believe you fixed this, correct? I synced from upstream and got the fix. It looks fine to me, so I think we can close this.