sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Generated splits are missing leading -Infinity #6

Closed barrybecker4 closed 8 years ago

barrybecker4 commented 8 years ago

I hope that @sramirez or someone else familiar with Spark discretizers can tell me if this is a bug. Other discretizers produce splits that have an initial cutpoint of -Infinity and a final cutpoint of Infinity in order to catch data outside the regular bins. The MDLPDiscretizer produces a model with splits like this for the sample data I have tried (one line per feature):

{code}
16.1, 21.05, 30.95, Infinity
5.5, Infinity
97.5, 169.5, Infinity
78.5, 134.0, Infinity
2379.5, 2959.5, Infinity
13.5, 19.5, Infinity
1980.5, Infinity
{code}

It looks to me like the initial -Infinity split is missing from every feature. I think this is a bug. See the unit tests I have added in https://github.com/sramirez/spark-MDLP-discretization/pull/5 for more detail.

barrybecker4 commented 8 years ago

Given the statement below, taken from the Spark 2.0 documentation, it definitely seems like there should be a leading -Infinity in all splits produced.

Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; Otherwise, values outside the splits specified will be treated as errors. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0).
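To make the consequence concrete, here is a minimal sketch of the bucket rule quoted above, written in plain Scala with no Spark dependency. The `bucketOf` helper is hypothetical and is not Spark's implementation; it only mirrors the documented semantics, and it returns `None` where the documentation says an out-of-range value would be treated as an error.

```scala
object BucketLookup {
  // Hypothetical helper (not a Spark API): bucket i covers
  // [splits(i), splits(i + 1)), and the last bucket also includes
  // its upper bound. Assumes splits are strictly increasing.
  def bucketOf(splits: Array[Double], x: Double): Option[Int] = {
    val n = splits.length - 1 // n + 1 splits define n buckets
    if (n < 1 || x.isNaN || x < splits.head || x > splits.last) None
    else if (x == splits.last) Some(n - 1)
    else Some(splits.lastIndexWhere(_ <= x))
  }

  def main(args: Array[String]): Unit = {
    // First feature from the splits reported in this issue.
    val reported = Array(16.1, 21.05, 30.95, Double.PositiveInfinity)
    // The same splits with a leading -Infinity added.
    val padded = Double.NegativeInfinity +: reported

    println(bucketOf(reported, 10.0)) // None: below the first split
    println(bucketOf(padded, 10.0))   // Some(0)
    println(bucketOf(padded, 25.0))   // Some(2)
  }
}
```

With the reported splits, any value below 16.1 has no bucket at all; with the leading -Infinity it falls into bucket 0.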

sramirez commented 8 years ago

My initial design didn't include -Inf as a cut point because I considered it redundant. However, the first discretizer included in Spark does include this point. So we can keep your change, as it follows the Spark doc. Thanks for your help.
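For readers who need to work with splits that lack the infinite bounds, one possible workaround is to pad them before use. This is only a sketch: `withInfiniteBounds` is a hypothetical helper, not code from this repository or from the pull request, and it assumes the input splits are already strictly increasing.

```scala
object PadSplits {
  // Hypothetical helper: ensure a splits array starts with -Infinity
  // and ends with +Infinity so that every Double falls into a bucket.
  def withInfiniteBounds(splits: Array[Double]): Array[Double] = {
    val lower =
      if (splits.headOption.contains(Double.NegativeInfinity)) splits
      else Double.NegativeInfinity +: splits
    if (lower.last == Double.PositiveInfinity) lower
    else lower :+ Double.PositiveInfinity
  }

  def main(args: Array[String]): Unit = {
    // Second feature from the splits reported in this issue.
    val padded = withInfiniteBounds(Array(5.5, Double.PositiveInfinity))
    println(padded.mkString(", "))
  }
}
```

Splits that already carry both bounds pass through unchanged, so the helper can be applied more than once without duplicating either bound.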