sramirez / spark-MDLP-discretization

Spark implementation of Fayyad's discretizer based on Minimum Description Length Principle (MDLP)
Apache License 2.0

Issue 12 stopping criteria bb #15

Closed: barrybecker4 closed this 8 years ago

barrybecker4 commented 8 years ago

This PR adds the stoppingCriteria param. Other changes: I added ThresholdFinder classes to make it clear that there are two alternative strategies for finding thresholds. Since one uses RDDs and the other Arrays, I could not make the APIs identical or share as much code as I wanted, but hopefully it's an improvement. I set the version to 0.2-SNAPSHOT in build.scala to prepare for the next release (which I assume is 0.2). The package is now org.apache.spark instead of just org.apache.
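For context, here is a minimal sketch of the two-strategy split described above. The ThresholdFinder name comes from this PR, but the subclass names, signatures, and the placement of stoppingCriteria are illustrative assumptions, not the project's actual API:

```scala
// Illustrative sketch only; the names and signatures below are assumptions.
import org.apache.spark.rdd.RDD

// Shared contract for the two threshold-finding strategies.
trait ThresholdFinder {
  // MDLP stopping criterion: raising it makes the accept test stricter.
  def stoppingCriteria: Double
}

// Hypothetical strategy for features with many candidate split points,
// where the candidates must stay distributed as an RDD.
class DistributedThresholdFinder(val stoppingCriteria: Double) extends ThresholdFinder {
  def findThresholds(candidates: RDD[(Float, Array[Long])]): Seq[Float] = ???
}

// Hypothetical strategy for features whose candidates fit in a local Array.
class LocalThresholdFinder(val stoppingCriteria: Double) extends ThresholdFinder {
  def findThresholds(candidates: Array[(Float, Array[Long])]): Seq[Float] = ???
}
```

Because findThresholds takes an RDD in one case and an Array in the other, the two methods cannot share a single signature in the trait, which matches the difficulty described above.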

sramirez commented 8 years ago

I've just reviewed all the commits and I have to say it's really nice work. Thanks for your support.

About this topic:

"There was no performance (or correctness) difference between using if (k1 > 0) leftFreqs.sum else 0 and just leftFreqs.sum."

I remember it was motivated by performance, but it does seem useless; I never actually checked the impact of that line.
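The equivalence is easy to demonstrate in isolation. A minimal standalone example with hypothetical values, assuming leftFreqs holds the per-class counts accumulated so far and k1 is their total: when nothing has been accumulated, the array contains only zeros, so the guard saves nothing.

```scala
// Hypothetical values illustrating why the guard is redundant:
// when k1 == 0, leftFreqs holds only zeros, so its sum is already 0.
val leftFreqs: Array[Long] = Array(0L, 0L, 0L) // state when k1 == 0
val k1 = 0L

val guarded   = if (k1 > 0) leftFreqs.sum else 0L
val unguarded = leftFreqs.sum

assert(guarded == unguarded) // both 0; the branch only skips a cheap sum
```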

barrybecker4 commented 8 years ago

Thanks. I have another question. There were two places that used .clone(). I removed it because I don't think it was needed, and the tests pass without it. Can you recall why you added it? I don't want to introduce a bug. It used to be

entropyFreqs = (cand, freq, leftAccum.clone(), rightTotal) +: entropyFreqs
now is

entropyFreqs = (cand, freq, leftAccum, rightTotal) +: entropyFreqs
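Whether dropping .clone() is safe depends on aliasing: if leftAccum is a mutable array that continues to be updated in place after it is prepended to entropyFreqs, every stored tuple ends up referencing the same final array, and the clone was load-bearing. A standalone sketch of that hazard, with hypothetical names:

```scala
// Hypothetical illustration of the aliasing hazard behind the .clone() question.
val acc = Array(0L, 0L)                  // stands in for leftAccum
var stored = Seq.empty[Array[Long]]      // stands in for entropyFreqs

for (_ <- 1 to 2) {
  acc(0) += 1; acc(1) += 1               // in-place accumulation
  stored = acc +: stored                 // all entries alias the same array
  // stored = acc.clone() +: stored      // safe: snapshots the current state
}

println(stored.map(_.mkString(",")))     // List(2,2, 2,2) without the clone;
                                         // with clone it would be List(2,2, 1,1)
```

If the loop instead rebuilds leftAccum as a fresh array on each iteration (for example via zip/map rather than in-place updates), there is no aliasing and removing .clone() is indeed safe, which would explain the tests still passing.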