radanalyticsio / silex

something to help you spark
Apache License 2.0

MedoidClustering - provides kMedoids function #3

Closed erikerlandson closed 9 years ago

erikerlandson commented 9 years ago

Interested in thoughts about interface on this one:

0) Maybe it's best to follow the interface style of Spark's KMeans: a modeling object that takes its parameters via the builder pattern, then obj.train(rdd), etc.

1) Should it be kMedoids(rdd, ...) or rdd.kMedoids(...)?

2) Should I add a variation that operates on Scala sequences instead of an RDD?

3) I never decided what the return type should be. Maybe some kind of MedoidClustering case class?

rnowling commented 9 years ago

0) I vote for using the Spark clustering interface style: an object for setting parameters and creating an instance of a separate clustering class. Users may want to try different clusterings and keep those models around -- the current object approach prevents that.

1) I vote for keeping k-medoids as a separate thing that operates on an RDD instead of adding it as an implicit function. RDD.kMedoids() feels more coupled than KMedoids(RDD) to me.

2) Not sure -- for now, we can create an RDD from a Scala sequence if we need it. If it becomes something we use frequently, it may be worth the effort.

3) I suggest returning a model that can do assignments and return the medoids for each cluster.
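
To make points 0) and 3) concrete, here is a rough sketch of what a Spark-style interface could look like: a trainer configured through builder-style setters, and a separate immutable model that exposes the medoids and assigns points to them. Every name in it (KMedoids, KMedoidsModel, setK, setMaxIterations, predict) is hypothetical and is not the API from this PR or from Spark. It runs on a plain Scala Seq as a stand-in for an RDD so that it has no Spark dependency, and it uses a simple alternating assign/update refinement with a naive initialization rather than whatever the PR actually implements.

```scala
// Hypothetical sketch: none of these names come from silex or Spark.

// Immutable result of training: holds the medoids and assigns points to them.
case class KMedoidsModel[T](medoids: Seq[T], metric: (T, T) => Double) {
  // Index of the medoid closest to `point` (first one wins on ties).
  def predict(point: T): Int =
    medoids.indices.minBy(i => metric(point, medoids(i)))
}

// Trainer configured with builder-style setters, in the spirit of Spark's KMeans.
class KMedoids[T](metric: (T, T) => Double) {
  private var k: Int = 2
  private var maxIterations: Int = 25

  def setK(value: Int): this.type = {
    require(value > 0, "k must be positive")
    k = value
    this
  }

  def setMaxIterations(value: Int): this.type = {
    require(value > 0, "maxIterations must be positive")
    maxIterations = value
    this
  }

  // Runs on a local Seq (assumption: an RDD version would distribute
  // the assignment step instead).
  def train(data: Seq[T]): KMedoidsModel[T] = {
    val candidates = data.distinct
    require(candidates.size >= k, "need at least k distinct points")
    var medoids: Seq[T] = candidates.take(k) // naive, deterministic initialization
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val current = KMedoidsModel(medoids, metric)
      // Assignment step: group points by the index of their nearest medoid.
      val clusters = data.groupBy(x => current.predict(x))
      // Update step: within each cluster, pick the member with the smallest
      // total distance to the other members; keep the old medoid if empty.
      val updated = medoids.indices.map { i =>
        clusters.get(i) match {
          case Some(members) => members.minBy(c => members.map(m => metric(c, m)).sum)
          case None          => medoids(i)
        }
      }
      changed = updated != medoids
      medoids = updated
      iteration += 1
    }
    KMedoidsModel(medoids, metric)
  }
}

// Usage: each call to train returns its own model, so several clusterings
// can be kept around and compared.
val points = Seq(1.0, 2.0, 3.0, 10.0, 11.0, 12.0)
val absDiff = (a: Double, b: Double) => math.abs(a - b)
val model2 = new KMedoids[Double](absDiff).setK(2).train(points)
val model3 = new KMedoids[Double](absDiff).setK(3).setMaxIterations(10).train(points)
```

Because the parameters live on the trainer and the result is a separate value, nothing stops a user from holding model2 and model3 at the same time, which is the limitation of the single-object approach mentioned in 0).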

rnowling commented 9 years ago

Could you add some documentation on the APIs?

willb commented 9 years ago

I agree with @rnowling on the interface; implicits make more sense for transformations that return a new RDD than for trainers that return a model.
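
A small sketch of that distinction, with Seq standing in for an RDD and deliberately trivial logic. All names here (InterfaceStyles, RichPoints, centered, MeanModel, MeanTrainer) are hypothetical and exist only to contrast the two call styles.

```scala
// Hypothetical sketch: these names are illustrative, not from silex or Spark.
object InterfaceStyles {
  // Transformation: returns a new collection of the same kind, so an enriched
  // method reads naturally in a chain, e.g. data.centered.map(...).
  implicit class RichPoints(data: Seq[Double]) {
    def centered: Seq[Double] = {
      require(data.nonEmpty, "data must not be empty")
      val mean = data.sum / data.size
      data.map(_ - mean)
    }
  }

  // Trainer: returns a model rather than a collection, so it is kept as a
  // standalone object instead of being attached to the data type.
  case class MeanModel(mean: Double) {
    def residual(x: Double): Double = x - mean
  }

  object MeanTrainer {
    def train(data: Seq[Double]): MeanModel = {
      require(data.nonEmpty, "data must not be empty")
      MeanModel(data.sum / data.size)
    }
  }
}

import InterfaceStyles._

val data = Seq(1.0, 2.0, 3.0)
val shifted = data.centered           // implicit enrichment: Seq in, Seq out
val model = MeanTrainer.train(data)   // explicit trainer: Seq in, model out
```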

erikerlandson commented 9 years ago

GitHub created a new PR when I renamed the branch: #19. I'm closing this one.