sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

How do we implement K-Medoids in sparklyr [PAM/CLARA: k-means extension]? #2117

Closed: yogi-srcc-iimb closed this issue 3 years ago

yogi-srcc-iimb commented 5 years ago

I know that k-medoids models [PAM (smaller data) / CLARA (bigger data)] are extensions of k-means clustering and are used when the data is subject to noise and outliers. I know how to use them in RStudio.

I want to replicate the same models in sparklyr. I tried searching Google for this, but in vain.

R code for the CLARA model (from the cluster package):

library(cluster)  # clara() is provided by the cluster package
results <- clara(rd_5, 2, metric = "euclidean", stand = FALSE, samples = 5000, pamLike = FALSE)

I want to write the above code in sparklyr. Any ideas, suggestions, or solutions are welcome.

benmwhite commented 5 years ago

K-medoids isn't implemented in base Spark, so this won't be trivial. It will take some legwork to build a function that fits that particular model in a single line of code.

I was able to find this Spark package written in Java, so it may be possible to write R wrappers for its methods, but the package also hasn't been updated in two years.

Another option is to implement your own parallelized k-medoids in R + sparklyr. This is also likely to be time-intensive, since it requires a bit of research on top of the actual coding.
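To give a very rough sketch of that second route (hedged heavily: rd_5_tbl, the sampling fraction, and k below are placeholders, and it assumes your Spark table holds only numeric feature columns), you could mimic CLARA's sample-then-assign idea rather than implement it exactly: pull a sample down to the driver, fit PAM locally with cluster::pam(), then assign every row of the full table to its nearest medoid with spark_apply():

library(sparklyr)
library(dplyr)
library(cluster)

# 1. Pull a manageable sample down to the driver (fraction is a guess).
local_sample <- rd_5_tbl %>%
  sdf_sample(fraction = 0.01, replacement = FALSE, seed = 1) %>%
  collect()

# 2. Fit PAM on the local sample; fit$medoids is a k x p matrix of coordinates.
fit <- pam(local_sample, k = 2, metric = "euclidean")
medoids <- fit$medoids

# 3. Assign every row of the full Spark table to its nearest medoid.
assigned <- spark_apply(
  rd_5_tbl,
  function(df, ctx) {
    # Squared Euclidean distance from each row to each medoid
    d <- apply(ctx$medoids, 1, function(m) rowSums(sweep(as.matrix(df), 2, m)^2))
    d <- matrix(d, ncol = nrow(ctx$medoids))  # guard for single-row partitions
    df$cluster <- max.col(-d)  # index of the nearest medoid
    df
  },
  context = list(medoids = medoids)
)

This is only a sketch, not a faithful CLARA (no repeated sampling, no cost comparison across samples), but it parallelizes the expensive assignment step.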

yogi-srcc-iimb commented 5 years ago

Does PySpark have this solution yet? It would be a great help if you have a link to a PySpark implementation of the PAM/CLARA algorithms.

benmwhite commented 5 years ago

As I mentioned, it's not in base Spark, so you'll have to do the implementation by hand regardless of which Spark API you're using.

EDIT: From what I can tell, the Spark devs intentionally excluded it from the base version due to scalability issues.

yogi-srcc-iimb commented 5 years ago

Okay, thanks for the response. I am dealing with huge data, and k-means isn't giving great results, but CLARA does on a sample dataset in local R... Any other workaround?

benmwhite commented 5 years ago

If you just need to try some alternatives quickly, I'd recommend the other clustering algorithms that DO have Spark/sparklyr implementations. The list is fairly short compared to R's package ecosystem or even scikit-learn, but at least you'll be able to test two of them right away using the same data processing steps you used for ml_kmeans(): ml_bisecting_kmeans() and ml_gaussian_mixture(). I think ml_lda() will require different processing steps, since Spark's implementation is built specifically for document topic modeling rather than general clustering.
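For example, something like this should run with the same features you fed to ml_kmeans() (rd_5_tbl is a placeholder for your Spark table; ~ . treats every column as a feature):

library(sparklyr)

# Same formula interface as ml_kmeans()
bkm <- ml_bisecting_kmeans(rd_5_tbl, ~ ., k = 2)
gmm <- ml_gaussian_mixture(rd_5_tbl, ~ ., k = 2)

# Cluster assignments for the full table
bkm_pred <- ml_predict(bkm, rd_5_tbl)
gmm_pred <- ml_predict(gmm, rd_5_tbl)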

Edit: If you're dealing with a large number of columns, you could also use ml_als() or ft_pca() for dimension reduction before the clustering step.
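A hypothetical sketch of the ft_pca() route (the number of components is made up, and it assumes all columns in rd_5_tbl are numeric):

library(sparklyr)
library(dplyr)

# Assemble the numeric columns into a single vector column, project onto
# the first few principal components, then cluster in the reduced space.
reduced <- rd_5_tbl %>%
  ft_vector_assembler(input_cols = colnames(rd_5_tbl), output_col = "features") %>%
  ft_pca(input_col = "features", output_col = "pca_features", k = 5)

km <- ml_kmeans(reduced, k = 2, features_col = "pca_features")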