Closed by yogi-srcc-iimb 3 years ago
K-medoids isn't implemented in base Spark, so this won't be trivial. It will take some legwork to build a function that fits that particular model in a single line of code.
I was able to find this Spark package written in Java, so it may be possible to write R wrappers for its methods, but the package hasn't been updated in 2 years.
Another way could be to implement your own version of parallelized k-medoids in R+sparklyr. This is also likely to be time-intensive since it requires a bit of research on top of the actual coding.
Does PySpark have a solution for this yet? It would be a great help if you have a link to a PySpark implementation of the PAM/CLARA algorithms.
As I mentioned, it's not in base Spark, so you'll have to do the implementation by hand regardless of which Spark API you're using.
EDIT: From what I can tell, the Spark devs intentionally excluded it from the base version because of scalability issues.
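To give a sense of what "by hand" involves, here is a minimal sketch of the CLARA idea in plain Python (standard library only, no Spark): draw a few random samples, run a naive swap-based PAM on each sample, score the resulting medoids against the full data, and keep the best set. The function names (`pam`, `clara`, `total_cost`) and the defaults are made up for illustration, and the swap rule is a simplification rather than the exact BUILD/SWAP procedure from the original PAM or from R's cluster package.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(euclidean(p, m) for m in medoids) for p in points)


def pam(points, k, rng):
    """Naive PAM: random initial medoids, then accept any cost-lowering swap."""
    medoids = rng.sample(points, k)
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing medoid i with this candidate point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """CLARA sketch: PAM on random samples, scored on the full data set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        cost = total_cost(points, medoids)  # scored on ALL points, not the sample
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every point to the index of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: euclidean(p, best_medoids[j]))
        for p in points
    ]
    return best_medoids, labels, best_cost


# Toy data (assumed for the demo): two well-separated 2-D blobs.
data_rng = random.Random(42)
blob_a = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
blob_b = [(data_rng.gauss(20, 1), data_rng.gauss(20, 1)) for _ in range(100)]
points = blob_a + blob_b

medoids, labels, cost = clara(points, k=2)
print(medoids, round(cost, 1))
```

Only the sample-level PAM step is expensive per point pair. The full-data scoring and the final assignment are single passes over the rows, which is presumably the part you would push to Spark in a real implementation.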
Okay, thanks for the response. I am dealing with huge data, and k-means is not giving great results, but CLARA does on a sample dataset in local R ... Is there any other workaround?
If you just need to try some alternatives quickly, then I'd recommend the other clustering algorithms that DO have Spark/sparklyr implementations. While this list is fairly short compared to R's package ecosystem or even just scikit, at least you'll be able to test two of them right away using the same data processing steps you used for ml_kmeans(): ml_bisecting_kmeans() and ml_gaussian_mixture(). I think ml_lda() will require different processing steps since Spark's implementation is built specifically for document topic modeling applications rather than general clustering.
Edit: If you're dealing with a large number of columns, then maybe you could also use ml_als() or ft_pca() for dimension reduction before the clustering step.
I know that k-medoids models [PAM (smaller data) / CLARA (bigger data)] are extensions of k-means clustering models and are used when the data is subject to noise and outliers. I know how to use them in RStudio.
I want to replicate the same models in sparklyr as well. I tried searching Google for this, but in vain.
RStudio code for the CLARA model:
library(cluster)  # provides clara()
results <- clara(rd_5, 2, metric = "euclidean", stand = FALSE, samples = 5000, pamLike = FALSE)
I want to write the above code in sparklyr. Any ideas, suggestions, or solutions are welcome.