jasp-stats / jasp-issues

This repository is solely meant for reporting of bugs, feature requests and other issues in JASP.

[Feature Request]: extend clustering algorithms to take account of categorical data and mixed data #2822

Open TarandeepKang opened 1 month ago

TarandeepKang commented 1 month ago

Description

No response

Purpose

Improve the range of data types that can be clustered

Use-case

No response

Is your feature request related to a problem?

Currently, categorical and mixed-type (numeric plus categorical) data cannot be clustered in JASP.

Is your feature request related to a JASP module?

Machine Learning

Describe the solution you would like

k-prototypes clustering and Gower distances

Describe alternatives that you have considered

No response

Additional context

k-prototypes clustering (Huang, 1998) using the clustMixType package, and perhaps Gower distances (Gower, 1971) via the gower package. I have also included a few reviews covering the wide variety of other methods.

Ahmad, A., & Khan, S. S. (2019). Survey of State-of-the-Art Mixed Data Clustering Algorithms. IEEE Access, 7, 31883–31902. https://doi.org/10.1109/ACCESS.2019.2903568

Gower, J. C. (1971). A General Coefficient of Similarity and Some of Its Properties. Biometrics, 27(4), 857–871. https://doi.org/10.2307/2528823

Huang, Z. (1998). Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 2(3), 283–304. https://doi.org/10.1023/A:1009769707641

Hunt, L., & Jorgensen, M. (2011). Clustering mixed data. WIREs Data Mining and Knowledge Discovery, 1(4), 352–361. https://doi.org/10.1002/widm.33

McParland, D., & Gormley, I. C. (2016). Model based clustering for mixed data: clustMD. Advances in Data Analysis and Classification, 10(2), 155–169. https://doi.org/10.1007/s11634-016-0238-x

Szepannek, G. (2018). clustMixType: User-Friendly Clustering of Mixed-Type Data in R. The R Journal, 10(2), 200–208.

van de Velden, M., Iodice D’Enza, A., & Markos, A. (2019). Distance-based clustering of mixed data. WIREs Computational Statistics, 11(3), e1456. https://doi.org/10.1002/wics.1456
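To make the k-prototypes request more concrete, here is a minimal sketch of what a call could look like, assuming the clustMixType package is installed. The toy data frame and its column names are made up for illustration, and this is not meant as a proposal for how JASP should implement it.

```r
# Minimal k-prototypes sketch. `toy` and its columns are hypothetical
# illustration data, not taken from any real dataset.
set.seed(1)
n <- 40
toy <- data.frame(
  income = c(rnorm(n / 2, mean = 30, sd = 3), rnorm(n / 2, mean = 60, sd = 3)),
  age    = c(rnorm(n / 2, mean = 25, sd = 2), rnorm(n / 2, mean = 50, sd = 2)),
  region = factor(rep(c("north", "south"), each = n / 2))
)

# Only run the clustering step if clustMixType is available.
if (requireNamespace("clustMixType", quietly = TRUE)) {
  # kproto() expects a data frame with numeric and factor columns.
  # lambda trades off the categorical against the numeric part of the
  # distance; when it is left at NULL it is estimated from the data.
  fit <- clustMixType::kproto(x = toy, k = 2, verbose = FALSE)

  # Cluster prototypes: means for numeric columns, modes for factors.
  print(fit$centers)

  # Cross-tabulate cluster membership against the categorical variable.
  print(table(cluster = fit$cluster, region = toy$region))
}
```

The number of clusters `k = 2` is fixed here only because the toy data were generated with two groups; in practice it would be a user option, as in the other clustering analyses.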

patc3 commented 1 month ago

Regarding Gower distances: I usually calculate Gower distances and input those into k-medoids, both using the cluster package in R, something like this:

distances <- cluster::daisy(x = df, metric = "gower")
cl <- cluster::pam(x = distances, k = k, diss = TRUE)  # k is the number of clusters

TarandeepKang commented 1 month ago

Oh yes, the daisy function in the cluster package is another way to get at Gower distances! Since you already use other functions from cluster elsewhere, that might be preferable.
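For anyone who wants to try the daisy plus k-medoids route on mixed data, here is a self-contained sketch using only the cluster package. The toy data frame and its column names are invented for illustration, and picking k by average silhouette width is just one common heuristic that I am assuming here, not something settled in this thread.

```r
# Gower dissimilarities + k-medoids (PAM) on a hypothetical mixed-type
# data frame: one numeric, one nominal and one ordinal column.
set.seed(42)
n <- 30
toy <- data.frame(
  score  = c(rnorm(n / 2, mean = 10, sd = 1), rnorm(n / 2, mean = 20, sd = 1)),
  group  = factor(rep(c("a", "b"), each = n / 2)),
  rating = factor(sample(c("low", "mid", "high"), n, replace = TRUE),
                  levels = c("low", "mid", "high"), ordered = TRUE)
)

# daisy() handles numeric, factor and ordered columns; with
# metric = "gower" every pairwise dissimilarity lies in [0, 1].
distances <- cluster::daisy(x = toy, metric = "gower")

# Assumed heuristic: choose k by the largest average silhouette width.
ks <- 2:4
avg_sil <- sapply(ks, function(k) {
  cluster::pam(x = distances, k = k, diss = TRUE)$silinfo$avg.width
})
best_k <- ks[which.max(avg_sil)]

# Final k-medoids fit on the precomputed dissimilarities.
cl <- cluster::pam(x = distances, k = best_k, diss = TRUE)
print(table(cluster = cl$clustering, group = toy$group))
```

Because `pam()` works directly on the dissimilarity object, the same pattern should carry over to any other mixed-data distance that can be supplied as a `dist`-like object.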