src-d / kmcuda

Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
Other
806 stars 145 forks source link

Feature Request: K-Prototypes or similar implementation for clustering mixed data #118

Open plutonium-239 opened 3 years ago

plutonium-239 commented 3 years ago

I would love to have an optimised (or at least CUDA) implementation of the K-Prototypes algorithm (package that I use: kmodes, since a lot of data science deals with categorical data, and it would be great if I don't have to use TargetEncoders or worse, pd.get_dummies() for categorical data with a lot of categories. Right now, the solution that I use is using a TargetEncoder on the categorical variables and then using the kmeans/knn in this package, which I feel is a little 'fix'-ey, because of numerical data being continuous and having some relations, whereas it is not necessary for the categorical variables to have any relations (greater than/less than)