lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

How to handle categorical variables in Parametric UMAP? #873

Open SoulEvill opened 2 years ago

SoulEvill commented 2 years ago

First of all, thank you so much for releasing Parametric UMAP; it works like a dream. I can now project unseen data in just a second, which really helps with my use case.

Currently I have a mixed dataset, and I simply one-hot encode the categorical variables. I have read some older posts from before Parametric UMAP was available; the approach there was to split the numerical and categorical features into two sets, use a Jaccard or Dice distance metric for the categorical one, and then combine the two. I am wondering whether that is still the best way to go with Parametric UMAP, or whether there is a way to fit the embedding through umap_loss.

Thanks in advance!

lmcinnes commented 2 years ago

I would get @timsainb to weigh in on the more detailed aspects. ParametricUMAP doesn't support combining models in the same way, so you will need to come up with a reasonable distance metric across the combined data. I would suggest that part of the answer lies in the fact that you can design and use whatever neural network architecture you want within ParametricUMAP. For example, I know Tim used convolutional networks specifically for the image datasets, and RNNs for some of the sequence-type datasets. That means you can use whatever style of network would work best for the mixed data, and that will help with the optimization phase.

As for what metric to use for at least the distance computation part, perhaps some variant of Gower distance would work well enough?
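
For readers unfamiliar with Gower distance, here is a minimal pure-Python sketch of the basic, unweighted form: numeric columns contribute their absolute difference scaled by the column range, categorical columns contribute 0 on a match and 1 on a mismatch, and the result is the mean over all columns. The function name `gower_distance_matrix`, the `categorical` argument, and the toy rows are illustrative assumptions, not part of umap, and this is only one possible variant rather than a recommended implementation.

```python
def gower_distance_matrix(rows, categorical):
    """Pairwise Gower distances for a list of mixed-type records.

    rows        -- list of equal-length tuples (hypothetical toy format)
    categorical -- set of column indices to treat as categorical
    """
    n_cols = len(rows[0])

    # Range of each numeric column, used to scale differences into [0, 1].
    ranges = {}
    for c in range(n_cols):
        if c not in categorical:
            column = [r[c] for r in rows]
            ranges[c] = max(column) - min(column)

    def pair_distance(a, b):
        total = 0.0
        for c in range(n_cols):
            if c in categorical:
                # Simple matching: 0 if the categories agree, 1 otherwise.
                total += 0.0 if a[c] == b[c] else 1.0
            elif ranges[c] > 0:
                # Range-normalised absolute difference for numeric columns.
                # Constant columns (range 0) contribute nothing.
                total += abs(a[c] - b[c]) / ranges[c]
        return total / n_cols

    return [[pair_distance(a, b) for b in rows] for a in rows]


# Toy data: two numeric columns and one categorical column (index 2).
rows = [
    (1.0, 10.0, "red"),
    (3.0, 10.0, "blue"),
    (5.0, 20.0, "red"),
]
D = gower_distance_matrix(rows, categorical={2})
```

The resulting square matrix is symmetric with zeros on the diagonal. Note that this sketch treats the original categorical column directly, so it would be applied before one-hot encoding rather than after.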