lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

How to project numerical and categorical data? #104

Open RaulRomaniF opened 6 years ago

RaulRomaniF commented 6 years ago

I want to project the Titanic dataset (https://www.kaggle.com/c/titanic/data), which contains both categorical and numerical data.

In this video (https://www.youtube.com/watch?v=YPJQydzTLwQ, at 48:08), it is said that UMAP can combine multiple data types. How is that done?

One approach would be to project the numerical data and the categorical data separately, and then combine them in the same space. Is that the way to go?

Thank you for your time.

lmcinnes commented 6 years ago

This can be done in theory; in practice I am still working on the code to do this, so it isn't available in the repository yet. This may not be the answer you are looking for. As an interim step you can check issue #58, which provides a simple recipe to do this in straightforward cases.

asdspal commented 5 years ago

Data with both categorical and numerical types can be handled using the Gower distance metric. You can download the code for the Gower distance metric from here. It might be available in a coming scikit-learn release.

lmcinnes commented 5 years ago

While Gower distance is quite useful, it is also somewhat heuristic. I would recommend exploring it as one of the options for handling mixed continuous and categorical data.
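
For readers who want to try this, here is a minimal sketch of a Gower-style distance matrix in plain NumPy. It is not the code linked above and not part of UMAP; the function name `gower_distance_matrix` and the toy Titanic-like rows are made up for illustration. It assumes equal feature weights and no missing values (the real Titanic data has missing ages, which would need handling first).

```python
import numpy as np


def gower_distance_matrix(numeric, categorical):
    """Pairwise Gower-style distance (hypothetical helper, equal weights).

    numeric:     (n_samples, n_numeric) array of numbers
    categorical: (n_samples, n_categorical) array of labels
    """
    numeric = np.asarray(numeric, dtype=float)
    categorical = np.asarray(categorical, dtype=object)
    n_features = numeric.shape[1] + categorical.shape[1]

    # Numeric part: absolute difference scaled by each column's range.
    ranges = numeric.max(axis=0) - numeric.min(axis=0)
    ranges[ranges == 0] = 1.0  # a constant column contributes zero distance
    num_part = (np.abs(numeric[:, None, :] - numeric[None, :, :]) / ranges).sum(axis=2)

    # Categorical part: 0 when the labels match, 1 when they differ.
    cat_part = (categorical[:, None, :] != categorical[None, :, :]).sum(axis=2)

    # Average the per-feature distances so the result lies in [0, 1].
    return (num_part + cat_part) / n_features


# Toy rows: (Age, Fare) and (Sex, Embarked); values are illustrative only.
numeric = [[22.0, 7.25], [38.0, 71.28], [26.0, 7.92]]
categorical = [["male", "S"], ["female", "C"], ["female", "S"]]
D = gower_distance_matrix(numeric, categorical)

# Assuming umap-learn is installed, the matrix can be passed as precomputed:
# import umap
# embedding = umap.UMAP(metric="precomputed").fit_transform(D)
```

The full pairwise matrix needs memory quadratic in the number of rows, so this only suits small to medium datasets.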

acilingi commented 2 years ago

Is it possible to create indicator variables from categorical variables?

lmcinnes commented 2 years ago

One approach is pd.get_dummies, but you may also want to look at the dirty-cat library for richer options.
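
A small sketch of the pd.get_dummies route, using made-up Titanic-like rows. The min-max scaling of the numeric columns is an added assumption so that they sit on the same 0 to 1 scale as the indicators; it is not something recommended in this thread.

```python
import pandas as pd

# Toy rows with Titanic-like column names; values are illustrative only.
df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# One indicator (0/1) column per category of each categorical variable.
encoded = pd.get_dummies(df, columns=["Sex", "Embarked"], dtype=float)

# Assumption: rescale numeric columns to [0, 1] to match the indicators.
numeric_cols = ["Age", "Fare"]
col_min = encoded[numeric_cols].min()
col_max = encoded[numeric_cols].max()
encoded[numeric_cols] = (encoded[numeric_cols] - col_min) / (col_max - col_min)

# Assuming umap-learn is installed, the table can then be embedded directly:
# import umap
# embedding = umap.UMAP().fit_transform(encoded.to_numpy())
```

How the numeric columns are scaled relative to the indicators changes which features dominate the distances, so it is worth trying more than one scaling.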