lmcinnes / umap

Uniform Manifold Approximation and Projection

Can I use it to embed categorical variables as numeric variables? #241

Open sungreong opened 5 years ago

sungreong commented 5 years ago

I want to project a categorical variable to a low-dimensional space. I found UMAP after looking at various options. Is it reasonable to use an input that contains only one categorical variable together with the target value?

lmcinnes commented 5 years ago

Sorry, I'm not quite clear on what you are asking. Can you give me an example of what your data looks like and what you are trying to do? A solution may be possible, but I'm not sure.

sungreong commented 5 years ago

Thank you for your reply!! For example, if I have a 'Location' variable in my data, I would like to reduce the dimensionality of this variable because its cardinality is too large. I want to do something like this: one-hot encoding [N, 49] -> embedding [N, 2] (I don't want to use one-hot encoding directly.)

Is it possible? And I would like to understand why or why not. When I tested it, I got the following picture:

[image]

lmcinnes commented 5 years ago

If it is just one variable then I don't think that is going to achieve very much -- you effectively have a discrete topological space where every element is equidistant from every other element, and there isn't anything sensible UMAP can do with that. If you were trying to fold together a decent number of distinct categorical variables then you could try one-hot encoding and using metric='dice'.
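For concreteness, here is a minimal sketch of that suggestion; the DataFrame and its column names are hypothetical, not from this thread:

import numpy as np
import pandas as pd
import umap

# Hypothetical categorical data; replace with your own DataFrame
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "location": rng.choice(["NY", "LA", "SF", "TX"], size=500),
    "device": rng.choice(["ios", "android", "web"], size=500),
    "plan": rng.choice(["free", "pro"], size=500),
})

# One-hot encode every categorical column into one binary matrix
one_hot = pd.get_dummies(df).to_numpy(dtype=float)

# Dice distance compares binary vectors, so UMAP gets meaningful
# dissimilarities between the one-hot encoded rows
embedding = umap.UMAP(metric="dice").fit_transform(one_hot)
print(embedding.shape)  # (500, 2)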

sungreong commented 5 years ago

Thank you for your reply!! So if I put multiple categorical or numeric variables in together, can I create a new derived feature, since there is a feature-combination effect? Thank you.

lmcinnes commented 5 years ago

Yes, it's looking at how the several different categorical variables correlate across samples, and that gives it enough traction to have something to work with.

sungreong commented 5 years ago

Thanks!! I ran two experiments. In the first, I fit UMAP on the whole dataset (fit(x, y)) and then split into train and test; the performance with xgboost was 91%. In the second, I split into train and test first, fit UMAP on the training data only, and then embedded the test data with transform; the performance was 80%. Is this related to overfitting?

######## first case: performance 90% ##############
import umap
import xgboost as xgb
from sklearn.model_selection import train_test_split

# total_x, total_y: the full feature matrix and binary labels

# Fit UMAP on the *whole* dataset (train and test together)
embedding = umap.UMAP(n_components=40,
                      n_neighbors=30,
                      min_dist=0.9,
                      metric="hamming").fit_transform(total_x, total_y)
X_train, X_test, y_train, y_test = train_test_split(
    embedding, total_y, test_size=0.2, random_state=42)
xgb_param = {'max_depth': 7, 'objective': 'binary:logistic',
             'learning_rate': 0.001, 'tree_method': 'gpu_hist'}
xgb_model = xgb.XGBClassifier(**xgb_param)
xgb2 = xgb_model.fit(X_train, y_train)
xgb_predictions = xgb2.predict(X_test)
xgb_prob = xgb2.predict_proba(X_test)[:, 1]

############## second case: performance 80% ##########
# Split first, then fit UMAP on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    total_x, total_y, test_size=0.2, random_state=42)
mapper = umap.UMAP(n_components=40,
                   n_neighbors=30,
                   min_dist=0.9,
                   metric="hamming").fit(X_train, y_train)
train_embedding = mapper.transform(X_train)
test_embedding = mapper.transform(X_test)
xgb_model = xgb.XGBClassifier(**xgb_param)
xgb2 = xgb_model.fit(train_embedding, y_train)
xgb_predictions = xgb2.predict(test_embedding)
xgb_prob = xgb2.predict_proba(test_embedding)[:, 1]
adelejackson commented 5 years ago

This doesn't surprise me. You are in effect fitting your model to your test data in the first case.

You are using UMAP in a supervised way -- giving it the y-coordinates. This means that UMAP will try to put points with the same target value close together, and points with different target values far apart. (It uses both the metric on the x-values and the distance between the y-values.)

So when you do this embedding, to some approximation, you are saying "give me an embedding such that the sets of points with different y-values are separable". In the first case, when you embed using the test data as well, you are basically ensuring that you will later be able to separate the test points with different y-values. In the second case, UMAP doesn't "see" the test data, and so, as we observe, it produces a model that doesn't separate the test points nearly as well.

I would recommend either fitting only using the training data, or if you really want to look at the structure of all your data, when you give UMAP the y-labels, replace the target values for your validation data with -1 to indicate no label. (See this section of the docs.)
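A rough sketch of that second option, reusing the total_x / total_y names from the code above; marking points with -1 is the semi-supervised UMAP convention for "no label", and the labels are assumed to be integer-coded:

import numpy as np
import umap
from sklearn.model_selection import train_test_split

# Split on indices so we know which rows are held out
train_idx, test_idx = train_test_split(
    np.arange(len(total_y)), test_size=0.2, random_state=42)

# Copy the labels and mark the test points as unlabeled (-1),
# so supervised UMAP only uses the training labels
masked_y = np.asarray(total_y).copy()
masked_y[test_idx] = -1

# UMAP sees all points, but only the training labels
mapper = umap.UMAP(metric="hamming").fit(total_x, y=masked_y)
embedding = mapper.embedding_

X_train, X_test = embedding[train_idx], embedding[test_idx]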

sungreong commented 5 years ago

Thank you for your reply!!

So I visualized the embeddings. As you can see below, on the training data UMAP is given the labels and learns to separate the classes well. But when I apply the model to the test data, the result looks very different from what I expected. So it seems to cause overfitting in tree-based models. I simply want to use this embedding as a new derived feature, not for clustering, but I have become skeptical about this approach. Is my thinking wrong?

[image]

lmcinnes commented 5 years ago

Ultimately supervised UMAP will be similar in some ways to a KNN classifier. If your data is such that a KNN classifier can't do well, then supervised UMAP is not going to be the best choice. That seems to potentially be the case with your data here.
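One way to sanity-check that is to run a plain KNN classifier on the raw features before any embedding; this sketch assumes the X_train / X_test / y_train / y_test split of total_x from the second case above:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Baseline on the raw (pre-embedding) features, mirroring the
# hamming metric and neighbor count used for UMAP above
knn = KNeighborsClassifier(n_neighbors=30, metric="hamming")
knn.fit(X_train, y_train)
print(accuracy_score(y_test, knn.predict(X_test)))

If this baseline is weak, the supervised UMAP embedding is unlikely to give a tree-based model much to work with either.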