sungreong opened this issue 5 years ago
Sorry, I'm not quite clear on what you are asking. Can you give me an example of what your data looks like and what you are trying to do? A solution may be possible, but I'm not sure.
Thank you for your reply!! For example, if I have a 'Location' variable in my data, I would like to reduce the dimensionality of this variable because its cardinality is too large. I want to go from a one-hot encoding of shape [N, 49] to an embedding of shape [N, 2] (I don't like one-hot encoding).
Is it possible? I would also like to understand why or why not. When I tested it, I got the following picture.
If it is just one variable then I don't think that is going to achieve very much -- you effectively have a discrete topological space where every element is equidistant from every other element, and there isn't anything sensible UMAP can do with that. If you were trying to fold together a decent number of distinct categorical variables then you could try one-hot encoding and using metric='dice'.
Thank you for your reply!! So if I put multiple categorical or numeric variables together, can UMAP create a new derived variable that captures feature-combination effects? Thank you.
Yes. It is looking at how the several different categorical variables correlate between samples, and that gives it enough traction to have something to work with.
Thanks!! I ran two experiments. In the first, I fit UMAP on the whole data (fit(x, y)) and then split into train and test; XGBoost performance was 91%. In the second, I split first, fit UMAP on the train data only, and then embedded the test data with transform; performance dropped to 80%. Is this related to overfitting?
######## first case: performance ~90% ##############
embedding = umap.UMAP(n_components=40,
                      n_neighbors=30,
                      min_dist=0.9,
                      metric="hamming").fit_transform(total_x, total_y)
X_train, X_test, y_train, y_test = train_test_split(
    embedding, total_y, test_size=0.2, random_state=42)
xgb_param = {'max_depth': 7, 'objective': 'binary:logistic',
             'learning_rate': 0.001, 'tree_method': 'gpu_hist'}
xgb_model = xgb.XGBClassifier(**xgb_param)
xgb2 = xgb_model.fit(X_train, y_train)
xgbpredictions = xgb2.predict(X_test)
xgbprob = xgb2.predict_proba(X_test)[:, 1]

############## second case: performance ~80% ##########
X_train, X_test, y_train, y_test = train_test_split(
    total_x, total_y, test_size=0.2, random_state=42)
mapper = umap.UMAP(n_components=40,
                   n_neighbors=30,
                   min_dist=0.9,
                   metric="hamming").fit(X_train, y_train)
train_embedding = mapper.transform(X_train)
test_embedding = mapper.transform(X_test)
xgb_param = {'max_depth': 7, 'objective': 'binary:logistic',
             'learning_rate': 0.001, 'tree_method': 'gpu_hist'}
xgb_model = xgb.XGBClassifier(**xgb_param)
xgb2 = xgb_model.fit(train_embedding, y_train)
xgbpredictions = xgb2.predict(test_embedding)
xgbprob = xgb2.predict_proba(test_embedding)[:, 1]
This doesn't surprise me. You are in effect fitting your model to your test data in the first case.
You are using UMAP in a supervised way -- giving it the y-coordinates. This means that UMAP will try to put points with the same target value close together, and points with different target values far apart. (It uses both the metric on the x-values and the distance between the y-values.)
So when you do this embedding, to some approximation, you are saying "give me an embedding such that the sets of points with different y-values are separable". In the first case, when you embed using the test data as well, you are basically ensuring that you will later be able to separate the test points by y-value. In the second case, UMAP doesn't "see" the test data, so, as we observe, it produces a model that doesn't separate the test points nearly as well.
I would recommend either fitting only using the training data, or if you really want to look at the structure of all your data, when you give UMAP the y-labels, replace the target values for your validation data with -1 to indicate no label. (See this section of the docs.)
Thank you for your reply!!
So I visualized the embeddings. As shown below, on the training data UMAP learns to separate the labels well, but when I apply the model to the test data the result looks quite different from what I expected. So it seems to cause overfitting in tree-based models. I simply want to use this embedding as a new derived feature, not for clustering, but I have become skeptical about this approach. Is my opinion wrong?
Ultimately supervised UMAP will be similar in some ways to a KNN classifier. If your data is such that a KNN-classifier can't do well, then supervised UMAP is not going be the best choice. That potentially seems to be the case with your data here.
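Given that analogy, a cheap sanity check (a sketch with synthetic data, not part of the original thread) is to cross-validate a plain KNN classifier on the raw features first; if it scores near chance, supervised UMAP embeddings are unlikely to generalise either.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Random features and labels stand in for the real data, so KNN
# should score around chance here.
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 20))
y = rng.integers(0, 2, size=300)

knn = KNeighborsClassifier(n_neighbors=15, metric="hamming")
scores = cross_val_score(knn, X, y, cv=5)
print(scores.mean())
```

If this baseline is already strong on your real data, then supervised UMAP followed by XGBoost has a reasonable chance of working; if not, the embedding is unlikely to rescue it.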
I want to project a categorical variable into a low-dimensional space, and I found UMAP after looking into various options. Is it reasonable to use an input consisting of only one categorical variable together with the target value?