lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

How to prevent overfitting in supervised mode? #148

Open miclegr opened 5 years ago

miclegr commented 5 years ago

Using standard parameters in supervised mode with a dichotomous response variable (0/1) and an embedding in 2 dimensions, I fit on 80% of the data (800k points, 20 cols) and transform both that 80% and the held-out 20%.

These are the results I'm getting:

[image: UMAP embeddings of the training data vs. the held-out test data]

As you can see, for the training data I get a clear separation between zeros and ones, but on the test dataset no clear picture emerges. This is clear evidence of overfitting (for reference, fitting xgboost on the same dataset yields an AUC of 0.8).

Which parameters can I tweak to avoid this effect?

Thanks for the support, Michele.

lmcinnes commented 5 years ago

I don't believe you'll manage to completely alleviate the problems given the results you are seeing. One way to think about this is that if a KNN-Classifier struggles to perform well on your data then a supervised UMAP will similarly struggle. I suspect a KNN-Classifier does not do well on that dataset.
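The sanity check suggested above is easy to run before investing more time in supervised UMAP. A minimal sketch with scikit-learn, using a synthetic stand-in for the real 800k x 20 dataset (the dataset, `n_neighbors`, and CV settings here are illustrative assumptions, not from the thread):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Toy stand-in for the real dataset (20 features, binary labels).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# If a plain k-NN classifier already generalizes poorly here,
# supervised UMAP will likely also fail to separate the classes
# on held-out data, since both rely on k-neighbor structure.
knn = KNeighborsClassifier(n_neighbors=15)
scores = cross_val_score(knn, X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```

If the cross-validated score is near chance, the problem is with the neighbor structure of the data itself, not with UMAP's parameters.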

That being said, you can mitigate things a little. One option is to reduce the weight that UMAP puts on labelling. The relevant parameter is target_weight. The default value is 0.5, which balances roughly equally between data and label in importance. You can reduce the value to give more weight to the data representation (which will reduce how well the separation occurs in the training set). Another alternative is to mask some labels for training (assign a random selection of labels the value -1 for 'unknown').
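Both mitigations above can be sketched as follows. The `mask_labels` helper is hypothetical (not part of umap-learn); the umap-learn calls are shown commented out since they need the real feature matrices (`X_train`, `X_test` are assumed names):

```python
import numpy as np

# Hypothetical helper: mark a random fraction of labels as -1
# ("unknown"), which supervised UMAP treats as unlabelled points.
def mask_labels(y, frac=0.5, seed=0):
    rng = np.random.default_rng(seed)
    y_masked = np.asarray(y).copy()
    idx = rng.choice(len(y_masked), size=int(frac * len(y_masked)),
                     replace=False)
    y_masked[idx] = -1
    return y_masked

y = np.array([0, 1] * 50)
y_semi = mask_labels(y, frac=0.3)  # 30% of labels hidden

# Supervised fit with reduced label influence (target_weight
# defaults to 0.5; lower values weight the data representation more):
# import umap
# reducer = umap.UMAP(target_weight=0.2)
# emb_train = reducer.fit_transform(X_train, y=y_semi)
# emb_test = reducer.transform(X_test)
```

Lowering `target_weight` and masking labels both trade training-set separation for embeddings that lean more on the data geometry, which is what you want if the goal is generalization to held-out points.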

miclegr commented 5 years ago

Thank you for the prompt answer! Adjusting target_weight should definitely improve the situation. One thing I forgot to mention is that the classes are imbalanced (13%/87%); I'll also try balancing them, fitting, and then predicting on the original dataset. Let's see what happens.

abisi commented 5 years ago

Hello @miclegr,

I think I am also dealing with the problem of overfitting in supervised UMAP. You said you'd try changing the target_weight parameter; how did that work for you? For me (or for my data set), reducing the parameter does not solve the issue at all.

lmcinnes commented 5 years ago

A target_weight of 0.0 should be equivalent to not using the target data at all; in what sense is it overfitting? Your problem may simply not be amenable to this sort of approach, which fundamentally depends on k-neighbor structure.

Clyde-fare commented 4 years ago

Hi, I'm also not getting the behaviour I expect from target_weight. It does seem to alter the distribution of the targets but setting it to 0.0 does not appear to be the same as not using target data (using v0.3.10). Even with target_weight of 0.0 I still see significant separation based on targets (which I don't get if I omit the targets during the call to fit_transform). Any ideas?

lmcinnes commented 4 years ago

Due to how things had to be implemented to make computation tractable, 0.0 indeed does not act purely as if there were no target. It is, however, as close as one can get while still relating to the target, so I'm not sure how to do any better with the current approach. Sorry.

Clyde-fare commented 4 years ago

Ok, understood; thanks for replying so quickly. Amazing library btw. 👍

davidhaslacher commented 1 year ago

Hello @lmcinnes, I have a dataset where a KNN-Classifier performs well before applying UMAP, but poorly after applying UMAP. This is the case even when I reduce the target_weight parameter. Do you have some other suggestions for things I could try? Thanks for this great library!