lmcinnes / umap

Uniform Manifold Approximation and Projection

UMAP results not robust to StratifiedCV #488

Open · rxjx opened this issue 3 years ago

rxjx commented 3 years ago

This is something I've been able to reproduce across three different datasets, although all three are text embeddings. Each is a binary classification problem where the positive class frequency is roughly 1%, the dataset size is ~3-5k samples, and there are ~500 features. I've tried multiple parameter combinations for n_neighbors, target_weight, and min_dist. All plots below use the following parameters: (n_neighbors=50, metric='cosine', random_state=42, target_weight=0.9).
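For concreteness, a minimal sketch of that setup (X and y are hypothetical stand-ins for the embedding matrix and binary label vector; min_dist is left at its default here):

```python
import umap

# Hypothetical data: X is an (n_samples, ~500) array of text embeddings,
# y is a binary label vector with roughly 1% positives.
reducer = umap.UMAP(
    n_neighbors=50,
    metric='cosine',
    random_state=42,
    target_weight=0.9,
)

# Fully supervised fit: every point contributes its label.
supervised_embedding = reducer.fit_transform(X, y=y)
```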

If I take all my labeled data and run UMAP in supervised mode, I get a nice separation between my positive and negative classes, like so:

[image: supervised UMAP embedding with the two classes clearly separated]

However, if I hold out 10% of the data using sklearn's StratifiedShuffleSplit and set those labels to -1, the resulting plots look very different, even if I raise n_neighbors to 100:

[image: semi-supervised UMAP embedding with 10% of labels masked as -1]
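A sketch of the masking step described above, continuing from the previous snippet (the split parameters are illustrative):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hold out a stratified 10% and mark it as unlabeled; UMAP follows the
# sklearn convention of treating -1 as "no label" for categorical targets.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X, y))

y_masked = np.array(y).copy()
y_masked[test_idx] = -1

semi_embedding = reducer.fit_transform(X, y=y_masked)
```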

This makes me much less confident about using UMAP as a metric learning tool. Any thoughts as to why this is happening (or more likely what I'm doing wrong) would be greatly appreciated.

rxjx commented 3 years ago

I repeated the above experiment, this time fitting the model with only the training data (no -1 labels), and then transforming the test data and plotting it as well. The plots show training data as '0' and '1' and test data as '0+' and '1+'. This time I get a closer match to the distribution obtained with the full set.

[image: training ('0', '1') and test ('0+', '1+') points; model fit on training data only, test data transformed in]
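In code, this second protocol looks roughly like the following (reusing the split from the earlier snippet):

```python
# Fit on the 90% training split with its true labels only,
# then project the held-out 10% into the learned space.
mapper = umap.UMAP(n_neighbors=50, metric='cosine',
                   random_state=42, target_weight=0.9)
train_embedding = mapper.fit_transform(X[train_idx], y=y[train_idx])
test_embedding = mapper.transform(X[test_idx])
```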

rxjx commented 3 years ago

I guess preserving topological structure doesn't necessarily work well for classification, where you want to separate the classes as much as possible regardless of the topology.

lmcinnes commented 3 years ago

There are certainly limits to what it can do. The supervision can be effective, but there are issues, especially when the label structure does not line up well with the topological structure of the data. You might be interested in work by Tim Sainburg on using neural networks to learn a parameterised map (https://github.com/timsainb/parametric_umap), including in a semi-supervised setting. He had some success with this using a combination of learned features and data augmentation.
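(For later readers: a parametric variant has since been folded into umap-learn itself as ParametricUMAP; it requires TensorFlow. A minimal sketch, assuming the same X as above:)

```python
from umap.parametric_umap import ParametricUMAP

# Learns a neural-network encoder instead of a point-by-point layout,
# so new data can be embedded with a forward pass through the network.
embedder = ParametricUMAP(n_neighbors=50, metric='cosine')
embedding = embedder.fit_transform(X)
```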

rxjx commented 3 years ago

Thanks for the link. Could you comment on why there's such a difference between UMAP with -1 as labels for 10% of the data vs fitting on 90% and then transforming that same 10%?

rxjx commented 3 years ago

Also, that link 404s, and a Google search for 'timsainb parametric umap' didn't turn up anything.

rxjx commented 3 years ago

Wait, my bad, the search does give me something to look at. I had typed it wrong the first time.

lmcinnes commented 3 years ago

I think the answer to your question is that the transform is, in some sense, imperfect. What I mean is that the result of fitting on some data and then transforming the rest is not the same as fitting on the full set. The first case has to hold the original embedding fixed and can only manipulate the new points, while the second sees the full dataset and adjusts everything accordingly. There is actually even more to it than that, because there are internal interaction effects, but that is more complicated to detail here. Merely withholding the labels on 10% of the data doesn't stop UMAP from making use of the data itself in the layout (and in those internal interactions). This can make for quite a different result in the end.
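To make the asymmetry concrete, a small sketch using the objects from the earlier snippets (the frozen-layout check reflects my understanding of transform's behaviour):

```python
import numpy as np

# fit + transform: the training layout is frozen, and only the
# positions of the new points are optimized against it.
frozen = mapper.embedding_.copy()
_ = mapper.transform(X[test_idx])
assert np.allclose(frozen, mapper.embedding_)  # fitted points unmoved

# fit_transform on the full set (with -1 for the withheld labels)
# instead optimizes all points jointly, so the layouts can differ a lot.
```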

rxjx commented 3 years ago

Yeah, that was my guess too, based on reading some of the other answers in the Issues section. I just wanted confirmation that I wasn't doing something completely wrong and hadn't missed an option or setting. Thanks for answering my questions.