Open rxjx opened 3 years ago
I repeated the above experiment, this time fitting the model with only the training data (no -1 for labels), and then transforming the test data and plotting it as well. Plots show training data as '0' and '1' with test data as '0+' and '1+'. This time I get a closer mapping to the distribution with the full set.
I guess preserving topological structure doesn't necessarily work well for classification, where you want to separate the classes as much as possible regardless of the topology.
There are certainly limits to what it can do. The supervision can be effective, but there are issues, especially when the label structure does not line up well with the topological structure of the data. You might be interested in work by Tim Sainburg on using neural networks to learn a parameterised map (https://github.com/timsainb/parametric_umap), including in a semi-supervised case. He had some success with this using a combination of learned features and data augmentation.
Thanks for the link. Could you comment on why there's such a difference between UMAP with -1 as labels for 10% of the data vs fitting on 90% and then transforming that same 10%?
Also, that link 404s, and a Google search for 'timsainb parametric umap' didn't turn up anything.
Wait, my bad, the search does give me something to look at. I typed it wrong first.
I think the answer to your question is that the transform is, in some sense, imperfect. What I mean is that fitting on some data and then transforming the rest is not the same as fitting on the full set: the first case has to hold the original embedding fixed and can only position the new points, while the second sees the full dataset and adjusts everything together. There is actually even more to it than that, because there are internal interaction effects as well, but that is more complicated to detail here. Merely withholding the labels on 10% of the data doesn't stop UMAP from making use of the actual data itself in the layout (and in those internal interactions). This can make for quite a different result in the long run.
Yeah, that was my guess too based on reading some of the other answers in the Issues section. Just wanted confirmation that I wasn't doing something completely wrong or had missed an option or setting. Thanks for answering my questions.
This is something I've been able to reproduce over 3 different datasets, although all 3 are text embeddings. Each case is a binary classification problem where the positive class frequency is roughly 1% and the dataset size is ~ 3-5k. There are ~ 500 features. I've tried multiple parameter combinations for n_neighbors, target_weight, and min_dist. All plots below use the following parameters: (n_neighbors=50, metric='cosine', random_state=42, target_weight=0.9)
If I take all my labeled data and run UMAP in supervised mode I get a nice separation between my positive and negative classes like so:

![image](https://user-images.githubusercontent.com/8461845/91910161-57f7da80-ec63-11ea-982a-4beddcca6e2a.png)
However, if I hold out 10% of the data using sklearn's StratifiedShuffleSplit and set those labels to -1, the resulting plots look very different, even if I set n_neighbors to 100:

![image](https://user-images.githubusercontent.com/8461845/91910785-8629ea00-ec64-11ea-9f2a-1683fd0525d2.png)
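For completeness, the masking step I'm using looks roughly like this (a sketch; the placeholder `X`/`y` here mimic the ~1% positive class rate, not my real embeddings):

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(42)
X = rng.normal(size=(400, 5))             # placeholder features
y = (rng.random(400) < 0.05).astype(int)  # rare positive class

# Hold out a stratified 10% and mark it as unlabeled (-1) for UMAP.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=42)
train_idx, test_idx = next(sss.split(X, y))
y_masked = y.copy()
y_masked[test_idx] = -1
# y_masked is then passed as the target, e.g. umap.UMAP(...).fit(X, y_masked)
```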
This makes me much less confident about using UMAP as a metric learning tool. Any thoughts as to why this is happening (or, more likely, what I'm doing wrong) would be greatly appreciated.