lmcinnes / umap

Uniform Manifold Approximation and Projection

UMAP Resistance to Scale Changes, Data Drift #969

Open dewball345 opened 1 year ago

dewball345 commented 1 year ago

Hi, apologies if this is obvious, but I wanted to ask how resistant UMAP is to differences in scale and offset. For example, if our train and test datasets are positioned slightly differently (due to some slight data drift), how will that affect UMAP?

I conducted a simple experiment where I checked UMAP's projections after drifting the data slightly to see how they changed. I used sklearn's make_blobs function to create three clusters with 3 features (3D space), then used UMAP to compress them into 2 components. I ran the transform function on both the training and testing data as a baseline, and the results were as expected. However, after shifting the test data by an offset, I found that, while there were still 3 clusters, the classes in each cluster were different:
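Here is a minimal sketch of the experiment (the parameter values are illustrative; the exact setup is in the colab linked below):

```python
import umap
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Three blobs in a 3-D feature space.
X, y = make_blobs(n_samples=1500, n_features=3, centers=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit UMAP on the training data only, then transform both splits.
reducer = umap.UMAP(n_components=2, random_state=42).fit(X_train)
emb_train = reducer.transform(X_train)

emb_baseline = reducer.transform(X_test)        # no offset
emb_offset1 = reducer.transform(X_test + 1.0)   # offset of 1
emb_offset2 = reducer.transform(X_test + 2.0)   # offset of 2
```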

No offset of test data: [scatter plot] UMAP projections: [embedding plot]

Offset of 1 (adding 1 to the test data): [scatter plot] UMAP projections: [embedding plot]

Offset of 2 (adding 2 to the test data): [scatter plot] UMAP projections: [embedding plot]

While the relative distances between classes are the same, and there are still 3 clusters in similar positions, it seems like the data in each cluster changes. This is a little surprising to me; I thought UMAP would be more resistant to scale changes (keep in mind that I am only showing the first two of the three features, but the point still stands). Doesn't UMAP use graph-based methods to learn the manifold? Shouldn't it pay attention to the distribution of the test data rather than applying static thresholds? Let me know if I am understanding this correctly.

The use case for this is situations with slight train-test drift (e.g., the data is highly variable, and while the distribution of clusters is similar, it can shift from time to time).

Here is a colab: https://colab.research.google.com/drive/1NyBNSxa81skCMDkOGfpVeHt5Nj3awkjH?usp=sharing

jlmelville commented 1 year ago

Shouldn't it pay attention to the distribution of the test data

Transforming new data only uses the embedded training data. So even if you pass a large batch of test data in one call, UMAP does not make use of the distribution of that data; each test instance is treated independently of the others.
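This independence is easy to check empirically: because the graph being laid out has no test-test edges, transforming the test rows one at a time should agree with a batch transform, up to optimization noise. A sketch, reusing `reducer` and `X_test` from the snippet in the first comment:

```python
import numpy as np

# Transform a small batch in one call ...
batch = reducer.transform(X_test[:10])

# ... and the same rows one at a time.
singles = np.vstack([reducer.transform(X_test[i:i + 1]) for i in range(10)])

# transform() runs a stochastic optimization, so expect approximate
# (not bitwise) agreement between the two.
print(np.abs(batch - singles).max())
```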

dewball345 commented 1 year ago

@jlmelville Got it. Is there any way, though, to make UMAP robust against these distribution changes (besides, say, normalization)?

jlmelville commented 1 year ago

The fundamental issue is that the data has shifted enough that the k-nearest-neighbor graph being embedded is now sufficiently different from that of the original data to give a visually different result.

So I think ultimately, if this is a concern, the raw knn graph isn't the right graph to be embedding. Maybe processing the knn graph would help here -- the sort of things discussed in Clustering with UMAP: Why and How Connectivity Matters would be a place to start, but that technique itself probably isn't exactly what you need, so I don't think there is an out-of-the-box solution. What happens if you use a larger value for n_neighbors? That might help, but it may also slow things down and give a layout that is not what you want. Sorry I don't have any better ideas.
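A sketch of that n_neighbors experiment, reusing the data from the snippet in the first comment (the values of k are arbitrary):

```python
import umap

for k in (15, 50, 200):
    reducer_k = umap.UMAP(
        n_components=2, n_neighbors=k, random_state=42
    ).fit(X_train)
    emb_shifted = reducer_k.transform(X_test + 2.0)
    # Plot emb_shifted colored by y_test to see whether cluster
    # memberships are more stable under the offset at larger k.
```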

dewball345 commented 1 year ago

Thanks for your response and help on this matter. Could you elaborate on what "the resulting k-nearest neighbor graph that is being embedded is now different enough from the original data to give a visually different result" means? Do you mean that the knn graph deals in an absolute space, and when we shift the data it goes over the boundaries set by the knn (not sure if that's how the algorithm works)? I'm a little confused about this because the knn is based on the distances between points; I'm just not sure whether that applies to the test data.

jlmelville commented 1 year ago

For the test data, the knn graph that generates the layout uses nearest neighbors in the training data only. When you shift the test data, which training data points are considered near neighbors could (and probably does) change.
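You can see this neighbor churn directly with a plain nearest-neighbor query against the training set (a scikit-learn sketch, not UMAP's internal knn search; data from the first snippet):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

nn = NearestNeighbors(n_neighbors=15).fit(X_train)
_, idx_orig = nn.kneighbors(X_test)
_, idx_shift = nn.kneighbors(X_test + 2.0)

# Fraction of test points whose 15-nearest-neighbor set changed at all.
changed = np.mean([bool(set(a) ^ set(b)) for a, b in zip(idx_orig, idx_shift)])
print(f"{changed:.0%} of test points have a different neighbor set")
```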

Also, yes: because the distances between the test and training data points have changed, that would also affect the layout even if the identities of the nearest neighbors stayed the same between the unshifted and shifted test sets. The effect would be subtle, but the distances are used to generate the resulting similarities, which in turn determine how often each edge is sampled. However, I don't think this usually has a large effect.
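To make the distances-to-similarities step concrete: UMAP turns each neighbor distance d into a membership strength of roughly the form exp(-max(0, d - rho) / sigma), where rho is the distance to the nearest neighbor and sigma is a per-point normalization (this is the form from the UMAP paper). A toy illustration with made-up numbers:

```python
import numpy as np

def membership(d, rho, sigma):
    # UMAP-style edge weight: 1 at the nearest neighbor, decaying with distance.
    return np.exp(-np.maximum(0.0, d - rho) / sigma)

d_before = np.array([0.5, 0.8, 1.2])   # distances to three training neighbors
d_after = d_before + 0.4               # same neighbors, now farther away

# Holding rho and sigma fixed for simplicity; in UMAP they are
# recomputed per point, which absorbs some of the change.
print(membership(d_before, rho=0.5, sigma=0.6))  # stronger edges
print(membership(d_after, rho=0.5, sigma=0.6))   # weaker edges, sampled less
```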

I may be misunderstanding your question so apologies if this isn't what you are asking about.

dewball345 commented 1 year ago

Okay, got it, that makes sense. This was just an example to reproduce the issue; in practice, this wasn't too big a problem for me either.

Just to summarize what I think you are saying: we "train" a nearest-neighbors graph-based model on our training data, based on the distances/distribution between the points. When we apply the test data, we use the same knn model (so in essence we "compare" the test points to the train distribution). If the test distribution changes, the distances between the test points and the train points change, which affects the resulting graph structure.

Thanks again, I will check the paper and ask questions as needed (and will close this issue if resolved). I think a solution that could work would be to construct a graph for the test dataset and then try to "match" it with the training graph before applying a transformation, if that makes sense. Basically, I want to know if there's a good way to "match" the testing and training distributions somehow (kind of like a calibration). The simplest version of what I have in mind is sketched below.
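For what it's worth, a minimal version of the calibration idea would be mean-matching the test data to the training data before the transform (a sketch only; it can undo a pure offset like the one in my experiment, but not more general drift):

```python
X_test_shifted = X_test + 2.0  # the drifted test set from the experiment

# Re-center the test data onto the training mean before transforming.
X_test_aligned = (
    X_test_shifted - X_test_shifted.mean(axis=0) + X_train.mean(axis=0)
)
emb_aligned = reducer.transform(X_test_aligned)
```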

jlmelville commented 1 year ago

Basically want to know if there's any good way to "match" testing and training distributions somehow (kind of like a calibration)

If the cluster identities could be used as labels consistently for both training and test data, then you could use supervised UMAP to ameliorate the effect of the drift? Maybe?
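Something like the following (a sketch of a supervised fit; y_train stands in for whatever consistent cluster labels are available):

```python
import umap

# Passing y to fit() makes UMAP use the labels when building the layout.
sup_reducer = umap.UMAP(n_components=2, random_state=42).fit(X_train, y=y_train)

# Transform the drifted test data against the supervised embedding.
emb_test = sup_reducer.transform(X_test + 2.0)
```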