Incorrect structure on random points for data with small variations. Lack of shuffling?

umap.__version__ == 0.4.6

In the following example, UMAP is applied to random data with tiny variations around a fixed value. Arbitrary labels are assigned to the data in order ('a' for the first 150 samples, 'b' to the next 150, etc.).

When applying UMAP, a structure appears (points with the same label are grouped), as shown in the image below, but it should not (random data). This issue can potentially lead to incorrect interpretations.

It could be due to a kind of greedy behavior in the construction of the neighborhood graph. Randomly shuffling the data before construction of the neighborhood graph solves the problem, but might not be an ideal fix.

import numpy as np
from umap import UMAP
import umap.plot

n_obs = 600
n_vars = 16
np.random.seed(42)

fixed_value = np.random.random(n_vars)
X = 10 * fixed_value + 1e-9 * np.random.random((n_obs, n_vars))
labels = np.array(['a'] * (n_obs // 4)
                  + ['b'] * (n_obs // 4)
                  + ['c'] * (n_obs // 4)
                  + ['d'] * (n_obs // 4))

reducer = UMAP(
    metric="cosine",
    min_dist=0.0,
    n_neighbors=10,
    random_state=0,
)
reducer.fit(X)
p = umap.plot.points(reducer, labels=labels, theme="fire")
umap.plot.show(p)

Screenshot from 2022-03-02 15-25-55

In the sample code, the rows of X are basically identical, at least up to the 8 decimal places that my Jupyter Lab notebook displays, so the small variation in this simulation data might be a bit too small. Is this level of numerical similarity based on real-world data?

Anyway, what you are seeing is the nearest neighbor routine failing in the event that it can't distinguish the distances between any observations (the distances are all zero). What happens under those conditions as to what neighbors get picked is going to be very implementation-dependent. In an ideal world picking at random would be better than what we get, but I don't see this even with other commonly used nearest neighbor methods (e.g. Annoy, HNSW, the exact nearest neighbors R package FNN).

In my experience this scenario doesn't happen with non-simulation data. What I have observed are small subsets of the data being exact duplicates. Degeneracy of distances can be a bigger risk with low-dimensional binary data with integer-based distances because there are a limited number of possible distances. But I don't know how common such datasets are and wouldn't result in the full-on catastrophe of all observations being effectively duplicates.

Assuming this does affect real data, there isn't an easy fix because as mentioned other nearest neighbor routines also fail in a similar way (see https://github.com/jlmelville/uwot/issues/85 for instance). Diagnosing when the issue is likely to occur is probably the best that can be done. Something that might help would be to log the number of times the binary search in smooth_knn_dist fails. I have this in uwot and it invariably indicates the presence of duplicates. It might not be too hard to come up with a PR to implement this here too.

(as an aside for anyone reading this, I recommend the recently released Avonet bird morphology dataset for a fun dataset where you will see some duplicates. Seems there are many parrot species with identical beak/wing measurements. The resulting embedding is not unduly affected however)

Thanks for your answer! I actually encountered the bug on real data, or, more precisely, on embeddings generated via machine learning from real data. The learning algorithm was a variant of contrastive learning, which tends to make embeddings of similar samples "collapse" (without being identical). So it is not unusual to have neighborhoods of embeddings in which the different points are extremely close to each other (with distances around 1e-10). It really took me some time to understand what was happening!

Thanks for the information that other nearest neighbor routines would also fail. I'll remember to either diagnose the "dangerous" cases, as you mentioned, or to shuffle the data before computing neighborhood graphs. I think that a warning, either when very small non-zero distances are detected, or, as you suggested, when multiple binary searches fail, would be a way to prevent other people from encountering the same bug.

We typically deal with the nearest neighbour problem with regards to duplicate data with a UMAP flag unique=True. This will dedup your data before computing our nearest neighbour graph and then embed all the duplicate copies of your data on the same coordinate in your low dimensional space. As James correctly pointed out this problem is commonly encountered with binary data derived from categorical features. Duplicate count_vectorized text data was the original use case that tripped me up with respect to this property.

You could always round your data to some small precision (such as 1e-10) and then use unique=True. Of course, that doesn't help with detecting the problem in the first place. I'm not sure that this problem is common enough to warrant a warning but I could be convinced otherwise at which point a PR would be lovely.

On Thu, Mar 3, 2022 at 1:22 PM Nicolas Perrin-Gilbert < @.***> wrote:

Thanks for your answer! I actually encountered the bug on real data, or, more precisely, on embeddings generated via machine learning from real data. The learning algorithm was a variant of contrastive learning, which tends to make embeddings of similar samples "collapse" (without being identical). So it is not unusual to have neighborhoods of embeddings in which the different points are extremely close to each other (with distances around 1e-10). It really took me some time to understand what was happening!

Thanks for the information that other nearest neighbor routines would also fail. I'll remember to either diagnose the "dangerous" cases, as you mentioned, or to shuffle the data before computing neighborhood graphs. However, a warning when very small non-zero distances are detected would, I think, be a way to prevent other people from encountering the same bug.

— Reply to this email directly, view it on GitHub https://github.com/lmcinnes/umap/issues/840#issuecomment-1058351183, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3IUWX6IHUJA7LZ65XQUNDU6D7NFANCNFSM5PXT4MBQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

lmcinnes / umap

Incorrect structure on random points for data with small variations. Lack of shuffling? #840