lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License
901 stars 105 forks source link

Infinite Euclidean distances when random seed is unlucky number 13 #125

Open slinnarsson opened 3 years ago

slinnarsson commented 3 years ago

When the numpy random seed is set to unlucky number 13, this code snippet:

from pynndescent import NNDescent
X = np.load("X.npy")
np.random.seed(13)
nn = NNDescent(data=X, metric="euclidean", n_jobs=-1)
di, d = nn.query(X, k=50 + 1)
np.all(np.isfinite(d))

...returns False because of 150 inf values in d, but if you set the random seed to lucky number 8 instead, then all distances are finite. The input matrix (X.npy.zip) is all finite with shape (118198, 50).

I like my distances reproducible, hence why I'm setting the random seed. I could use lucky numbers, but it seems likely that 13 is not special and every number is unlucky for some input. I don't think this kind of bad luck can be extremely uncommon; I've hit the bug several times and this particular instance affected one input out of 49 (all of which are N by 50 matrices with N ranging from maybe 10,000 to 2.4 million).

pynndescent.__version__ is '0.5.2' and I'm running Python 3.8.8 in Anaconda 2020.07 on macOS.

Any help is appreciated. As a workaround, I can try again with a new random seed each time I get infinities (and this should still be reproducible) but it would be great to avoid the extra computations.

slinnarsson commented 3 years ago

To clarify, none of the Euclidean distances should actually be infinite, so this is a bug caused somehow by a specific random seed. The input data range X.min(), X.max() is (-1807.786, 3252.2761) so the infinities could hardly be caused by numerical overflow.

di[~np.isfinite(d)] returns

array([113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
       113594, 113594, 113594, 113594, 113594, 113594], dtype=int32)

i.e. all the infinite distances concern the same datapoint. The value of that datapoint X[113594] is

array([ -1.5148002 ,  -0.7143164 ,  -1.202487  ,  -0.7384715 ,
        -0.41573703,  -0.10846981,  -0.5872079 ,  -3.4849496 ,
         0.03345347,  -6.777848  ,  -5.4231877 ,   2.0682294 ,
         0.5331792 ,  13.392006  ,  -3.7574525 , -12.20033   ,
        -0.16876319,   0.72091   ,   2.914828  ,  -1.3504342 ,
        -2.6851077 ,  -0.1541814 ,  -7.2727284 , -20.160751  ,
        15.643392  ,  -6.5628266 ,   4.5133443 , -12.046246  ,
        -0.6132447 ,  -3.1790137 ,   0.42011994,   4.7484884 ,
         3.5867703 ,  -1.6729491 ,   3.965788  ,  -1.1659606 ,
        -0.6355761 ,   1.3691    ,  -0.2475289 ,   3.6624372 ,
        -2.352024  ,   4.2019773 ,   3.6982949 ,  -4.6591644 ,
        -6.585663  ,  -0.05407224,  -4.841457  ,   3.2208655 ,
        -1.519337  ,  -1.0916559 ], dtype=float32)

so there seems to be nothing special about that particular datapoint.

lmcinnes commented 3 years ago

This is ... very peculiar. It certainly isn't obvious to me how this could happen. It is possible to end up with infinite distances in a result if an insufficient number of neighbours are found, but then the indices associated should all be-1 which is not the case here. I'll try to look at this when I get some spare time because I fear this will take some digging. Thanks for managing to get a stripped down reproducer with shareable data -- it will go a long way to tracking down the issue.

slinnarsson commented 3 years ago

Thanks, & thanks for developing pynndescent and UMAP! They are true workhorses for the single-cell analysis community.

Btw - for anyone else who hits this issue - the workaround was simply to increment the random seed and try again; something like this:

if not np.all(np.isfinite(dist)):
    logging.error(f"Some distances were not finite; retrying with new random seed")
    random_seed += 1

That's still deterministic since the same random number seed will reliably create infinite distances on the same input data, thus causing a retry with the next seed.

keunwoochoi commented 1 year ago

not sure if i have the same issue.

in my case - after index.query(), there was quite some case where it returns distance of np.inf or some absurdly large numbers like 325234.. somehow if i query again in such a case, i got a result with reasonable distance (like, 12 or 15 in my 79-dim x 50k dataset).