Open slinnarsson opened 3 years ago
To clarify, none of the Euclidean distances should actually be infinite, so this is a bug caused somehow by a specific random seed. The input data range X.min(), X.max()
is (-1807.786, 3252.2761)
so the infinities could hardly be caused by numerical overflow.
di[~np.isfinite(d)]
returns
array([113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594, 113594, 113594,
113594, 113594, 113594, 113594, 113594, 113594], dtype=int32)
i.e. all the infinite distances concern the same datapoint. The value of that datapoint X[113594]
is
array([ -1.5148002 , -0.7143164 , -1.202487 , -0.7384715 ,
-0.41573703, -0.10846981, -0.5872079 , -3.4849496 ,
0.03345347, -6.777848 , -5.4231877 , 2.0682294 ,
0.5331792 , 13.392006 , -3.7574525 , -12.20033 ,
-0.16876319, 0.72091 , 2.914828 , -1.3504342 ,
-2.6851077 , -0.1541814 , -7.2727284 , -20.160751 ,
15.643392 , -6.5628266 , 4.5133443 , -12.046246 ,
-0.6132447 , -3.1790137 , 0.42011994, 4.7484884 ,
3.5867703 , -1.6729491 , 3.965788 , -1.1659606 ,
-0.6355761 , 1.3691 , -0.2475289 , 3.6624372 ,
-2.352024 , 4.2019773 , 3.6982949 , -4.6591644 ,
-6.585663 , -0.05407224, -4.841457 , 3.2208655 ,
-1.519337 , -1.0916559 ], dtype=float32)
so there seems to be nothing special about that particular datapoint.
This is ... very peculiar. It certainly isn't obvious to me how this could happen. It is possible to end up with infinite distances in a result if an insufficient number of neighbours are found, but then the indices associated should all be-1
which is not the case here. I'll try to look at this when I get some spare time because I fear this will take some digging. Thanks for managing to get a stripped down reproducer with shareable data -- it will go a long way to tracking down the issue.
Thanks, & thanks for developing pynndescent and UMAP! They are true workhorses for the single-cell analysis community.
Btw - for anyone else who hits this issue - the workaround was simply to increment the random seed and try again; something like this:
if not np.all(np.isfinite(dist)):
logging.error(f"Some distances were not finite; retrying with new random seed")
random_seed += 1
That's still deterministic since the same random number seed will reliably create infinite distances on the same input data, thus causing a retry with the next seed.
not sure if i have the same issue.
in my case - after index.query()
, there was quite some case where it returns distance of np.inf
or some absurdly large numbers like 325234.
. somehow if i query again in such a case, i got a result with reasonable distance (like, 12 or 15 in my 79-dim x 50k dataset).
When the numpy random seed is set to unlucky number 13, this code snippet:
...returns False because of 150
inf
values in d, but if you set the random seed to lucky number 8 instead, then all distances are finite. The input matrix (X.npy.zip) is all finite with shape(118198, 50)
.I like my distances reproducible, hence why I'm setting the random seed. I could use lucky numbers, but it seems likely that 13 is not special and every number is unlucky for some input. I don't think this kind of bad luck can be extremely uncommon; I've hit the bug several times and this particular instance affected one input out of 49 (all of which are N by 50 matrices with N ranging from maybe 10,000 to 2.4 million).
pynndescent.__version__
is '0.5.2' and I'm running Python 3.8.8 in Anaconda 2020.07 on macOS.Any help is appreciated. As a workaround, I can try again with a new random seed each time I get infinities (and this should still be reproducible) but it would be great to avoid the extra computations.