lmcinnes / pynndescent

A Python nearest neighbor descent for approximate nearest neighbors
BSD 2-Clause "Simplified" License

[BUG] Error out when using 'cosine' distance metrics (ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types) #163

Open eterna2 opened 2 years ago

eterna2 commented 2 years ago


I am actually using umap, but I know it uses pynndescent under the hood. When I run umap with 10k rows or more, I get the following error:

numpy.core._exceptions.UFuncTypeError: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types <class 'numpy.dtype[float32]'> -> None

This is a minimal reproducible example:

import numpy as np
import umap

print(100, umap.UMAP(metric="cosine").fit(np.random.random([100,10])).embedding_.shape)
print(1000, umap.UMAP(metric="cosine").fit(np.random.random([1000,10])).embedding_.shape)
print(10000, umap.UMAP(metric="cosine").fit(np.random.random([10000,10])).embedding_.shape)


This is the environment:

python 3.8.2

colorama      0.4.4  Cross-platform colored terminal text.
joblib        1.1.0  Lightweight pipelining with Python functions
llvmlite      0.34.0 lightweight wrapper around basic LLVM functionality
numba         0.51.2 compiling Python code using LLVM
numpy         1.22.0 NumPy is the fundamental package for array computing with Python.
pynndescent   0.5.5  Nearest Neighbor Descent
scikit-learn  1.0.2  A set of python modules for machine learning and data mining
scipy         1.6.1  SciPy: Scientific Library for Python
threadpoolctl 3.0.0  threadpoolctl
tqdm          4.62.3 Fast, Extensible Progress Meter
umap-learn    0.5.2  Uniform Manifold Approximation and Projection

This did not happen in the previous version of my application. I suspect it is due to the new numpy version. However, I am also using hdbscan, which does not work with any numpy version except 1.22.0.

eterna2 commented 2 years ago

I have tested with numpy=1.20.3 and it works normally.

lmcinnes commented 2 years ago

It looks like an issue somewhere in the interactions of numba, numpy and (presumably) numpy's addition of type signature information which is fairly new. I'm not sure there is an easy fix for this, as it is in interactions of upstream libraries, so it is going to take me a while to figure out how to make it work. In the meantime hopefully older versions of numpy work for now. I'll see if I can figure something out though.

lmcinnes commented 2 years ago

Digging in a little more: currently numba does not support numpy >= 1.21, so things are potentially just going to break. It seems highly likely they will fix that in the future, but I have no idea of timelines. The interplay between getting hdbscan (with its Cython compilation) working with numpy is hairy and frustrating. I'm not sure I have any good immediate workarounds.
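
To check whether an environment falls in the range described above, a small sketch like the following may help. The helper names are hypothetical, and the 1.21 threshold is an assumption taken from the comment above, which describes numba at the time of writing; later numba releases may support newer numpy versions.

```python
from importlib import metadata


def parse_major_minor(version):
    """Return (major, minor) from a version string such as '1.22.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def numpy_too_new_for_numba(numpy_version, threshold=(1, 21)):
    """Hypothetical helper: True if numpy is at or above the threshold.

    The default threshold reflects the statement in this thread only.
    """
    return parse_major_minor(numpy_version) >= threshold


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Print the versions of the packages involved in this issue.
    for name in ("numpy", "numba", "pynndescent", "umap-learn"):
        print(name, installed_version(name))

    np_version = installed_version("numpy")
    if np_version is not None and numpy_too_new_for_numba(np_version):
        print("numpy is >= 1.21; check that your numba release supports it")
```

With the environment listed in the original report (numpy 1.22.0, numba 0.51.2), the check would flag numpy as too new, which matches the report that numpy 1.20.3 works.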

lmcinnes commented 2 years ago

So I have a workaround that may get you past this particular issue. It's not pretty, but it should do the job. In umap/distances.py there is a function definition:

def correct_alternative_cosine(d):
    return 1.0 - pow(2.0, -d)

If you change that to

def correct_alternative_cosine(ds):
    # Same formula as above, 1 - 2^(-d), applied element by element
    result = np.empty_like(ds)
    for i in range(ds.shape[0]):
        result[i] = 1.0 - np.power(2.0, -ds[i])
    return result

Then this should avoid the issue, which seems specifically related to numba.vectorize. You can probably just make this edit in your installed copy of umap in site-packages and have it work.
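
As a sanity check, the loop-based rewrite can be compared against the original scalar formula. This is a minimal sketch using plain NumPy (no numba); the names `scalar_formula` and `loop_version` are hypothetical, and it assumes the intended formula is 1 - 2^(-d), as in the original definition, so the exponent is negated in both versions.

```python
import numpy as np


def scalar_formula(d):
    # Original definition from the comment above: 1 - 2^(-d)
    return 1.0 - pow(2.0, -d)


def loop_version(ds):
    # Loop-based rewrite; the exponent is negated to match the original
    result = np.empty_like(ds)
    for i in range(ds.shape[0]):
        result[i] = 1.0 - np.power(2.0, -ds[i])
    return result


# A few sample "alternative cosine" distances as float32, the dtype in the error
ds = np.array([0.0, 0.5, 1.0, 3.0], dtype=np.float32)
expected = np.array([scalar_formula(float(d)) for d in ds], dtype=np.float32)

# Both versions should agree element by element
assert np.allclose(loop_version(ds), expected)
```

If the minus sign on the exponent is dropped, the assertion fails for every nonzero input, so the sign is worth double-checking when copying the workaround.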

eterna2 commented 2 years ago

Thanks. I will give it a try!

Kydlaw commented 2 years ago


I just took the same bullet.


I upgraded to a newer version of NumPy as a fix for https://github.com/scikit-learn-contrib/hdbscan/issues/457.

Having that one fixed, I stumbled on this issue. So I tried the fix you suggested in:

[...] In umap/distances.py there is a function definition:

def correct_alternative_cosine(d):
    return 1.0 - pow(2.0, -d)

If you change that to

def correct_alternative_cosine(ds):
    result = np.empty_like(ds)
    for i in range(ds.shape[0]):
        result[i] = 1.0 - np.power(2.0, -ds[i])
    return result

[...] you can just make this edit in your installed copy of umap in site-packages and have it work.

This change works for me. One small correction, however: the distance definition is not in umap/distances.py but in pynndescent/distances.py.

So, if you are using venv, apply the suggested changes in .venv/lib/pythonX.X/site-packages/pynndescent/distances.py.

lmcinnes commented 2 years ago

Thanks for letting me know it works, and also for the correction on where to make the change!

jjsnlee commented 2 years ago

Another option (not having to mess around in an install env) is to do the following somewhere in your own code:

import pynndescent
pynn_dist_fns_fda = pynndescent.distances.fast_distance_alternatives
pynn_dist_fns_fda["cosine"]["correction"] = correct_alternative_cosine
pynn_dist_fns_fda["dot"]["correction"] = correct_alternative_cosine

vkartha commented 1 year ago

I am running into this issue currently. When I try the code from the above comment, it can't find correct_alternative_cosine:

NameError: name 'correct_alternative_cosine' is not defined

I tried changing it to pynndescent.distances.correct_alternative_cosine, but that gave the original error as well.
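
The NameError suggests that the snippet above expects a replacement function to be defined first; pointing back at pynndescent.distances.correct_alternative_cosine just reinstalls the function that fails. A minimal sketch of the full patch follows. It assumes the corrected formula 1 - 2^(-d) from earlier in this thread and that pynndescent calls the correction from ordinary Python code, so a plain NumPy function is acceptable. The helper name `patch_pynndescent` is hypothetical.

```python
import numpy as np


def correct_alternative_cosine(ds):
    # Plain-NumPy replacement for the vectorized function: 1 - 2^(-d)
    return 1.0 - np.power(2.0, -np.asarray(ds))


def patch_pynndescent():
    """Install the replacement for the 'cosine' and 'dot' corrections.

    Returns True if pynndescent was importable and patched, else False.
    """
    try:
        import pynndescent.distances as pynnd_dist
    except ImportError:
        return False
    # Dictionary name and keys as used in the comment above
    alternatives = pynnd_dist.fast_distance_alternatives
    alternatives["cosine"]["correction"] = correct_alternative_cosine
    alternatives["dot"]["correction"] = correct_alternative_cosine
    return True


# Apply the patch before constructing any UMAP or NNDescent object
patched = patch_pynndescent()
```

The patch only affects the current Python process, so it needs to run in every script or notebook session before the first fit.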

garchaaman19 commented 1 year ago

I get the same error with the code below:

 umap_embeddings = umap.UMAP(n_neighbors=np.min([5, data_df.shape[0]]),

The above code works for files with fewer than 3k lines but fails for files with more than 5k lines. @lmcinnes, can you please help?