lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Fitting a sparse matrix with at least 4,096 rows fails when using correlation as the distance metric #472

Open msto opened 4 years ago

msto commented 4 years ago

Hi,

I'm encountering an error when attempting to run UMAP using correlation as the distance metric. I've reduced my code to a minimal reproducible example below.

import numpy as np
from scipy import sparse
import umap

np.random.seed(149)

X = sparse.rand(5000, 1000)
embed = umap.UMAP(metric='correlation').fit_transform(X)

This results in the following error:

TypingError                               Traceback (most recent call last)
<ipython-input-15-247e767a9e1f> in <module>
      1 X = sparse.rand(5000, 5000)
----> 2 embed = umap.UMAP(metric='correlation').fit_transform(X)

~/.conda/envs/py37/lib/python3.7/site-packages/umap/umap_.py in fit_transform(self, X, y)
   2012             Embedding of the training data in low-dimensional space.
   2013         """
-> 2014         self.fit(X, y)
   2015         return self.embedding_
   2016

~/.conda/envs/py37/lib/python3.7/site-packages/umap/umap_.py in fit(self, X, y)
   1799                 self.low_memory,
   1800                 use_pynndescent=True,
-> 1801                 verbose=self.verbose,
   1802             )
   1803

~/.conda/envs/py37/lib/python3.7/site-packages/umap/umap_.py in nearest_neighbors(X, n_neighbors, metric, metric_kwds, angular, random_state, low_memory, use_pynndescent, verbose)
    362                     leaf_array=leaf_array,
    363                     n_iters=n_iters,
--> 364                     verbose=verbose,
    365                 )
    366             else:

~/.conda/envs/py37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
    413                 e.patch_message(msg)
    414
--> 415             error_rewrite(e, 'typing')
    416         except errors.UnsupportedError as e:
    417             # Something unsupported is present in the user code, add help info

~/.conda/envs/py37/lib/python3.7/site-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
    356                 raise e
    357             else:
--> 358                 reraise(type(e), e, None)
    359
    360         argtypes = []

~/.conda/envs/py37/lib/python3.7/site-packages/numba/core/utils.py in reraise(tp, value, tb)
     78         value = tp()
     79     if value.__traceback__ is not tb:
---> 80         raise value.with_traceback(tb)
     81     raise value
     82

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.core.typeinfer.CallConstraint object at 0x7f1ff42e6110>.
missing a required argument: 'n_features'
During: resolving callee type: type(CPUDispatcher(<function sparse_correlation at 0x7f1ff4aef0e0>))
During: typing of call at /home/mstone/.conda/envs/py37/lib/python3.7/site-packages/umap/sparse_nndescent.py (235)

Enable logging at debug level for details.

File ".conda/envs/py37/lib/python3.7/site-packages/umap/sparse_nndescent.py", line 235:
def sparse_nn_descent(
    <source elided>

            d = sparse_dist(from_inds, from_data, to_inds, to_data)
            ^

The error does not appear when using the default Euclidean metric, nor when providing a dense numpy matrix.

# Both work OK
embed = umap.UMAP().fit_transform(X)
embed = umap.UMAP(metric='correlation').fit_transform(X.toarray())

The error appears to start occurring when the matrix has at least 4,096 rows, and is unaffected by the number of features.

# Works OK
X = sparse.rand(4095, 1000)
embed = umap.UMAP(metric='correlation').fit_transform(X)

# Error
X = sparse.rand(4096, 1000)
embed = umap.UMAP(metric='correlation').fit_transform(X)

Is there anything special about reaching 2^12 rows that might be causing this?

Thanks!

lmcinnes commented 4 years ago

Hmm, seems like the sparse case missed the fact that we need n_features for correlation distances. The "special" thing is likely the fact that for less than 4096 samples it will just compute all pairs distances (since this is cheaper when the dataset size is small). I'll see if I can get this fixed when I get some time.
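To illustrate why the sparse kernel needs the extra argument: a sparse row only stores its nonzero entries, but the correlation distance centers each vector by its mean over all features, including the implicit zeros, so the total feature count has to be passed in. A dense sketch of the computation (an illustrative reimplementation, not the library's jitted `sparse_correlation`):

```python
import numpy as np

def sparse_correlation(inds_a, data_a, inds_b, data_b, n_features):
    """Correlation distance between two sparse rows given as (indices, values).

    The means must be taken over ALL n_features positions, not just the
    stored nonzeros -- this is why the kernel needs n_features at all.
    """
    # Densify for clarity; the real kernel iterates the index arrays directly.
    a = np.zeros(n_features)
    a[inds_a] = data_a
    b = np.zeros(n_features)
    b[inds_b] = data_b
    a -= a.sum() / n_features
    b -= b.sum() / n_features
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 1.0  # convention for constant rows
    return 1.0 - np.dot(a, b) / denom
```

For vectors with the zeros filled in, this agrees with `1 - np.corrcoef(a, b)[0, 1]`.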

aCampello commented 3 years ago

I am having exactly the same error here! Has this been fixed? What would be the roadmap to fix it? I am happy to give it a go.
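From the traceback, the fix presumably amounts to threading `n_features` (i.e. `X.shape[1]`) through to the sparse metric before it reaches `sparse_nn_descent`, for instance by closing over it. A hypothetical sketch of that shape (names are illustrative, and the real code path uses numba-jitted kernels, which constrains how the argument can actually be bound):

```python
import numpy as np

def make_sparse_metric(sparse_dist, n_features):
    """Adapt a sparse metric that needs the total feature count to the
    4-argument (inds, data, inds, data) signature nn-descent calls with.

    Illustrative only: in numba nopython mode the closure itself would
    also need to be jitted, which is likely where the real fix lives.
    """
    def dist(from_inds, from_data, to_inds, to_data):
        return sparse_dist(from_inds, from_data, to_inds, to_data, n_features)
    return dist
```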

aCampello commented 3 years ago

I also have a similar error even with metric='euclidean', but it is related to pickling. I will try to put together a minimal example and open an issue about it.