lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.44k stars 807 forks source link

Slow fit for small datasets using UMAP or densMAP #766

Open sgbaird opened 3 years ago

sgbaird commented 3 years ago

Is it normal for UMAP or densMAP to take 5-40 s to fit a small dataset of precomputed distances (e.g. 10x10 distance matrix)? My "test" scripts for a separate project which include a couple of UMAP calls end up taking about as much time as if I used a 10000 x 10000 distance matrix.

Reproducible script:

"""Time various calls to UMAP."""
import umap
import numpy as np
from time import time
from scipy.spatial.distance import pdist, squareform

class Timer(object):
    """
    Simple timer class.

    https://stackoverflow.com/a/5849861/13697228
    Usage
    -----
    with Timer("description"):
        # do stuff
    """

    def __init__(self, name=None):
        """Record name."""
        self.name = name

    def __enter__(self):
        """Enter the timer."""
        self.tstart = time()

    def __exit__(self, type, value, traceback):
        """Exit the timer."""
        if self.name:
            print(
                "[%s]" % self.name,
            )
        print(("Elapsed: {}\n").format(round((time() - self.tstart), 5)))

X = np.random.rand(10, 5)

eucl_dm = squareform(pdist(X))

# fmt: off
# Earth Mover's Distance matrix (not related to X)
emd_dm = np.array([[ 0.        , 27.        , 27.        ,  6.86666667, 15.        ,
        19.65      , 10.8       , 15.6       ,  9.71428571, 19.6       ],
       [27.        ,  0.        , 14.5       , 26.66666667, 20.6       ,
        20.25      , 35.        , 19.        , 19.42857143, 25.        ],
       [27.        , 14.5       ,  0.        , 23.33333333, 14.2       ,
        11.25      , 24.        , 18.6       , 18.42857143, 25.        ],
       [ 6.86666667, 26.66666667, 23.33333333,  0.        , 12.66666667,
        18.41666667, 13.        , 13.26666667,  9.57142857, 16.33333333],
       [15.        , 20.6       , 14.2       , 12.66666667,  0.        ,
         8.15      , 18.8       , 19.4       , 16.42857143, 24.6       ],
       [19.65      , 20.25      , 11.25      , 18.41666667,  8.15      ,
         0.        , 19.75      , 23.85      , 17.53571429, 29.25      ],
       [10.8       , 35.        , 24.        , 13.        , 18.8       ,
        19.75      ,  0.        , 24.6       , 17.71428571, 29.        ],
       [15.6       , 19.        , 18.6       , 13.26666667, 19.4       ,
        23.85      , 24.6       ,  0.        ,  9.02857143,  6.8       ],
       [ 9.71428571, 19.42857143, 18.42857143,  9.57142857, 16.42857143,
        17.53571429, 17.71428571,  9.02857143,  0.        , 13.42857143],
       [19.6       , 25.        , 25.        , 16.33333333, 24.6       ,
        29.25      , 29.        ,  6.8       , 13.42857143,  0.        ]]) #
# fmt: on

with Timer("Euclidean UMAP"):
    umap.UMAP().fit(X)

with Timer("Precomputed Euclidean UMAP"):
    umap.UMAP(metric="precomputed").fit(eucl_dm)

with Timer("Precomputed EMD UMAP"):
    umap.UMAP(metric="precomputed").fit(emd_dm)

with Timer("Precomputed EMD densMAP"):
    umap_trans = umap.UMAP(
        densmap=True,
        output_dens=True,
        dens_lambda=1.0,
        min_dist=0.01,
        n_components=2,
        metric="precomputed",
    ).fit(emd_dm)

with Timer("Precomputed EMD densMAP"):
    std_trans = umap.UMAP(
        densmap=True,
        output_dens=True,
        dens_lambda=1.0,
        metric="precomputed",
    ).fit(emd_dm)

with Timer("Precomputed EMD UMAP"):
    std_trans = umap.UMAP(metric="precomputed").fit(emd_dm)

Output:

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Euclidean UMAP]
Elapsed: 14.15193

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed Euclidean UMAP]
Elapsed: 8.00694

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 4.76215

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 28.30055

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 11.316

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 5.19057
sgbaird commented 3 years ago

I'm noticing that very small datasets (e.g. 100 points) seems to be slower than slightly larger datasets (e.g. 800 points) according to the docs: benchmarking results, though I haven't been able to reproduce this behavior.

Oddly (to me at least), it seems to run ~2x faster in an IPython kernel 🤨

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Euclidean UMAP]
Elapsed: 6.28317

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed Euclidean UMAP]
Elapsed: 3.50397

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 1.83107

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 12.95584

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP]
Elapsed: 4.81335

C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP]
Elapsed: 1.92252

Also, setting random_state=42 didn't seem to affect it much.

sgbaird commented 3 years ago

Perhaps I should use PCA as a fast swap-in, and then switch back to UMAP for production purposes. For densMAP values (i.e. radius), I might just need to use a random number generator.

lmcinnes commented 3 years ago

There is definitely some overhead that is going to be bad for the small cases, but largely irrelevant for the large cases. As to actually being slower for very small cases -- there's no obvious reason why that should be so, but may be due to the optimization phase simply not working well for small data. The sampling based methodology, particularly the negative sampling, works fine for larger datasets, but for very small datasets it starts violating assumptions (that, in all likelihood, a random pair of samples is unrelated). This could conceivably slow things down (and I wouldn't trust the results either). As a rule of thumb I would be worried whenever n_neighbors / n_samples isn't around 0.1 or less -- the smaller the better with regard to optimization assumptions.