**sgbaird** opened 3 years ago
I'm noticing that very small datasets (e.g. 100 points) seem to be slower than slightly larger datasets (e.g. 800 points) according to the benchmarking results in the docs, though I haven't been able to reproduce this behavior.
Oddly (to me at least), it seems to run ~2x faster in an IPython kernel 🤨
```
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Euclidean UMAP] Elapsed: 6.28317
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed Euclidean UMAP] Elapsed: 3.50397
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP] Elapsed: 1.83107
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP] Elapsed: 12.95584
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD densMAP] Elapsed: 4.81335
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:1735: UserWarning: using precomputed metric; transform will be unavailable for new data and inverse_transform will be unavailable for all data
  warn(
C:\Users\sterg\Anaconda3\envs\elm2d-crabnet\lib\site-packages\umap\umap_.py:2213: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
[Precomputed EMD UMAP] Elapsed: 1.92252
```
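A minimal sketch of the kind of timing harness that produces logs like these, assuming a small precomputed Euclidean distance matrix. `PCA` stands in for the reducer so the snippet is self-contained; in the actual runs the reducer would be `umap.UMAP(metric="precomputed", random_state=42)` (the `timed_fit` helper and its label strings are illustrative, not part of any API):

```python
# Timing harness sketch: build a small precomputed Euclidean distance matrix
# (the "very small dataset" case, 100 points) and time a fit on it.
# PCA is a stand-in so this runs without umap-learn installed.
import time
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA

def timed_fit(reducer, D, label):
    start = time.perf_counter()
    emb = reducer.fit_transform(D)
    print(f"[{label}] Elapsed: {time.perf_counter() - start:.5f}")
    return emb

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 10))     # 100 points in 10 dimensions
D = squareform(pdist(X))           # 100 x 100 precomputed distance matrix

emb = timed_fit(PCA(n_components=2), D, "Precomputed Euclidean (PCA stand-in)")
```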
Also, setting `random_state=42` didn't seem to affect it much.
Perhaps I should use PCA as a fast swap-in during testing, and then switch back to UMAP for production purposes. For the densMAP values (i.e. the radii), I might just need to use a random number generator.
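That swap-in idea could look something like the sketch below. The `embed_2d` helper and its `fast` flag are hypothetical names for illustration; the only real calls are sklearn's `PCA` and (on the slow path, via a deferred import) `umap.UMAP`:

```python
# Hypothetical swap-in pattern: PCA while testing, UMAP for production.
import numpy as np
from sklearn.decomposition import PCA

def embed_2d(X, fast=True, random_state=42):
    if fast:
        # Test/dev path: PCA is near-instant, even on tiny datasets.
        return PCA(n_components=2, random_state=random_state).fit_transform(X)
    # Production path: import deferred so tests never pay UMAP's startup cost.
    import umap
    return umap.UMAP(random_state=random_state).fit_transform(X)

X = np.random.default_rng(0).normal(size=(100, 10))
print(embed_2d(X, fast=True).shape)  # (100, 2)
```

Deferring the `umap` import also means the fast path works in environments where umap-learn isn't installed at all.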
There is definitely some overhead that is going to be bad for the small cases but largely irrelevant for the large cases. As to actually being slower for very small cases -- there's no obvious reason why that should be so, but it may be due to the optimization phase simply not working well for small data. The sampling-based methodology, particularly the negative sampling, works fine for larger datasets, but for very small datasets it starts violating its assumptions (namely, that a randomly chosen pair of samples is in all likelihood unrelated). This could conceivably slow things down (and I wouldn't trust the results either). As a rule of thumb I would be worried whenever n_neighbors / n_samples exceeds roughly 0.1 -- the smaller the ratio, the better with regard to the optimization assumptions.
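The rule of thumb above is easy to check up front. A small sketch (the `check_neighbor_ratio` helper is an assumed name, not part of umap-learn):

```python
# Sketch of the rule of thumb: flag configurations where
# n_neighbors / n_samples exceeds ~0.1, since negative sampling
# then starts violating its "random pairs are unrelated" assumption.
def check_neighbor_ratio(n_neighbors, n_samples, threshold=0.1):
    ratio = n_neighbors / n_samples
    if ratio > threshold:
        print(f"warning: n_neighbors/n_samples = {ratio:.3f} > {threshold}; "
              "optimization assumptions may be violated")
    return ratio

check_neighbor_ratio(15, 100)   # 0.150 -> warns: dataset too small
check_neighbor_ratio(15, 800)   # 0.019 -> fine
```

With UMAP's default `n_neighbors=15`, this is exactly the 100-point vs. 800-point split reported above.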
Is it normal for UMAP or densMAP to take 5-40 s to fit a small dataset of precomputed distances (e.g. a 10x10 distance matrix)? My "test" scripts for a separate project, which include a couple of UMAP calls, end up taking about as much time as if I had used a 10000 x 10000 distance matrix.

Reproducible script:
Output: