lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

ValueError: Precomputed metric requires shape (n_queries, n_indexed) #190

Open jolespin opened 5 years ago

jolespin commented 5 years ago

I wanted to bring this error message to your attention. I believe it is a little misleading, because the algorithm works with n_neighbors=15 but not with n_neighbors=3. Do you know what in the backend prevents it from working with n_neighbors=3 and raises the shape error?

umap.__version__
0.3.7

# Shape?
print(X.shape)
(5843, 5843)

# Symmetric?
def check_symmetric(a, tol=1e-8):
    return np.allclose(a, a.T, atol=tol)
print(check_symmetric(X))
True

# Nulls?
print(np.any(X.isnull()))
False

# Diagonal?
print(np.unique(np.diagonal(X.values)))
[0.]

# UMAP Precomputed
model = UMAP(n_neighbors=3, metric="precomputed")
embeddings = model.fit_transform(X)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-40-44805956fe15> in <module>
     18 # UMAP Precomputed
     19 model = UMAP(n_neighbors=3, metric="precomputed")
---> 20 embeddings = model.fit_transform(X)

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit_transform(self, X, y)
   1564             Embedding of the training data in low-dimensional space.
   1565         """
-> 1566         self.fit(X, y)
   1567         return self.embedding_
   1568 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in fit(self, X, y)
   1536             self.metric,
   1537             self._metric_kwds,
-> 1538             self.verbose,
   1539         )
   1540 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/umap_.py in simplicial_set_embedding(data, graph, n_components, initial_alpha, a, b, gamma, negative_sample_rate, n_epochs, init, random_state, metric, metric_kwds, verbose)
    941             random_state,
    942             metric=metric,
--> 943             metric_kwds=metric_kwds,
    944         )
    945         expansion = 10.0 / initialisation.max()

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in spectral_layout(data, graph, dim, random_state, metric, metric_kwds)
    238             random_state,
    239             metric=metric,
--> 240             metric_kwds=metric_kwds,
    241         )
    242 

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in multi_component_layout(data, graph, n_components, component_labels, dim, random_state, metric, metric_kwds)
    120             dim,
    121             metric=metric,
--> 122             metric_kwds=metric_kwds,
    123         )
    124     else:

~/anaconda/envs/µ_env/lib/python3.6/site-packages/umap/spectral.py in component_layout(data, n_components, component_labels, dim, metric, metric_kwds)
     51 
     52     distance_matrix = pairwise_distances(
---> 53         component_centroids, metric=metric, **metric_kwds
     54     )
     55     affinity_matrix = np.exp(-distance_matrix ** 2)

~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1381 
   1382     if metric == "precomputed":
-> 1383         X, _ = check_pairwise_arrays(X, Y, precomputed=True)
   1384         return X
   1385     elif metric in PAIRWISE_DISTANCE_FUNCTIONS:

~/anaconda/envs/µ_env/lib/python3.6/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype)
    118                              "(n_queries, n_indexed). Got (%d, %d) "
    119                              "for %d indexed." %
--> 120                              (X.shape[0], X.shape[1], Y.shape[0]))
    121     elif X.shape[1] != Y.shape[1]:
    122         raise ValueError("Incompatible dimension for X and Y matrices: "

ValueError: Precomputed metric requires shape (n_queries, n_indexed). Got (291, 5843) for 291 indexed.

lmcinnes commented 5 years ago

Ah, that's the multi-component spectral initialisation failing, because it doesn't support pre-computed metrics. I'm on vacation at the moment, but I can make a better error message when I get back.
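
For readers wondering where the (291, 5843) shape comes from, here is a minimal sketch. It assumes the multi-component layout averages the rows of each connected component into one centroid per component, which is consistent with the traceback but is an assumption about the internals. `precomputed_shape_ok` is a hypothetical, simplified stand-in for the shape check in scikit-learn's `check_pairwise_arrays`, and the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_graph_components = 12, 3  # toy sizes; the report has 5843 and 291

# Square precomputed distance matrix for random 2-D points.
points = rng.normal(size=(n_samples, 2))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Hypothetical connected-component label for every sample.
labels = np.arange(n_samples) % n_graph_components

# One centroid per component: the mean of that component's rows.
centroids = np.vstack(
    [dist[labels == c].mean(axis=0) for c in range(n_graph_components)]
)


def precomputed_shape_ok(X):
    """Simplified stand-in: a precomputed matrix compared with itself must be square."""
    return X.shape[0] == X.shape[1]


print(dist.shape, precomputed_shape_ok(dist))            # (12, 12) True
print(centroids.shape, precomputed_shape_ok(centroids))  # (3, 12) False
```

The centroid matrix has one row per component and one column per sample, so it is no longer a valid square "precomputed" input. With n_neighbors=15 the graph was presumably connected, so this code path was never reached.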

sleighsoft commented 5 years ago

It has been a while but this seems to be the cause: https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

This is SpectralClustering but the same goes for SpectralEmbedding which is used by UMAP. They both expect an affinity/similarity matrix and not a distance matrix.

This could probably be solved by using the solution provided in the link:

similarity = np.exp(-beta * distance / distance.std())

And then passing similarity to SpectralEmbedding within UMAP.
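
A minimal sketch of that conversion on a toy matrix. The function name `distance_to_similarity` and the default `beta=1.0` are assumptions for illustration; how UMAP would consume the result internally is not shown here, because the thread does not confirm it.

```python
import numpy as np


def distance_to_similarity(distance, beta=1.0):
    """Map a distance matrix to similarities in (0, 1]; zero distance maps to 1."""
    distance = np.asarray(distance, dtype=float)
    # Dividing by the standard deviation makes beta independent of the distance scale.
    return np.exp(-beta * distance / distance.std())


# Toy symmetric distance matrix with a zero diagonal.
distance = np.array([[0.0, 1.0, 4.0],
                     [1.0, 0.0, 3.0],
                     [4.0, 3.0, 0.0]])
similarity = distance_to_similarity(distance)
```

Larger `beta` values make the similarity fall off faster with distance, so a suitable value likely depends on the data.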

Pfeil commented 5 years ago

I also came across this problem. I calculated three distance matrices with three different (custom) metrics, and only one failed. I am not sure whether this makes the other two results wrong, but judging by sleighsoft's solution they probably are? Yet the results do not look that wrong, which is kind of dangerous. As a temporary workaround I now use init='random', which seems to work.
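
One possible reason only some matrices fail: going by lmcinnes's comment, the error path is reached only when the neighbour graph splits into several connected components. The sketch below counts components of a plain symmetrised k-nearest-neighbour graph built from a precomputed matrix. It only approximates the graph UMAP actually builds, and `knn_component_count` and `choose_init` are hypothetical helpers, not part of umap.

```python
import numpy as np


def knn_component_count(dist, n_neighbors):
    """Count connected components of the symmetrised k-nearest-neighbour graph."""
    n = dist.shape[0]
    # The n_neighbors smallest distances per row; with a zero diagonal this
    # includes the point itself.
    knn = np.argsort(dist, axis=1)[:, :n_neighbors]
    adjacency = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), n_neighbors)
    adjacency[rows, knn.ravel()] = True
    adjacency |= adjacency.T  # treat edges as undirected

    seen = np.zeros(n, dtype=bool)
    count = 0
    for start in range(n):
        if seen[start]:
            continue
        count += 1
        seen[start] = True
        stack = [start]
        # Depth-first traversal of everything reachable from `start`.
        while stack:
            node = stack.pop()
            for neighbour in np.flatnonzero(adjacency[node] & ~seen):
                seen[neighbour] = True
                stack.append(neighbour)
    return count


def choose_init(dist, n_neighbors):
    """Fall back to random initialisation when the graph is disconnected."""
    return "spectral" if knn_component_count(dist, n_neighbors) == 1 else "random"


# Two well-separated groups on a line.
positions = np.array([0.0, 1.0, 2.0, 3.0, 100.0, 101.0, 102.0, 103.0])
toy_dist = np.abs(positions[:, None] - positions[None, :])
print(knn_component_count(toy_dist, 3))  # 2
print(knn_component_count(toy_dist, 8))  # 1
```

The returned string could then be passed as the `init` argument, for example `UMAP(n_neighbors=3, metric="precomputed", init=choose_init(X.values, 3))`.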

Vykintasj commented 5 years ago

Hi, sorry, is this issue being looked into? If not, could you suggest methods for recreating the original data points when only a distance matrix is available? N(dim) is unknown in my case, but I assume a perfect embedding can be found by choosing N(dim)=N(samples).
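
As a hedged illustration of that idea, classical multidimensional scaling (MDS) recovers coordinates from a distance matrix, up to rotation and translation. It is exact only when the distances are Euclidean, in which case at most N(samples) - 1 dimensions are needed. For other metrics the negative eigenvalues are clipped and the result is an approximation. `classical_mds` is a hypothetical helper, not something umap provides.

```python
import numpy as np


def classical_mds(dist, n_dims):
    """Coordinates whose Euclidean distances approximate `dist` (classical MDS)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centred Gram matrix of the squared distances.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_dims]
    # Negative eigenvalues signal non-Euclidean distances; clip them to zero.
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


rng = np.random.default_rng(0)
original = rng.normal(size=(10, 3))
dist = np.linalg.norm(original[:, None, :] - original[None, :, :], axis=-1)

coords = classical_mds(dist, n_dims=3)
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

Here `recovered` matches `dist` to numerical precision because the toy distances are Euclidean in three dimensions. The recovered coordinates could then be fed to UMAP with an ordinary metric instead of "precomputed".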

charliemmm commented 1 year ago

Hello,

Is this issue being looked into at all? With the new HDBSCAN algorithm being implemented in scikit-learn, along with its upcoming medoid/centroid features, I hope somebody can help solve this issue.