lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.44k stars 807 forks source link

alignedUMAP ignores Disconnected_distance? #592

Open vpoulin opened 3 years ago

vpoulin commented 3 years ago

Hi,

From what I understand, when UMAP builds its knn-graph, it does not introduce edges between points at maximal distance (Disconnection_distance = 1 in case of Jaccard). I've observed that alignedUmap does not treat maximally distant points the same way and introduces edges between these points (and clusters these dissimilar points together).

from scipy import sparse
import numpy as np
import umap
import umap.aligned_umap

# Matrix M has 9 rows: 
# points 6,7,8 are maximally dissimilar to all other points (Jaccard similarity of 0)
i_index = [0,0,0,1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,7,8]
j_index = [1,2,3,0,2,3,0,1,3,4,5,6,4,6,7,5,6,7,8,9,10]
v_index = [1]*len(i_index)
M = sparse.csr_matrix((v_index,(i_index, j_index)))

# Align M with itself
slices = [M, M]
relations = [dict((i,i) for i in range(M.get_shape()[0]))]
embed = umap.AlignedUMAP(n_neighbors=3, metric='jaccard').fit(slices, relations=relations).embeddings_

# Visualize aligned embeddings
import matplotlib.pyplot as plt
plt.scatter(embed[0][:, 0],embed[0][:, 1])
plt.scatter(embed[1][:, 0],embed[1][:, 1], color='orange')
for i in [6,7,8]:
    plt.annotate(str(i), (embed[0][i, 0], embed[0][i, 1]))
    plt.annotate(str(i), (embed[1][i, 0], embed[1][i, 1]))
plt.title('Points 6,7 and 8 are clustered/aligned despite maximal pairwise distances.')    

#Compare to:
M_embed = umap.UMAP(metric='jaccard', n_neighbors=3).fit_transform(M)
plt.scatter(M_embed[:, 0], M_embed[:, 1])
lmcinnes commented 3 years ago

That's definitely a bug, but a somewhat non-obvious one to me. I'll have to dig around in the code a bit to figure out what is going on. Thanks for the report! These sorts of feature combination bugs are notoriously hard to comprehensively test for.

lmcinnes commented 3 years ago

It looks like it was pretty simple in the end -- and actually highlighted some significant shortcomings, which have since been fixed as well. So again, thanks for pointing this out -- helped fix a lot of other bugs as well.