jlmelville / uwot

An R package implementing the UMAP dimensionality reduction method.
https://jlmelville.github.io/uwot/
GNU General Public License v3.0
317 stars 31 forks source link

Using sparse or distance matrices as the input data yields unexpected clustering #59

Open erzakiev opened 4 years ago

erzakiev commented 4 years ago

Hello, James. First of all, thank you very much for your awesome implementation - it is the only one to date (in R, at least) that could handle sparse matrices as an input. So I have this pre-calculated (sparse) matrix of distances that I want to use as the input to the uwot::umap function.

The function doesn't seem to accept the classical matrix of pairwise distances between a set of vectors, not in sparse form, nor in the form of a dist class object. The first is very relevant to me, as I have a huge dataset that I want to embed into 2d and it just so happens that I was able to calculate its distance matrix.

Consider example - a pre-calculated sparse distance matrix of 10 binary vectors (hence the range from 0 to 1): X <- matrix(data = 1, nrow = 10, ncol = 10); X[2,1] <- 0.01; X[6,1] <- 0.01; X[6,2] <- 0.01; X[upper.tri(X)] <- 0; X <- as(X, "dgCMatrix"); diag(X) <- 0.

Screen Shot 2020-06-13 at 22 42 25

When I pass it to the umap function uwot::umap(X, 2) I get the error message

Error in sparse_nn(X, k, include_self = include_self) : Row 10 of distance matrix has only 0 defined distances

I can pass the X as a distance matrix: u <- uwot::umap(as.dist(X), 2) and then plot the resulting embeddings: plot(x = u[,1], y = u[,2]); text(x = u[,1], y = u[,2]+0.1, labels = c(1:10)). But they are clearly off - from the planted elements of the input matrix, we would expect observations 1,2 and 6 to be clustered closely together, while everything else being scattered uniformly on the 2d plane:

Screen Shot 2020-06-13 at 22 47 52

What is the format of the input distance matrix that the function expects? Please don't tell me that by sparse matrix you meant the output of knn clustering of observations...

jlmelville commented 4 years ago

Hello, unfortunately the sparse matrix support is really rudimentary at the moment.

The main problem you are running into is that the sparse matrix format needs to be symmetric, not upper or lower triangular. I don't know if this matches your real data, but in the example you provide, the distance matrix isn't sparse in the way uwot is expecting, so just to be clear: the idea is that the distance matrix contains a small number (at least n_neighbors, and hopefully much smaller than N) of explicit distances. The rest of the entries of the matrix are considered to be unknown or uninteresting, so in the provided example I would probably drop the distance = 1 entries. However, a further caveat is that you do need to be able to define a local neighborhood for most points so if your real data has a large number of entries where all the points are considered to be equidistant, you may not see good results with any implementation of UMAP.

Of course that sparse format still shouldn't require that that the sparse matrix be symmetric: that's a complete waste of memory and the lack of a check as to whether the supplied sparse matrix is triangular (and whether it's upper or lower) is a testament to how little thought I have given this code path. Let's leave this issue open as a bug, which I will endeavor to fix.

Anyway, if you can drop the uninteresting distances, then you may be able to proceed with X <- X + Matrix::t(X) with apologies for consuming too much RAM.

I think there is also a separate problem with the example you provided, which is the large number of tied distances, and which may be confusing matters. If you set verbose = TRUE and see a message after the 'Commencing smooth kNN distance calibration' that says 'X smooth knn distance failures' this indicates that a suitable bandwidth for the knn distance smoothing couldn't be found: in the given example it fails for every point. That's why you aren't seeing the expected clustering of 1, 2 and 6. Jittering the distances a bit to create an admittedly arbitrary breaking of the ties might help. Also, uwot follows Python UMAP's lead in specifying each item as being the closest neighbor of itself, so you also probably want n_neighbors = 3 in this specific example.

Another way forward is if you can do the nearest neighbor search on the sparse matrix outside of uwot. If you can kind the k-nearest neighbors of each point (or an approximation), you can then provide the indices of the neighbors and their distances as a list of two (dense) N x k matrices, as described at https://github.com/jlmelville/uwot#nearest-neighbor-data-format. For example, row 1 of the index matrix idx would be 1 2 6 with the equivalent first row of the dist matrix as 0 0.1 0.1.

Hope some of this helps.

jlmelville commented 4 years ago

If installing from github, there is now support for passing a lower or upper triangular sparse distance matrix to X. The expected format is still that uninterestingly large distances (i.e. those with no prospect of being in the k-nearest neighbors) are omitted.

Before hitting 1.0, this should probably be moved to being a supported input for nn_method rather than X (or a function to convert to the current input format should be made available).