feature request: UMAP connectivity and diagnostic plotting

maddyduran commented 3 years ago

It would be great and super useful to have the connectivity or diagnostic plotting features seen in the Python UMAP implementation.

Thanks for the great work!

vertesy commented 1 year ago

This would be really great!

jlmelville commented 1 year ago

I agree some kind of diagnostic plotting is necessary for any dimensionality reduction method which embeds a neighbor graph. I have written substantial amounts of R (and Python) plotting code for visualizing UMAP output but I don't really want to add it to uwot because I think it would result in a drastic increase in the maintenance burden.

Also I admit to being a bit of a skeptic that connectivity plots are that useful for static output. For interactive plotting it's a different matter, I think they are very informative there. But I am not sure what would constitute a useful contribution. plotly is adequate for my needs. Seems like I could end up having to support multiple output styles (e.g. base graphics, ggplot2, plotly) and still not offer something that fits into most people's workflows or graphics needs.

That said it's a bit hypocritical of me to say that diagnostic plotting is necessary and then resolutely refuse to provide any help.

vertesy commented 1 year ago

I think the reason a static connectivity plot is helpful is that it shows you which distances are actually meaningful on a standard 2D UMAP.

E.g. 2 clusters may sit equally close to a third cluster, but only one of them is close due to connectedness and thus meaningful; the other may only end up at the same distance because of the dimensionality compression/reduction.
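One way to check this numerically rather than by eye is to count how many nearest-neighbor links run between each pair of clusters. This is a minimal sketch, not something uwot provides: the `idx` matrix and `labels` vector are toy values made up for illustration, and in practice `idx` is assumed to be a neighbor index matrix with one row per point.

```r
# Toy neighbor index matrix (assumption): row i holds the indices of
# point i's nearest neighbors in the original space.
idx <- rbind(
  c(2, 3),  # point 1, cluster a
  c(1, 3),  # point 2, cluster a
  c(2, 4),  # point 3, cluster a, one neighbor in cluster b
  c(5, 3),  # point 4, cluster b, one neighbor in cluster a
  c(4, 3),  # point 5, cluster b, one neighbor in cluster a
  c(7, 8),  # point 6, cluster c
  c(6, 8),  # point 7, cluster c
  c(6, 7)   # point 8, cluster c
)
# Toy cluster labels (assumption), one per point.
labels <- c("a", "a", "a", "b", "b", "c", "c", "c")

# Cross-tabulate the cluster of each point against the cluster of each
# of its neighbors. Off-diagonal counts are between-cluster links.
cross <- table(from = labels[row(idx)], to = labels[idx])
cross
```

In this toy case clusters a and b share links while c has none to either, so only the a-b proximity in an embedding would be backed by the neighbor graph.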

I understand and agree that implementing different plotting frameworks can cause a large burden, but it may not be necessary.

jlmelville commented 1 year ago

> E.g. 2 clusters may sit equally close to a third cluster but only one of them is close due to connectedness, thus meaningful, the other may only end up at the same distance because of the dimensionality compression/reduction.

Agreed about the intention. I suppose I should try and implement it and then be prepared to eat my words.

jlmelville commented 1 year ago

My initial experiments with connectivity plotting have confirmed my suspicions that without access to something that works like datashader (which the Python connectivity plotter makes use of), the naïve approach of plotting lines between the n_neighbors nearest neighbors from the original space quickly scales beyond feasibility.
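To put rough numbers on that scaling: drawing every neighbor edge needs about N × n_neighbors line segments. A back-of-the-envelope sketch follows. The helper name `n_segments` is made up for illustration, and the count is an upper bound if a point's neighbor list includes the point itself.

```r
# Hypothetical helper: upper bound on the number of segments drawn when
# every point is connected to each of its n_neighbors nearest neighbors.
n_segments <- function(n_points, n_neighbors = 15) {
  n_points * n_neighbors
}

n_segments(150)    # iris-sized data: 2250
n_segments(70000)  # MNIST-sized data: 1050000
```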

As an alternative, I considered plotting just the connection between each point and the furthest of its nearest neighbors. Closer neighbors are more likely to be embedded close to the point, so including them would probably give a higher proportion of uninteresting within-cluster lines.

Here's what this looks like for iris:

That looks ok, although I should stress that I have zero evidence that displaying the furthest nearest neighbor connection gives useful information about clusters or connectivity.

But iris only contains 150 points. Here is a bog-standard UMAP of the MNIST digits (N = 70,000), a more realistic case:

And here are the 15-neighbor connectivities (the equivalent of the iris plot above):

I still don't consider that static output to be all that useful, and don't actually have a way to produce an equivalent interactive plot for this yet. The very simplified method of producing those connections may also be misleading or unhelpful. A more sophisticated method processing all the neighbor connectivities to leave only the "useful" ones seems like a substantial research project on its own.

Not sure when or if I will pursue this further, but if you are able to get the data into a form that lets you use uwot directly on a matrix or data frame (not sure how easy that is to extract from e.g. Seurat workflows) you can play about with this yourself:

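conn_plot <-
  function(model,
           X,
           alpha_scale = 0.5,
           color = "black",
           lwd = 1,
           nn = NULL) {
    # Convert the input to a numeric matrix (uwot internal helper)
    X <- uwot:::x2m(X)
    if (is.null(nn)) {
      if (!is.null(model$nn)) {
        # Reuse the neighbors stored in the model when ret_nn = TRUE was used
        nn <- model$nn[[1]]
      } else {
        # Otherwise search the stored index again (slow for large data)
        nn <-
          uwot:::annoy_search(X, k = model$n_neighbors, ann = model$nn_index)
      }
    }

    # Keep only the furthest of each point's n_neighbors nearest neighbors
    nnf <- nn$idx[, model$n_neighbors, drop = FALSE]
    # Two-column matrix of (point index, neighbor index) pairs
    pairs <- as.matrix(reshape2::melt(nnf)[, c(1, 3)])

    coords <- model$embedding

    # Embedded coordinates of each point ...
    x0 <- coords[pairs[, 1], 1]
    y0 <- coords[pairs[, 1], 2]

    # ... and of its furthest nearest neighbor
    x1 <- coords[pairs[, 2], 1]
    y1 <- coords[pairs[, 2], 2]

    # Draw the connections on top of the existing plot
    segments(
      x0 = x0,
      y0 = y0,
      x1 = x1,
      y1 = y1,
      col = grDevices::adjustcolor(color, alpha.f = alpha_scale),
      lwd = lwd
    )
  }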

Example of using it with iris:

library(uwot)

# ret_nn = TRUE is optional but strongly recommended
model <- umap(iris, ret_model = TRUE, ret_nn = TRUE)
plot(model$embedding, col = iris$Species)
# or vizier::embed_plot(model$embedding, iris)
conn_plot(model, iris, alpha_scale = 0.1)

Note:

You need to have reshape2 installed.
You need to have plotted the initial dataset yourself separately, via something like plot. Something as simple as plot(model$embedding) will do, but you'll need to work out point sizes, colors and so on.
On an MNIST-sized dataset, the function takes a while to run because it has to find the nearest neighbors, and then just plotting all those lines takes ages even after the function returns. Obviously caching the nearest neighbors would help here, which you can do by generating the original UMAP model with ret_nn = TRUE. Even then, be prepared to wait several minutes with seemingly nothing happening.
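If installing reshape2 is a nuisance, the (point, furthest neighbor) index pairs can probably be built in base R instead. A minimal sketch, assuming the neighbor indices are a matrix with one row per point and neighbors ordered nearest first; the `idx` values below are a toy stand-in, not real output:

```r
# Toy stand-in for a neighbor index matrix (assumption): row i lists
# point i's neighbors, nearest first.
idx <- rbind(
  c(1, 2, 3),
  c(2, 1, 3),
  c(3, 2, 1),
  c(4, 3, 2)
)
k <- ncol(idx)

# Pair each point's row number with its k-th (furthest) listed neighbor.
pairs <- cbind(seq_len(nrow(idx)), idx[, k])
pairs
```

The resulting two-column matrix should hold the same index pairs as the melt-based line in conn_plot, just without the column names.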

vertesy commented 1 year ago

Thank you!

jlmelville commented 5 months ago

https://schochastics.github.io/edgebundle/ seems worth exploring

jlmelville / uwot

feature request: UMAP connectivity and diagnostic plotting #65