lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

Using a distance matrix as input without specifying the metric as precomputed. Why do the results look better? #725


jmrondina commented 3 years ago

Thanks @lmcinnes for this amazing library!

I am using UMAP with a pre-computed dissimilarity matrix.
However, I noticed that when I inadvertently forgot to specify the metric, the results looked better than when I defined the metric as precomputed. What happens in this case? Does UMAP fall back to the default metric (euclidean) and compute pairwise distances between rows, treating the already calculated distances between data points as feature columns?

Many thanks.

lmcinnes commented 3 years ago

If you don't specify the metric as precomputed, UMAP will interpret the input as data and use the default metric (euclidean) to compute distances between rows. So the result would have been euclidean distances computed between rows of the distance matrix.
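For concreteness, a minimal sketch of the two calls being compared (a random symmetric matrix stands in for an actual dissimilarity matrix here):

```python
import numpy as np
import umap

# Toy stand-in for a precomputed dissimilarity matrix (e.g. Gower):
# any symmetric n x n matrix with a zero diagonal.
rng = np.random.default_rng(42)
D = rng.random((100, 100))
D = (D + D.T) / 2
np.fill_diagonal(D, 0.0)

# Intended usage: tell UMAP the input is already a distance matrix.
emb_precomputed = umap.UMAP(metric="precomputed").fit_transform(D)

# What happens when the metric is left at its default: each row of D is
# treated as a feature vector, so UMAP computes euclidean distances
# between rows of the distance matrix.
emb_accidental = umap.UMAP().fit_transform(D)
```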

jmrondina commented 3 years ago

Many thanks @lmcinnes for your prompt response.

This is exactly what I thought was happening, i.e. UMAP computes Euclidean distances between each pair of rows of the already computed distance matrix (in my case, a Gower dissimilarity matrix).

Thus I suppose this is the wrong approach, right? Not specifying the metric as precomputed causes the input to be interpreted as data (where each column should contain values of a variable rather than distance measures between data points).

Nevertheless, the results I obtain without specifying the metric seem to be better than when I set it to precomputed. Would you have any idea why this could be happening?

Please see below examples of plots using each approach.

No metric: [plot: ischaemic_umap_gower_standard_metric_clusters_NN15_MD01_CS50_MS50]

Precomputed metric: [plot: ischaemic_umap_gower_precomputed_metric_clusters_NN15_MD01_CS50_MS50]

Thank you, Jane

lmcinnes commented 3 years ago

There are reasons this could happen, and measuring euclidean distance between rows of a distance matrix is not inherently a bad thing to be doing -- you would just want some good a priori reasons for doing so.

My best guess for the difference would be as follows: euclidean distance gives significantly more weight to features with large entries, so when measuring euclidean distance between vectors of distances to all other points, the emphasis falls on the largest distances. This means that points will be close in this measure if they are similarly positioned with respect to the points that are furthest away from them. In general I would expect this to result in emphasising clusters -- which looks like what you are seeing. Whether these are valid clusters is another question.
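A tiny toy illustration of that weighting (made-up numbers, not your data): a disagreement in a column of large distances swamps a disagreement in a column of small ones.

```python
import numpy as np

# Two rows of a toy distance matrix: each row holds one point's distances
# to a nearby reference point and to a far-away one.
row_a = np.array([0.1, 90.0])
row_b = np.array([0.3, 110.0])

diff = row_a - row_b
print(diff ** 2)             # [4.e-02 4.e+02] -- the far column dominates
print(np.linalg.norm(diff))  # ~20.0, essentially just the far-column gap
```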

It also remains the case that you would want some a priori reason for doing this, or, better still, a reason that suggests why this might be successful and what a better approach actually is. It might be that you are looking for correlations in distance structure among points. Alternatively, you can view the precomputed distance matrix as constructing a weighted graph of your data, in which case you are interested in the representation and clustering of such a graph. A lot will depend on how your original distance matrix is computed and what the application is. I'm afraid I can't help as much with that.

jmrondina commented 3 years ago

Dear @lmcinnes,

Many thanks for your insightful comments!
It is very reassuring that measuring Euclidean distance between rows of a distance matrix is not conceptually wrong.

Initially I didn't actually have an a priori reason for the approach, as I had inadvertently omitted the definition of the metric as "precomputed" when using a dissimilarity matrix as input. I soon noticed the difference between the results obtained with and without the explicit metric definition and became intrigued.

It makes sense that when measuring Euclidean distance between vectors of distances to all other points, larger distances would be emphasised, which in turn would favour clustering. I therefore tried to evaluate the validity of the resulting clusters by comparing the differences of the variables in the original space among the obtained clusters (one vs. the others), aiming to identify patterns that characterise each cluster. The result was interesting, as the clusters seem to make sense clinically. Clusters obtained with the metric defined as precomputed also made sense at a higher level (i.e. 2 clusters only), but they were not further subdivided unless the minimum cluster size was set very low, due to the large differences in densities.
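The comparison was roughly of this form -- a sketch only, where `df` (the original variables), `labels` (the cluster assignments), and the choice of tests are all placeholders rather than the exact analysis:

```python
import pandas as pd
from scipy import stats

def characterise_cluster(df: pd.DataFrame, labels, cluster_id):
    """Compare each original variable inside one cluster against all other points."""
    in_cluster = labels == cluster_id
    p_values = {}
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series) and not pd.api.types.is_bool_dtype(series):
            # Numeric variable: rank-based test, no normality assumption.
            _, p = stats.mannwhitneyu(series[in_cluster], series[~in_cluster])
        else:
            # Categorical/binary variable: chi-square on the contingency table.
            table = pd.crosstab(in_cluster, series)
            _, p, _, _ = stats.chi2_contingency(table)
        p_values[col] = p  # correct for multiple comparisons before interpreting
    return p_values
```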

As my data contains mixed types (numeric, categorical, and binary variables), my initial approach was to use precomputed Gower distances. But I also tried combining views using the facilities provided in the latest version of umap for combining intermediate fuzzy topological representations. This resulted in clusters showing patterns similar to those obtained with the first approach (Euclidean on the Gower distances without the precomputed metric), even though their spatial distributions look very different.
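As a sketch of that kind of composition (toy data; the variable names are placeholders; umap-learn >= 0.5 supports set operations on fitted models):

```python
import numpy as np
import pandas as pd
import umap

rng = np.random.default_rng(0)
n = 200

# Toy stand-ins for the three variable types.
numeric = rng.normal(size=(n, 5))
binary = pd.DataFrame(rng.integers(0, 2, size=(n, 8)).astype(bool))
categorical = pd.Series(rng.choice(["a", "b", "c"], size=n))

# One-hot encode the categoricals and concatenate with the binary columns.
sparse_part = pd.concat([pd.get_dummies(categorical), binary], axis=1)

# One UMAP model per view, with a metric suited to each data type.
fit_num = umap.UMAP(metric="euclidean").fit(numeric)
fit_bin = umap.UMAP(metric="dice").fit(sparse_part.to_numpy(dtype=bool))

# Set operations on fitted models combine their intermediate fuzzy
# topological representations.
intersection = fit_num * fit_bin   # structure must be supported by both views
union = fit_num + fit_bin          # structure from either view
contrast = fit_bin - fit_num       # the "contrast" approach described above
embedding = contrast.embedding_
```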

Summarising, my questions are:

1) Have you come across other applications where this happens (i.e. Euclidean distance applied to the rows of a dissimilarity matrix producing apparently better results than the precomputed distance)?

2) Would you have suggestions on the best ways to combine numeric, categorical and sparse binary variables through operations of union, intersection and contrast? It seems to involve some trial and error, but the best results I have obtained so far came from concatenating one-hot encoded categorical data with the binary variables and contrasting (subtracting) the numeric representation from them.

Thank you very much for your valuable help.

lmcinnes commented 2 years ago

Sorry for the (very) slow reply. I have not actually seen other applications doing this. That doesn't mean they don't exist, just that I have not come across them. As I mentioned, it is not uncommon to construct graph representations of data and then operate on those, and in some sense that can be an interpretation of what you are doing here (viewing the precomputed distance matrix as the adjacency matrix of a weighted graph). Usually, however, one would want a similarity graph in those situations.
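To make that reading concrete, converting the dissimilarity matrix into a similarity graph might look something like this (a sketch assuming a Gaussian kernel; the kernel choice and bandwidth are arbitrary, not a prescribed recipe):

```python
import numpy as np

def distances_to_similarity_graph(D: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Turn a dissimilarity matrix into a weighted adjacency (similarity) matrix."""
    W = np.exp(-(D ** 2) / (2 * sigma ** 2))  # Gaussian kernel on distances
    np.fill_diagonal(W, 0.0)                  # no self-loops
    return W
```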

As for the best way to combine multiple metrics/data types: I am afraid that is very definitely an art. Naively applying the intersection to everything is likely the best default option. A lot, however, depends on what you are trying to achieve. Ultimately, anything you get out of this sort of process is really just a lens on the data -- a particular view that highlights some things but obscures others. What is "best" depends on what you want to see. The real test, having used such a thing to find interesting-looking clusters to explore, is whether those clusters have real meaning when analysed with other techniques. You seem to suggest that they do, in that they "make sense clinically", but it would be best to have some concrete ways to measure that as well.