lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

[Proposal] Option to return sparse graph instead of embedding #493

Open ulupo opened 3 years ago

ulupo commented 3 years ago

I was wondering whether the following proposal is worth discussing. I am fully sold on UMAP as a dimensionality reduction algorithm whose effectiveness depends in large part on the quality of its embeddings for downstream tasks.

However, I also think that the abstract sparse graphs ("fuzzy simplicial sets") UMAP computes as part of its fit routine, prior to the embedding step, have value in their own right. One could compute all sorts of invariants from these graphs directly (you probably already suspect which sort of invariants I would like to access in giotto-tda).

In brief, I'm wondering whether there could be scope for extending the UMAP API as follows: a new init parameter mode could be added to the UMAP constructor. The default value could be 'embedding', leading to the current behaviour. There could then be another value, say 'graph' or 'fuzzy_simplicial_set', and a UMAP instance instantiated with this mode would return a sparse graph instead of an embedding in fit_transform (and skip the embedding step, of course).
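A minimal sketch of how such a `mode` switch could behave, for discussion only. `ToyReducer` is a hypothetical stand-in written in pure Python, not UMAP: its graph is a plain k-nearest-neighbour graph stored as a dict, and its "embedding" is a placeholder. Only the parameter name `mode` and the values `'embedding'` and `'graph'` come from the proposal above; every other name is an assumption.

```python
import math


class ToyReducer:
    """Hypothetical stand-in illustrating the proposed `mode` parameter (not UMAP)."""

    def __init__(self, n_neighbors=2, mode="embedding"):
        if mode not in ("embedding", "graph"):
            raise ValueError(f"unknown mode: {mode!r}")
        self.n_neighbors = n_neighbors
        self.mode = mode

    def _build_graph(self, X):
        # Sparse directed k-NN graph: {(i, j): distance from point i to point j}.
        graph = {}
        for i, xi in enumerate(X):
            dists = sorted(
                (math.dist(xi, xj), j) for j, xj in enumerate(X) if j != i
            )
            for d, j in dists[: self.n_neighbors]:
                graph[(i, j)] = d
        return graph

    def _embed(self, graph, n_points):
        # Placeholder "embedding": one coordinate per point, equal to the sum
        # of its outgoing edge weights. A real layout step would go here.
        coords = [0.0] * n_points
        for (i, _j), d in graph.items():
            coords[i] += d
        return [[c] for c in coords]

    def fit_transform(self, X):
        self.graph_ = self._build_graph(X)
        if self.mode == "graph":
            # Graph-only mode: return the sparse graph and skip the embedding.
            return self.graph_
        self.embedding_ = self._embed(self.graph_, len(X))
        return self.embedding_


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
graph = ToyReducer(n_neighbors=1, mode="graph").fit_transform(points)
embedding = ToyReducer(n_neighbors=1).fit_transform(points)
```

The point of the sketch is the control flow in `fit_transform`: graph construction always runs, and the embedding step runs only in the default mode, so existing callers would see no change.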

lmcinnes commented 3 years ago

You can access the graph representation right now as the graph_ attribute of a fitted model, so that is something. The catch is that it doesn't fit into pipelines, as I presume you would like. Some recent changes in 0.5dev have split fit into separate graph construction and embedding construction steps (not least to aid in some work on an implementation of Parametric UMAP using neural networks to learn an embedding function). Given that, it seems like this would be quite feasible. I'll see what I can do.
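Until something like that lands, one possible stopgap is a thin adapter that returns the fitted model's `graph_` from `fit_transform`. The sketch below is an assumption, not part of umap: `GraphTransformer` is a hypothetical name, and it only relies on the wrapped estimator having `fit(X, y)` and exposing `graph_` afterwards. Note that it does not skip the embedding step; the wrapped estimator still runs its full fit.

```python
class GraphTransformer:
    """Hypothetical adapter: returns a fitted estimator's `graph_` from fit_transform."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Delegates to the wrapped estimator, which is expected to set
        # `graph_` as a side effect of fitting.
        self.estimator.fit(X, y)
        return self

    def fit_transform(self, X, y=None):
        # The full fit (including any embedding step) still runs here;
        # only the return value changes.
        self.fit(X, y)
        return self.estimator.graph_
```

With umap-learn installed, usage would look like `GraphTransformer(umap.UMAP(n_neighbors=15)).fit_transform(X)`. To behave as a proper scikit-learn step, a real version would also need `get_params`/`set_params`, for example by subclassing `sklearn.base.BaseEstimator`.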

ulupo commented 3 years ago

Thanks @lmcinnes! Interesting to know about the work on Parametric UMAP... is there a reference already?

Yes, indeed, the ability to place the graph-only version into a pipeline is where I was coming from. Great to see it might fit into the development roadmap.

lmcinnes commented 3 years ago

All the work on Parametric UMAP is by Tim Sainburg. He has a paper in the works, so there will hopefully be something soon. In the meantime you can check #489 for the PR in progress.