lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License
7.36k stars 799 forks source link

Suggestion: pass sparse matrices to AlignedUMAP with consistent row indices, rather than requiring relations dicts #543

Open dmarx opened 3 years ago

dmarx commented 3 years ago

Love the addition of the Aligned UMAP, but I think the relations dict piece is an unnecessary overcomplication. Just have the user pass in sparse matrices where entity_i is always represented by row_i across slices. If that entity doesn't appear in a particular slice, just don't populate its row.

Specifically, this will resolve cases where an entity is not present across all time slices. Chaining relation dicts (per the current implementation) only works for entities that appear in all slices. If an entity is absent from slice i, then there's no way dict chaining can assist aligning its appearance in slice{i-1} with slice{i+1}.

Under this approach, we can treat the row index as a unique identifier which is consistent across all slices. As long as we're already asking the users to be responsible for constructing the sequence of relations dicts, I don't think imposing this kind of constraint is a huge ask. At the very least, it could be an alternative way of invoking AlignedUMAP, leaving the relations dicts available as an optional feature for anyone who prefers it. Additionally, it also creates opportunity to pass the slices in as a tensor instead of a list.

lmcinnes commented 3 years ago

Thanks for the suggestion @dmarx . This was actually an approach I did consider, but there was the difficulty that all zero sample vectors do occur in real data, and potentially pose a problem for this approach. In the end I went with the more generic (though possibly less immediately user friendly) approach of using the relation dictionaries. I am still open to playing with the API however, and I will certainly consider the option as a potentially overloaded / alternate option in the future. To provide the data as a tensor in this form it would likely be best to make use of the pydata/sparse package, which isn't yet supported by umap (just scipy.sparse).

In terms of allowing relations that skip over slices, as you suggest: my hope had been for a more complex API option in which one provides a DAG of overarching relations, and a relation dictionary for each edge in the DAG. This would allow for a much greater degree of generality again -- beyond simple linear arrangements of slices. That, however, is for future work if at all.

dmarx commented 3 years ago

I hadn't thought about zero-valued vectors, that makes sense. Maybe a work around could be allowing the user to optionally pass in a list/tensor of indices that are active at each time slice so users can make it explicit if any of the "empty" vectors should be considered present?

I was thinking of tinkering with this myself but I've been having some difficulty navigating some of the code. Specifically, it's unclear to me how slices within the alignment_window radius get used. Do we only consider the slices at the window extremes, or all slices in the range? Will there be a publication describing AlignedUMAP?

Also, while we're chatting: another idea I had for tinkering with AlignedUMAP was adding support for aligning with parameterized UMAP, where the alignment could use the parameterizations from other slices as weight initializations and use the alignment cost in the optimization. Anything like this in the works?

lmcinnes commented 3 years ago

Thanks for looking into it -- I would certainly welcome PRs for alternative APIs (ideally adding options to the existing one). With respect to the aligment_window_radius: everything inside the window range is used for alignment, however there is an effective decaying kernel applied (so more distance slices apply less regularisation than closer ones).

In terms of publication -- I hadn't thought too hard about it. In practice the idea itself is relatively straightforward (and, indeed, there are some papers that independently arrived at essentially similar solutions (2-MAP for aligning two slices and a general approach for embedding alignment), as well as older papers such as Dynamic t-SNE that, again, arrive at a roughly similar approach.

In the case AlignedUMAP there are some specific details, such as the use of variable relations (and ultimately, ideally DAGs of relations) among slices, windowing of slices, chaining of relations, local neighborhood stability weighting, and the exact regularisation measures used that are unique, but I am not sure if they amount to a paper. I could be convinced otherwise -- especially if the code is challenging for others to navigate without some clearer foundations (although perhaps some better code commenting on my part could alleviate some of that).

From a more theoretical standpoint I have given some talks on this approach, drawing on the MAPPER algorithm as my means of deriving the approach. You can find one here although depending on background this may or may not be the best introduction.

A combination of AlignedUMAP and ParametricUMAP (using transfer learning techniques) is certainly something I have thought about, and I agree that it is a very promising approach. I don't have anything in the works on that right now -- I am having to put aside major UMAP work for a little while. I would be very excited to see such a thing if other people were interested in taking it on however, and would be happy to help out where possible.

One final thing: another option I have been thinking about, but haven't come up with a clear idea and / or implementation of is attempting to regularise not only the position of a point across slices, but also the velocity of the point (i.e. the differential between slices). I have rough ideas, but do not currently have the time to try them in practice.