
Multiplying by edge_weights when doing keras nonparametric UMAP #803

Open jonathangomesselman opened 2 years ago

jonathangomesselman commented 2 years ago

Within the ParametricUMAP class, one parameter of particular interest is parametric_embedding. The documentation describes that when it is set to False, a non-parametric embedding is learned using the same code as the parametric embedding, which can serve as a direct comparison between parametric and non-parametric embeddings trained with the same optimizer. However, on closer inspection it seems that in the non-parametric case the compute_loss function is slightly different: the loss is additionally multiplied by the edge_weights:

    if not parametric_embedding:
        # multiply loss by weights for nonparametric
        weights_tiled = np.tile(edge_weights, negative_sample_rate + 1)

    ...

    if not parametric_embedding:
        ce_loss = ce_loss * weights_tiled

Could someone provide some intuition for this difference, and whether it affects our ability to directly compare parametric and non-parametric embeddings? Specifically, I am hoping to use the ce_loss as a proxy for comparing the performance of PUMAP vs. UMAP on my data.

lmcinnes commented 2 years ago

This is the right thing to do. The difference is that the SGD in the basic UMAP implementation uses sampling to apply edge weights (edges are sampled proportionally to their weights). For a generic optimizer that does not do that sampling, we need to apply the edge weights explicitly.
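As an illustration of why the two routes are interchangeable (a minimal NumPy sketch with made-up weights and losses, not code from the library): sampling edges in proportion to their weights and averaging the unweighted per-edge losses gives, in expectation, the same quantity as summing the explicitly weighted losses over all edges.

    import numpy as np

    rng = np.random.default_rng(0)
    n_edges = 1000
    edge_weights = rng.random(n_edges)        # hypothetical graph edge weights
    per_edge_loss = rng.random(n_edges)       # stand-in for each edge's cross-entropy term

    # Generic-optimizer route: weight every edge's loss explicitly.
    weighted_loss = np.sum(edge_weights * per_edge_loss)

    # Classic UMAP route: sample edges proportionally to their weights
    # and accumulate the *unweighted* losses of the sampled edges.
    n_draws = 200_000
    probs = edge_weights / edge_weights.sum()
    sampled = rng.choice(n_edges, size=n_draws, p=probs)
    sampling_estimate = per_edge_loss[sampled].mean() * edge_weights.sum()

    print(weighted_loss, sampling_estimate)   # agree up to Monte Carlo noise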

jonathangomesselman commented 2 years ago

Thank you @lmcinnes! While digging a bit more through the code yesterday, I think I stumbled upon this. I had not realized that when optimizing UMAP through Keras you optimize over all of the edges at once rather than doing per-edge SGD as in UMAP. Since I am interested in comparing the loss across UMAP and PUMAP, do you feel the losses are still comparable (i.e., in magnitude)?

I had two additional questions about this. First, mathematically, is there a reason you chose per-edge SGD optimization in UMAP rather than mini-batch or full-batch optimization? For example, as the dataset size increases, do you expect issues where an update to the embedding of one edge quickly gets undone or changed when updating another edge?

My last question is why you decided to implement the TensorFlow version of non-parametric UMAP with full-batch optimization rather than piggybacking on the ParametricUMAP implementation, which uses mini-batch SGD and incorporates sampling by edge weights by computing epochs_per_sample?
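(For context, the sampling-by-weight mechanism I am referring to works roughly as follows. This is a simplified sketch of the epochs_per_sample idea, not the exact library code: each edge is revisited at a frequency proportional to its weight, so the heaviest edge is updated every epoch and lighter edges less often.)

    import numpy as np

    def epochs_per_sample(weights, n_epochs):
        # Heavier edges are updated more often: an edge at the maximum weight is
        # touched every epoch, one at half the maximum every second epoch, etc.
        result = -1.0 * np.ones(len(weights))
        n_samples = n_epochs * (weights / weights.max())
        result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]
        return result

    weights = np.array([1.0, 0.5, 0.1])
    print(epochs_per_sample(weights, n_epochs=200))   # -> [ 1.  2. 10.]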

Thanks so much for the feedback! The UMAP algorithm is fascinating, and the paper is very well written. I have thoroughly enjoyed getting deeply acquainted with this work.