jonathangomesselman opened this issue 2 years ago
This is the right thing to do; the difference is that the SGD in the basic UMAP implementation uses sampling to apply edge weights (edges are sampled proportionally to their weights). For a generic optimizer not doing sampling we need to actually apply the edge weights.
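Schematically (a rough sketch of the idea, not the actual code in either implementation), the two ways an edge weight can enter the optimization look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

edge_losses = np.array([0.7, 0.2, 0.9])    # hypothetical per-edge loss values
edge_weights = np.array([1.0, 0.25, 0.5])  # hypothetical membership strengths

# Option 1: apply weights by sampling, as in the per-edge SGD loop.
# Edges are drawn with probability proportional to their weight, so each
# edge contributes weight_i * loss_i to the objective in expectation.
probs = edge_weights / edge_weights.sum()
draws = rng.choice(len(edge_losses), size=100_000, p=probs)
sampled_estimate = edge_losses[draws].mean() * edge_weights.sum()

# Option 2: apply weights explicitly, which is what a generic optimizer
# that sees every edge (or a whole mini-batch of edges) needs to do.
weighted_sum = (edge_weights * edge_losses).sum()

print(sampled_estimate, weighted_sum)  # both approximate the same objective
```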
Thank you @lmcinnes! While digging a bit more through the code yesterday, I think I stumbled upon this. I did not realize that when optimizing UMAP through Keras you are optimizing all of the edges at once rather than doing per-edge SGD as in UMAP. Since I am interested in comparing the loss across UMAP and PUMAP, do you feel the losses are still comparable (i.e., similar in magnitude)?
I had two additional questions regarding this. First, mathematically, is there a reason you decided to do per-edge SGD optimization in UMAP rather than mini-batch or full-batch optimization? For example, as the dataset size increases, do you expect issues where an update to the embeddings for one edge quickly gets undone or changed when updating another edge?
My last question is why you decided to implement the TensorFlow version of non-parametric UMAP with full-batch SGD rather than just piggybacking on the ParametricUMAP implementation, which uses mini-batch SGD and incorporates sampling by edge weights by computing `epochs_per_sample`?
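For context, my understanding of the `epochs_per_sample` schedule (a sketch of the idea, not a copy of the library code) is that each edge is assigned an update interval so that the highest-weight edge is updated every epoch and lower-weight edges are updated proportionally less often:

```python
import numpy as np

def epochs_per_sample_sketch(weights, n_epochs):
    """Sketch of a weight-proportional sampling schedule: an edge with the
    maximum weight is updated every epoch; an edge with half that weight is
    updated every other epoch, and so on."""
    weights = np.asarray(weights, dtype=np.float64)
    n_samples = n_epochs * (weights / weights.max())  # expected update count per edge
    result = np.full(weights.shape[0], -1.0)          # -1 marks edges that are never updated
    result[n_samples > 0] = float(n_epochs) / n_samples[n_samples > 0]
    return result

print(epochs_per_sample_sketch([1.0, 0.5, 0.1], n_epochs=200))
# [ 1.  2. 10.]  -> update every 1, 2, and 10 epochs respectively
```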
Thanks so much for the feedback! The UMAP algorithm is fascinating, and the paper is very well written. I have thoroughly enjoyed getting deeply acquainted with this work.
Within the `ParametricUMAP` class, one parameter of particular interest is `parametric_embedding`. The documentation describes that when set to `False`, a non-parametric embedding is learned using the same code as the parametric embedding, which can serve as a direct comparison between parametric and non-parametric embeddings using the same optimizer. However, upon closer inspection it seems that in the non-parametric case the `compute_loss` function is slightly different: the loss is additionally multiplied by the edge weights. Could someone provide an intuition for this difference and whether it affects our ability to directly compare parametric and non-parametric embeddings? Specifically, I am hoping to use the `ce_loss` as a proxy for comparing the performance of PUMAP vs. UMAP on my data.
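To make the comparison concrete, the difference I am asking about is essentially of the following form (my own paraphrase with hypothetical names, not the actual `compute_loss` code):

```python
import numpy as np

def fuzzy_cross_entropy(p_graph, p_dist, eps=1e-4):
    """Per-edge fuzzy cross-entropy between the high-dimensional (graph) and
    low-dimensional (embedding) edge probabilities: attraction + repulsion."""
    attraction = -p_graph * np.log(np.clip(p_dist, eps, 1.0))
    repulsion = -(1.0 - p_graph) * np.log(np.clip(1.0 - p_dist, eps, 1.0))
    return attraction + repulsion

# Hypothetical per-edge quantities
p_graph = np.array([1.0, 1.0, 1.0])        # graph probabilities for sampled positive edges
p_dist = np.array([0.9, 0.4, 0.7])         # low-dimensional edge probabilities
edge_weights = np.array([1.0, 0.25, 0.5])  # membership strengths of those edges

ce = fuzzy_cross_entropy(p_graph, p_dist)

# Parametric path: edges are already drawn in proportion to their weight,
# so the per-edge terms are averaged directly.
parametric_style_loss = ce.mean()

# Non-parametric path: every edge is optimized at once, so each term is
# first multiplied by its edge weight.
nonparametric_style_loss = (edge_weights * ce).mean()

print(parametric_style_loss, nonparametric_style_loss)
```

If that reading is right, the two losses differ by the reweighting even when the embeddings themselves are equally good, which is why I am unsure whether the `ce_loss` values are directly comparable.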
function is slightly different. Namely, the loss is additionally multiplied by the edge_weights:Do you think someone could provide an intuition for this difference and if it affects our ability to directly compare between parametric and non-parametric embeddings. Specifically, I am hoping to use the ce_loss as a proxy for comparing the performance of PUMAP vs. UMAP on my data.