lmcinnes / umap

Uniform Manifold Approximation and Projection
BSD 3-Clause "New" or "Revised" License

What do 'epochs_per_sample', 'epochs_of_next_sample' and 'epochs_per_negative_sample' mean in the `optimize_layout()` function? #165

Open jiadongdan opened 6 years ago

jiadongdan commented 6 years ago

Source code in the optimize_layout() function: [screenshot of the optimize_layout() source code]

According to this page, the gradient is: [screenshot of the gradient equation]

  1. Where is the term Vij in the source code?
  2. What do epochs_per_sample, epochs_of_next_sample and epochs_per_negative_sample mean in the optimize_layout() function?

Can anyone give some hints on this? Thanks!

lmcinnes commented 6 years ago

Since UMAP makes use of the sampling based approach employed by LargeVis, the goal is to sample edges proportionally to the weights v_ij. In practice this is done by drawing an edge every t epochs, where t is chosen according to the relative proportion of v_ij. This is the epochs_per_sample vector, telling you how many epochs should elapse between each sampling of a given edge. The epoch_of_next_sample simply keeps track of when to next sample each edge (in the obvious way). Finally, epochs_per_negative_sample plays a similar role for the negative sampling.

All of this is really just an optimization trick -- it goes a little faster doing it this way (since we are already doing the negative sampling, so we are in a sampling style regime) than the more obvious approach. I'm still working to find ways to make the more obvious thing "fast enough" and hope to eventually switch away, as the sampling is not ideal, for exactly the reason you ask this question: it is not obvious to a reader of the code what is going on.
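To make the bookkeeping concrete, here is a minimal Python sketch of the idea (a hypothetical helper in plain NumPy, not the actual numba implementation; in the real code epochs_per_negative_sample is, as I understand it, just epochs_per_sample divided by the negative sample rate):

```python
import numpy as np

def sampled_edges_per_epoch(weights, n_epochs):
    """Toy version of the epochs_per_sample / epoch_of_next_sample bookkeeping.

    weights : 1D array of edge weights v_ij (one entry per edge).
    Returns a list giving, for each epoch, the indices of the edges that
    would receive an attractive update in that epoch.
    """
    # The heaviest edge is sampled every epoch; an edge with half that
    # weight is sampled every second epoch, and so on.
    epochs_per_sample = weights.max() / weights
    epoch_of_next_sample = epochs_per_sample.copy()

    schedule = []
    for epoch in range(1, n_epochs + 1):
        due = epoch_of_next_sample <= epoch      # edges whose turn has come
        schedule.append(np.nonzero(due)[0])
        # Push the counter forward so each sampled edge fires again
        # in epochs_per_sample epochs' time.
        epoch_of_next_sample[due] += epochs_per_sample[due]
    return schedule

# Three edges with weights 1.0, 0.5 and 0.25: the first fires every epoch,
# the second every other epoch, the third every fourth epoch.
for epoch, edges in enumerate(sampled_edges_per_epoch(np.array([1.0, 0.5, 0.25]), 8), start=1):
    print(epoch, edges)
```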

jiadongdan commented 6 years ago

Thank you for your reply! This sampling technique is a bit confusing. My understanding is: points which are closer in high-dimensional space will get moved first. Here are my comments on "the goal is to sample edges proportionally to the weights v_ij":

The proportionality here is kind of tricky. It seems that the initial epochs_per_sample is 1/v_ij, and then you keep track of the schedule by adding that value to the previous one. However, what if we treated it this way:

Since v_ij is in the range (0, 1], for the element with indices (i, j): if v_ij == 1, then the pair changes position throughout all the epochs; if v_ij == 0.1, then it changes position only in the last 1/10 of all the epochs. It seems that in this way we would mobilize every point at least once. In your way, points that are too far away will never be moved from the initial layout if 1/v_ij > num_epochs.

If we follow this way of proportionality, will there be a performance difference?
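For concreteness, a toy sketch contrasting the two schedules (hypothetical helper names, not code from UMAP or LargeVis):

```python
import numpy as np

def spread_out_epochs(v, n_epochs):
    """UMAP-style: the edge fires roughly every 1/v epochs across the whole run."""
    fire, nxt, step = [], 1.0 / v, 1.0 / v
    for epoch in range(1, n_epochs + 1):
        if nxt <= epoch:
            fire.append(epoch)
            nxt += step
    return fire

def end_loaded_epochs(v, n_epochs):
    """Proposed alternative: fire every epoch, but only in the last v * n_epochs epochs."""
    start = int(np.ceil((1.0 - v) * n_epochs)) + 1
    return list(range(start, n_epochs + 1))

# v_ij = 0.1 over 20 epochs: both schemes give about 2 updates, but one
# spreads them out while the other bunches them at the very end.
print(spread_out_epochs(0.1, 20))    # -> [10, 20]
print(end_loaded_epochs(0.1, 20))    # -> [19, 20]
```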

I am not sure whether my understanding is correct. I checked the LargeVis paper, and it says little about the probabilistic sampling. To be honest, this part is the most confusing for me, and I hope there is some theoretical validation of this point. Thank you so much for your comments; I hope there can be some more clarification on edge sampling in the UMAP docs.

lmcinnes commented 6 years ago

Note that we truncate away anything with 1/(v_ij) > num_epochs; so yes, there is some loss of accuracy here. It seems to be very small in practice. And it is important to spread all the movement out -- if we move some points early and don't move other points until the very end we lose much of the benefit of stochastic gradient descent (it isn't properly stochastic anymore).
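A rough sketch of what that truncation looks like (loosely modelled on make_epochs_per_sample; treat the details as illustrative rather than the exact implementation):

```python
import numpy as np

def epochs_per_sample_with_truncation(weights, n_epochs):
    """Edges expected to fire less than once over the whole run get -1 ("never sampled")."""
    expected_samples = n_epochs * (weights / weights.max())
    result = -1.0 * np.ones_like(weights)
    keep = expected_samples >= 1.0           # only edges sampled at least once survive
    result[keep] = weights.max() / weights[keep]
    return result

# With 200 epochs, the weakest edge (relative weight 0.001) is truncated away.
print(epochs_per_sample_with_truncation(np.array([1.0, 0.1, 0.001]), 200))
# -> [  1.  10.  -1.]
```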

There isn't really any further documentation of the theory for this -- it is a fairly straightforward approximation of the weighted version, and it fits neatly with the negative sampling. I don't think there is much useful to say about it that isn't essentially just a recapitulation of the code itself. People interested in that level of detail are better off reading the code, as that is a very explicit and accurate description of what is actually happening. I have tried to ensure the code itself remains fairly readable.

jiadongdan commented 5 years ago

Hi lmcinnes, I recently read both the LargeVis paper and your paper, as well as your source code. In your paper, you mention a cost function that differs from the one in LargeVis. As the LargeVis source code is written in C++, it's a bit hard for me to follow how the authors conduct the negative sampling, so comparing the two is difficult. Would you mind explaining to me how your negative sampling differs from theirs? Because the cost functions differ, from gamma in LargeVis to (1 - Vij) in your cost function (I noticed you also have a gamma, i.e. the repulsive strength, I think), I guess there should be some difference. I am trying to digest the LargeVis source code... Hope you can help me, thank you.

Cheers!

lmcinnes commented 5 years ago

I believe UMAP differs from LargeVis in the distribution it draws negative samples from. In principle LargeVis uses the marginal distribution raised to the power 0.75, based solely on word2vec, while UMAP uses a flat distribution since the objective function derives a sampling distribution that is very close to that, so it is deemed a good enough approximation.
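A toy illustration of the two noise distributions as described above (not code from either library; the vertex degrees and helper names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

def largevis_style_negatives(degrees, n_samples):
    """Sample negative vertices proportionally to degree ** 0.75 (word2vec-style)."""
    p = degrees ** 0.75
    p = p / p.sum()
    return rng.choice(len(degrees), size=n_samples, p=p)

def umap_style_negatives(n_vertices, n_samples):
    """Sample negative vertices from a flat (uniform) distribution."""
    return rng.integers(0, n_vertices, size=n_samples)

degrees = np.array([10.0, 5.0, 1.0, 1.0])     # made-up vertex degrees
print(largevis_style_negatives(degrees, 8))   # high-degree vertices appear more often
print(umap_style_negatives(len(degrees), 8))  # every vertex equally likely
```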

jlmelville commented 5 years ago

For the record, the relevant parts of the LargeVis code are at: https://github.com/lferry007/LargeVis/blob/feb8121e8eb9652477f7f564903d189ee663796f/Linux/LargeVis.cpp#L554

You can see the weights being raised to the power of 0.75. Just below that, the neg_table is built from these values and is then sampled from during negative sampling: https://github.com/lferry007/LargeVis/blob/feb8121e8eb9652477f7f564903d189ee663796f/Linux/LargeVis.cpp#L600

neg_table is just a really big array (of length 1e8), filled with the vertex ids, each repeated in proportion to its value in weights.
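A small Python sketch of the same table-building idea (illustrative only; build_neg_table is a hypothetical name, not the actual C++ code, and a tiny table is used here instead of 1e8 entries):

```python
import numpy as np

def build_neg_table(weights, table_size=8):
    """Fill a flat table with vertex ids, each repeated in proportion to its weight.

    Negative samples are then drawn by picking uniform random indices into
    the table; frequent vertices occupy more slots and so are drawn more often.
    """
    probs = weights / weights.sum()
    counts = np.round(probs * table_size).astype(np.int64)
    return np.repeat(np.arange(len(weights)), counts)

table = build_neg_table(np.array([4.0, 2.0, 1.0, 1.0]))
print(table)                                        # -> [0 0 0 0 1 1 2 3]
rng = np.random.default_rng(0)
print(table[rng.integers(0, len(table), size=5)])   # five negative samples
```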

jiadongdan commented 5 years ago

@jlmelville, @lmcinnes thank you for pointing that out. It seems 0.75 is a weight influence factor. In fact, I have rewritten some of the UMAP code to make it faster, and I have two findings after trying it on my data.

  1. I noticed that the number of negative samples is determined dynamically on each iteration; it just fluctuates around the num_neg_samples we pass in. So I set it to a fixed value, and it seems that this does not harm the final result. In this way, we only need to keep track of epoch_of_next_sample. Maybe you can test that.

  2. Clip operations do have a strong influence on the results; they seem to make the clusters more compact. You scale your initial embedding to 10, so maybe you can try to find an optimal clip value at that scale (see the sketch after this list).
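For reference, a sketch of the kind of clip operation being discussed (the 4.0 bound is an assumption for illustration; check the current source for the actual value):

```python
def clip(val, bound=4.0):
    """Clamp a single gradient term to [-bound, bound].

    UMAP applies a clip of this kind inside the SGD loop; the 4.0 bound here
    is an assumption.  With the initial embedding scaled to roughly +/-10,
    a tighter bound limits how far any single update can move a point, which
    is consistent with the more compact clusters noted above.
    """
    return max(-bound, min(bound, val))

print([clip(g) for g in (-12.3, -1.0, 0.5, 7.0)])   # -> [-4.0, -1.0, 0.5, 4.0]
```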

By the way, I can contribute my code for quickly computing the graph in matrix form, i.e. fuzzy_simplicial_set.

Thank you all!

why94nb commented 4 years ago

@lmcinnes Thank you for the amazing job you have done! One question: in optimize_layout(), you are not actually optimizing the CE function, but another one that approximates the CE loss, is that correct? If so, why not directly use the approximate form as the loss function in the paper?

Thanks a lot.