Closed dkobak closed 1 year ago
To build on top of https://github.com/pavlin-policar/openTSNE/pull/220#discussion_r1017919909, I've run a quick experiment to compare convergence rates when setting the learning rate to N/exaggeration, and the results indicate that lr = N/exag may actually lead to better and faster convergence than simply using lr = N/ee_rate:
Visually, all four embeddings look pretty similar, and I wouldn't say ones are less converged than the others, but it seems like we can get the same KL divergence faster this way. What do you think?
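For reference, the two heuristics under comparison can be written down explicitly. This is a sketch with made-up function names, not openTSNE's actual code; the old rule is the `max(200, N/12)` convention discussed later in this thread:

```python
def lr_old(n_samples):
    # old "auto" rule: N/12 with a floor of 200, used in both phases
    return max(200, n_samples / 12)

def lr_proposed(n_samples, exaggeration=None):
    # proposed rule: N / exaggeration per phase; no exaggeration counts as 1
    if exaggeration is None:
        exaggeration = 1
    return n_samples / exaggeration

# Macosko-sized data (~44k cells): identical during early exaggeration,
# but the proposed rule takes a much larger step in the standard phase.
print(lr_old(44_000), lr_proposed(44_000, 12), lr_proposed(44_000))

# Iris-sized data: the old floor of 200 is far above N/12 = 12.5.
print(lr_old(150), lr_proposed(150, 12))
```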
Does `exag` in this plot mean early exaggeration? And late exaggeration is set to 1? Is this MNIST?

> Does `exag` in this plot mean early exaggeration?
Yes, the "regular" phase is run with exaggeration=1
> Is this MNIST?
No, this is my typical macosko example.
This is one example, so we'd definitely have to check it on several more, but this does indicate "that at least it wouldn't hurt". And if we were to implement N/exag, this would fit much more cleanly into the entire openTSNE architecture. I wouldn't mind being slightly inconsistent with FIt-SNE or scikit-learn in this respect, since the visualizations seem visually indistinguishable from one another.
I agree. I ran the same test on MNIST and observed faster convergence using the suggested approach.

Incidentally, I first did it wrong because I did not specify the correct momentum terms for the two `optimize()` calls. That made me realize that I am not aware of any reason why momentum should differ between the two stages. I tried using momentum=0.8 for both stages, and it seems to work better than the current 0.5 -> 0.8 scheme.
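To make the two momentum schedules concrete, here is a toy gradient-descent-with-momentum loop on a one-dimensional quadratic. This is purely illustrative: the objective, learning rate, and iteration counts are made up and say nothing about t-SNE itself.

```python
def descend(momentum_at, lr=0.01, n_iter=100, x0=5.0):
    # Minimise f(x) = x^2 with gradient descent plus momentum,
    # where momentum_at(i) gives the momentum at iteration i.
    x, update = x0, 0.0
    for i in range(n_iter):
        grad = 2.0 * x  # gradient of x^2
        update = momentum_at(i) * update - lr * grad
        x += update
    return abs(x)  # distance from the optimum at x = 0

# current scheme: 0.5 for the first half, then 0.8 (mimicking the EE switch)
switched = descend(lambda i: 0.5 if i < 50 else 0.8)
# proposed scheme: constant 0.8 throughout
constant = descend(lambda i: 0.8)
```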
Note that your criticism that `TSNE().fit()` and twice calling `embedding.optimize(...)` is not identical also applies to momentum differences, no?
> Note that your criticism that `TSNE().fit()` and twice calling `embedding.optimize(...)` is not identical also applies to momentum differences, no?
Yes, this is the same issue, and this has bothered me from the very start. So I'm very happy to see that using a single momentum seems to lead to faster convergence, as this would justify defaulting to 0.8 everywhere.
I'm not aware of any justification for this choice either; it came from the original LvdM implementation and I never bothered to question it.
I see similar behaviour on the macosko dataset:
It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.
Would be good to check this on a couple more datasets, perhaps also very small ones (Iris?), but overall I think it looks good.
> It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.
Yes, I agree.
> Would be good to check this on a couple more datasets, perhaps also very small ones (Iris?), but overall I think it looks good.
I'd also check a couple of big ones: the cao one, maybe the 10x mouse as well. It might also be interesting to see whether we actually need lr=200 on Iris. Maybe lr=N=150 would be better; the 200 now seems kind of arbitrary.
Iris:
> I'd also check a couple of big ones: the cao one, maybe the 10x mouse as well. It might also be interesting to see whether we actually need lr=200 on Iris. Maybe lr=N=150 would be better; the 200 now seems kind of arbitrary.
Very good point. The red line shows that turning off the learning rate "clipping" (I mean the clipping to 200) actually works very well.
That's great! I think we should test it for even smaller data sets, but this indicates that we can get rid of the 200 altogether.
Are you going to run it on something with sample size over 1 million? Sounds like you have everything set up for these experiments. But if you want, I can run something as well.
Yes, sure, I'll find a few more data sets and run them. If everything goes well, we'll change the defaults to momentum=0.8 and lr=N/exaggeration, and this will solve all the issues outlined above.
This may be worth bumping to 0.7!
Have you had a chance to run it on some other datasets? Otherwise I would give it a try on something large, I am curious :)
Hey Dmitry, no, unfortunately, I haven't had time yet. It's a busy semester for teaching :) If you have any benchmarks you'd like to run, I'd be happy to see the results.
Just ran it in an identical way for Iris, MNIST, and the n=1.3 million dataset from 10x. I used uniform affinities with k=10 in all cases to speed things up.

Old: current defaults.
New: learning rate N/12 during early exaggeration, followed by learning rate N; momentum always 0.8; learning rates below 200 are allowed.
I think everything is very consistent:
I've also tried this on two other data sets: shekhar (20k) and cao (2mln):
shekhar (20k):
cao (2mln):
In both cases, using momentum=0.8 and learning_rate=N/exaggeration works better. Along with the other examples you provided, I feel this is sufficient to change the defaults.
Allowing the learning rate to go below 200 for small datasets prevents fluctuations in the loss.
I don't understand this entirely. So, can we use `learning_rate=N/exaggeration` in all cases? Or do we keep it as it is right now and use `learning_rate=max(N/exaggeration, 200)`?
> I don't understand this entirely. So, can we use `learning_rate=N/exaggeration` in all cases? Or do we keep it as it is right now and use `learning_rate=max(N/exaggeration, 200)`?
I now think that we should use `learning_rate=N/exaggeration` in all cases. For Iris (N=150), this would mean learning rate 150/12 = 12.5 during the early exaggeration phase. Currently we use learning rate 200 during that phase, which is too high. It does not lead to divergence (not sure why; maybe due to some gradient or step size clipping?) but does lead to an oscillating loss, clearly suggesting that something is not well with the gradient descent. Learning rate 12.5 seems much more stable.
Yes, I agree. I tried it myself and there are big oscillations. E.g. subsampling Iris to 50 data points also leads to less oscillation with `learning_rate=N/exaggeration`.
I think the best course of action is to add a `learning_rate="auto"` option, make that the default, and then handle it in `_handle_nice_params`, as I've written in the original code review.
Do you want me to edit this PR, or do you prefer to make a new one?
> I think the best course of action is to add a `learning_rate="auto"` option, make that the default, and then handle it in `_handle_nice_params`, as I've written in the original code review.
Is it okay to make `_handle_nice_params` take the exaggeration value as input? Currently it does not.
I think adding it to this PR is completely fine.
> Is it okay to make `_handle_nice_params` take the exaggeration value as input? Currently it does not.
I think it does. The exaggeration factor should come in through `.optimize`, and should be captured in the `**gradient_descent_params`. Then, this should be passed through here.
But the exaggeration factor does not really feel like a "gradient descent parameter"... So I was reluctant to add it into the `gradient_descent_params`. What about passing it into `_handle_nice_params()` as an additional separate input parameter? Like this:

def _handle_nice_params(embedding: np.ndarray, exaggeration: float, optim_params: dict) -> None:
Edit: sorry, misread your comment. If it is already passed in as you said, then there is of course no need to change it.
Edit2: but actually, looking at how `gradient_descent_params` is created, it seems that exaggeration is not included:
https://github.com/pavlin-policar/openTSNE/blob/46d65aec8e299f1152511004e8efa34c823510af/openTSNE/tsne.py#L1399
My understanding is that it is already included in `_handle_nice_params`. Indeed, if I run a simple example and print the `optim_params` in `_handle_nice_params`, I get
{'learning_rate': 'auto', 'momentum': 0.5, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': 12, 'n_iter': 25}
for the EE phase, and
{'learning_rate': 'auto', 'momentum': 0.8, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': None, 'n_iter': 50}
for the standard phase. Importantly, `exaggeration` and the `learning_rate` are already among them. So I would imagine something like this would be totally fine:
learning_rate = optim_params["learning_rate"]
if learning_rate == "auto":
    exaggeration = optim_params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    learning_rate = n_samples / exaggeration
optim_params["learning_rate"] = learning_rate
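For concreteness, that resolution logic can be exercised as a standalone function. The name `resolve_learning_rate` and the example dicts below are made up for illustration; in openTSNE this would live inside `_handle_nice_params`:

```python
def resolve_learning_rate(optim_params, n_samples):
    # Resolve learning_rate="auto" to N / exaggeration in place,
    # treating exaggeration=None (the standard phase) as 1.
    learning_rate = optim_params["learning_rate"]
    if learning_rate == "auto":
        exaggeration = optim_params.get("exaggeration", None)
        if exaggeration is None:
            exaggeration = 1
        learning_rate = n_samples / exaggeration
    optim_params["learning_rate"] = learning_rate

# hypothetical parameter dicts for the two phases, MNIST-sized data
ee_phase = {"learning_rate": "auto", "exaggeration": 12, "momentum": 0.8}
std_phase = {"learning_rate": "auto", "exaggeration": None, "momentum": 0.8}
resolve_learning_rate(ee_phase, n_samples=70_000)   # -> N/12
resolve_learning_rate(std_phase, n_samples=70_000)  # -> N
```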
I see. Still not quite sure where it gets added to the dictionary, but it does not matter now.
I did the changes and added a test that checks that running `optimize()` twice (with exaggeration 12 and then without) produces the same result as running `fit()` with default params.
I think this is fine now. I'm glad we found a way to simplify the API and speed up convergence at the same time :)
Implements #218.
First, `early_exaggeration="auto"` is now set to `max(12, exaggeration)`.

Second, the learning rate. We have various functions that currently take `learning_rate="auto"` and set it to `max(200, N/12)`. I did not change this, because those functions usually do not know what the early exaggeration was. So I kept it as is. I only changed the behaviour of the base class: there `learning_rate="auto"` is now set to `max(200, N/early_exaggeration)`.

This works as intended:
(Note that the learning rate is currently not printed by `repr(self)` because it's kept as "auto" at construction time and only set later. That's also how we had it before.)
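The two "auto" rules described above can be sketched as plain functions (the function names are made up; only the formulas come from the PR description):

```python
def auto_early_exaggeration(exaggeration):
    # early_exaggeration="auto" -> max(12, exaggeration)
    return max(12, exaggeration)

def auto_learning_rate(n_samples, early_exaggeration):
    # base-class learning_rate="auto" -> max(200, N / early_exaggeration)
    return max(200, n_samples / early_exaggeration)

print(auto_early_exaggeration(4))      # -> 12
print(auto_early_exaggeration(20))     # -> 20
print(auto_learning_rate(70_000, 12))  # MNIST-sized: N/12
print(auto_learning_rate(150, 12))     # -> 200 (Iris-sized: floor kicks in)
```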