Closed dkobak closed 1 year ago
To build on top of https://github.com/pavlin-policar/openTSNE/pull/220#discussion_r1017919909, I've run a quick experiment to compare convergence rates when setting the learning rate to N/exaggeration, and the results indicate that lr = N/exag may actually lead to better and faster convergence than simply using lr = N/ee_rate:
Visually, all four embeddings look pretty similar, and I wouldn't say ones are less converged than the others, but it seems like we can get the same KL divergence faster this way. What do you think?
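For reference, the two heuristics under comparison can be written down explicitly. This is a sketch with made-up function names, not openTSNE's actual code; the old rule is the `max(200, N/12)` convention discussed later in this thread:

```python
def lr_old(n_samples):
    # old "auto" rule: N/12 with a floor of 200, used in both phases
    return max(200, n_samples / 12)

def lr_proposed(n_samples, exaggeration=None):
    # proposed rule: N / exaggeration per phase; no exaggeration counts as 1
    if exaggeration is None:
        exaggeration = 1
    return n_samples / exaggeration

# Macosko-sized data (~44k cells): identical during early exaggeration,
# but the proposed rule takes a much larger step in the standard phase.
print(lr_old(44_000), lr_proposed(44_000, 12), lr_proposed(44_000))

# Iris-sized data: the old floor of 200 is far above N/12 = 12.5.
print(lr_old(150), lr_proposed(150, 12))
```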
Does `exag` in this plot mean early exaggeration? And late exaggeration is set to 1? Is this MNIST?

> Does `exag` in this plot mean early exaggeration?
Yes, the "regular" phase is run with exaggeration=1
> Is this MNIST?
No, this is my typical macosko example.
This is one example, so we'd definitely have to check it on several more, but this does indicate "that at least it wouldn't hurt". And if we were to implement N/exag, this would fit much more cleanly into the entire openTSNE architecture. I wouldn't mind being slightly inconsistent with FIt-SNE or scikit-learn in this respect, since the visualizations seem visually indistinguishable from one another.
I agree. I ran the same test on MNIST and observed faster convergence using the suggested approach.

Incidentally, I first did it wrong because I did not specify the correct momentum terms for the two `optimize()` calls. That made me realize that I am not aware of any reason why momentum should differ between the two stages. I tried using momentum=0.8 for both stages, and it seems to work better than the current 0.5 -> 0.8 scheme.
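To make the two momentum schedules concrete, here is a toy gradient-descent-with-momentum loop on a one-dimensional quadratic. This is purely illustrative: the objective, learning rate, and iteration counts are made up and say nothing about t-SNE itself.

```python
def descend(momentum_at, lr=0.01, n_iter=100, x0=5.0):
    # Minimise f(x) = x^2 with gradient descent plus momentum,
    # where momentum_at(i) gives the momentum at iteration i.
    x, update = x0, 0.0
    for i in range(n_iter):
        grad = 2.0 * x  # gradient of x^2
        update = momentum_at(i) * update - lr * grad
        x += update
    return abs(x)  # distance from the optimum at x = 0

# current scheme: 0.5 for the first half, then 0.8 (mimicking the EE switch)
switched = descend(lambda i: 0.5 if i < 50 else 0.8)
# proposed scheme: constant 0.8 throughout
constant = descend(lambda i: 0.8)
```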
Note that your criticism that `TSNE().fit()` and twice calling `embedding.optimize(...)` is not identical also applies to momentum differences, no?
> Note that your criticism that `TSNE().fit()` and twice calling `embedding.optimize(...)` is not identical also applies to momentum differences, no?
Yes, this is the same issue, and this has bothered me from the very start. So I'm very happy to see that using a single momentum seems to lead to faster convergence, as this would justify defaulting to 0.8 everywhere.
I'm not aware of any justification for this choice either; it came from the original LvdM implementation and I never bothered to question it.
I see similar behaviour on the macosko dataset:
It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.
Would be good to check this on a couple more datasets, perhaps also very small ones (Iris?), but overall I think it looks good.
> It seems that 250 iterations may be too many with this momentum setting, but maybe let's not touch the n_iter defaults for now.
Yes, I agree.
> Would be good to check this on a couple more datasets, perhaps also very small ones (Iris?), but overall I think it looks good.
I'd also check a couple of big ones: the cao one, maybe the 10x mouse as well. It might also be interesting to see whether we actually need lr=200 on Iris. Maybe lr=N=150 would be better; the 200 now seems kind of arbitrary.
Iris:
> I'd also check a couple of big ones: the cao one, maybe the 10x mouse as well. It might also be interesting to see whether we actually need lr=200 on Iris. Maybe lr=N=150 would be better; the 200 now seems kind of arbitrary.
Very good point. The red line shows that turning off the learning rate "clipping" (I mean the clipping to 200) actually works very well.
That's great! I think we should test it for even smaller data sets, but this indicates that we can get rid of the 200 altogether.
Are you going to run it on something with sample size over 1 million? Sounds like you have everything set up for these experiments. But if you want, I can run something as well.
Yes, sure, I'll find a few more data sets and run them. If everything goes well, we'll change the defaults to momentum=0.8 and lr=N/exaggeration, and this will solve all the issues outlined above.
This may be worth bumping to 0.7!
Have you had a chance to run it on some other datasets? Otherwise I would give it a try on something large, I am curious :)
Hey Dmitry, no, unfortunately, I haven't had time yet. It's a busy semester for teaching :) If you have any benchmarks you'd like to run, I'd be happy to see the results.
Just ran it in an identical way for Iris, MNIST, and the n=1.3 million dataset from 10x. I used uniform affinities with k=10 in all cases to speed things up.

Old: current defaults.
New: learning rate N/12 during early exaggeration, followed by learning rate N; momentum always 0.8; learning rates below 200 are allowed.
I think everything is very consistent:
I've also tried this on two other data sets: shekhar (20k) and cao (2mln):
shekhar (20k):
cao (2mln):
In both cases, using momentum=0.8 and learning_rate=N/exaggeration works better. Along with the other examples you provided, I feel this is sufficient to change the defaults.
Allowing the learning rate to go below 200 for small datasets prevents fluctuations in the loss.
I don't understand this entirely. So, can we use `learning_rate=N/exaggeration` in all cases? Or do we keep it as it is right now and use `learning_rate=max(N/exaggeration, 200)`?
> I don't understand this entirely. So, can we use `learning_rate=N/exaggeration` in all cases? Or do we keep it as it is right now and use `learning_rate=max(N/exaggeration, 200)`?
I now think that we should use `learning_rate=N/exaggeration` in all cases. For Iris (N=150), this would mean learning rate 150/12 = 12.5 during the early exaggeration phase. Currently we use learning rate 200 during that phase, which is too high. It does not lead to divergence (not sure why; maybe due to some gradient or step size clipping?) but does lead to an oscillating loss, clearly suggesting that something is not well with the gradient descent. Learning rate 12.5 seems much more stable.
Yes, I agree. I tried it myself and there are big oscillations. E.g. subsampling Iris to 50 data points also leads to less oscillation with `learning_rate=N/exaggeration`.
I think the best course of action is to add a `learning_rate="auto"` option, make that the default, and then handle it in `_handle_nice_params`, as I've written in the original code review.
Do you want me to edit this PR, or do you prefer to make a new one?
> I think the best course of action is to add a `learning_rate="auto"` option, make that the default, and then handle it in `_handle_nice_params`, as I've written in the original code review.
Is it okay to make `_handle_nice_params` take the exaggeration value as input? Currently it does not.
I think adding it to this PR is completely fine.
> Is it okay to make `_handle_nice_params` take the exaggeration value as input? Currently it does not.
I think it does. The exaggeration factor should come in through `.optimize`, and should be captured in the `**gradient_descent_params`. Then, this should be passed through here.
But the exaggeration factor does not really feel like a "gradient descent parameter"... So I was reluctant to add it into the `gradient_descent_params`. What about passing it into `_handle_nice_params()` as an additional separate input parameter? Like this:

def _handle_nice_params(embedding: np.ndarray, exaggeration: float, optim_params: dict) -> None:
Edit: sorry, misread your comment. If it is already passed in as you said, then there is of course no need to change it.
Edit2: but actually, looking at how `gradient_descent_params` is created, it seems that exaggeration is not included:
https://github.com/pavlin-policar/openTSNE/blob/46d65aec8e299f1152511004e8efa34c823510af/openTSNE/tsne.py#L1399
My understanding is that it is already included in `_handle_nice_params`. Indeed, if I run a simple example and print the `optim_params` in `_handle_nice_params`, I get
{'learning_rate': 'auto', 'momentum': 0.5, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': 12, 'n_iter': 25}
for the EE phase, and
{'learning_rate': 'auto', 'momentum': 0.8, 'theta': 0.5, 'max_grad_norm': None, 'max_step_norm': 5,
'n_jobs': 1, 'verbose': False, 'callbacks': None, 'callbacks_every_iters': 50,
'negative_gradient_method': 'bh', 'n_interpolation_points': 3, 'min_num_intervals': 50,
'ints_in_interval': 1, 'dof': 1, 'exaggeration': None, 'n_iter': 50}
for the standard phase. Importantly, `exaggeration` and the `learning_rate` are already among them. So I would imagine something like this would be totally fine:
learning_rate = optim_params["learning_rate"]
if learning_rate == "auto":
    exaggeration = optim_params.get("exaggeration", None)
    if exaggeration is None:
        exaggeration = 1
    learning_rate = n_samples / exaggeration
optim_params["learning_rate"] = learning_rate
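For concreteness, that resolution logic can be exercised as a standalone function. The name `resolve_learning_rate` and the example dicts below are made up for illustration; in openTSNE this would live inside `_handle_nice_params`:

```python
def resolve_learning_rate(optim_params, n_samples):
    # Resolve learning_rate="auto" to N / exaggeration in place,
    # treating exaggeration=None (the standard phase) as 1.
    learning_rate = optim_params["learning_rate"]
    if learning_rate == "auto":
        exaggeration = optim_params.get("exaggeration", None)
        if exaggeration is None:
            exaggeration = 1
        learning_rate = n_samples / exaggeration
    optim_params["learning_rate"] = learning_rate

# hypothetical parameter dicts for the two phases, MNIST-sized data
ee_phase = {"learning_rate": "auto", "exaggeration": 12, "momentum": 0.8}
std_phase = {"learning_rate": "auto", "exaggeration": None, "momentum": 0.8}
resolve_learning_rate(ee_phase, n_samples=70_000)   # -> N/12
resolve_learning_rate(std_phase, n_samples=70_000)  # -> N
```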
I see. Still not quite sure where it gets added to the dictionary, but it does not matter now.
I did the changes and added a test that checks that running `optimize()` twice (with exaggeration 12 and then without) produces the same result as running `fit()` with default params.
I think this is fine now. I'm glad we found a way to simplify the API and speed up convergence at the same time :)
Implements #218.
First, `early_exaggeration="auto"` is now set to `max(12, exaggeration)`.

Second, the learning rate. We have various functions that currently take `learning_rate="auto"` and set it to `max(200, N/12)`. I did not change this, because those functions usually do not know what the early exaggeration was. So I kept it as is. I only changed the behaviour of the base class: there `learning_rate="auto"` is now set to `max(200, N/early_exaggeration)`.

This works as intended:
(Note that the learning rate is currently not printed by `repr(self)` because it's kept as "auto" at construction time and only set later. That's also how we had it before.)
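The two "auto" rules described above can be sketched as plain functions (the function names are made up; only the formulas come from the PR description):

```python
def auto_early_exaggeration(exaggeration):
    # early_exaggeration="auto" -> max(12, exaggeration)
    return max(12, exaggeration)

def auto_learning_rate(n_samples, early_exaggeration):
    # base-class learning_rate="auto" -> max(200, N / early_exaggeration)
    return max(200, n_samples / early_exaggeration)

print(auto_early_exaggeration(4))      # -> 12
print(auto_early_exaggeration(20))     # -> 20
print(auto_learning_rate(70_000, 12))  # MNIST-sized: N/12
print(auto_learning_rate(150, 12))     # -> 200 (Iris-sized: floor kicks in)
```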