pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

Why does transform() have exaggeration=2 by default? #62

Closed dkobak closed 5 years ago

dkobak commented 5 years ago

The parameters of the transform function are

def transform(self, X, perplexity=5, initialization="median", k=25,
              learning_rate=100, n_iter=100, exaggeration=2, momentum=0,
              max_grad_norm=0.05):

so it has exaggeration=2 by default. Why? This looks unintuitive to me: exaggeration is a slightly "weird" trick that can arguably be very useful for huge data sets, but I would expect the out-of-sample embedding to work just fine without it. Am I missing something?

I am also curious why momentum is set to 0 (unlike in normal tSNE optimization), but here I don't have any intuition for what it should be.

Another question is: will this function work with n_iter=0 if one just wants to get an embedding using medians of k nearest neighbours? That would be handy. Or is there another way to get this? Perhaps from prepare_partial?

And lastly, when transform() is applied to points from a very different data set (imagine positioning Smart-seq2 cells onto a 10x Chromium reference), I prefer to use correlation distances because I suspect Euclidean distances might be completely off (even when the original tSNE was done using Euclidean distances). I think openTSNE currently does not support this, right? Did you have any problems with that? One could perhaps allow transform() to take a metric argument (is correlation among the supported metrics, btw?). The downside is that if this metric is different from the metric used to prepare the embedding, then the nearest neighbours object will have to be recomputed, so it will suddenly become much slower. Let me know if I should post it as a separate issue.

pavlin-policar commented 5 years ago

So the rationale for using a higher exaggeration here is the same as why we use the "early exaggeration phase". If the initialization for the transformed points is not ideal, then having a slightly higher exaggeration would allow the new points to move more easily to their "correct" locations. But instead of using exaggeration 12, I found 2 to be completely fine. I figured that I could cut down the number of optimization iterations by a little bit, making transform faster.

When we do median initialization, the new point positions are basically on top of existing points, making the repulsive forces quite large. I found that using the interpolation-based approximation would sometimes shoot points way off, 1) making it slow because the interpolation grid would become huge, and 2) taking these points a lot of iterations to return to the main mass of points. This is why I set momentum to 0, to dampen the shooting off in case it does happen, and also why I added gradient clipping in transform.

I admit this was based mainly on intuition and like one hour of tinkering around, so I am by no means dead-set on these defaults.

will this function work with n_iter=0?

Yes, it works. transform calls prepare_partial under the hood, so you can also call that directly. This returns a PartialTSNEEmbedding instance, which you can then optimize if you want.
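For instance, something along these lines (a rough sketch, not a recipe; x_train and x_test are placeholder arrays, and the parameter values just mirror the transform defaults above):

import numpy as np
import openTSNE

x_train = np.random.randn(1000, 50)    # placeholder reference data
x_test = np.random.randn(100, 50)      # placeholder new data

embedding = openTSNE.TSNE().fit(x_train)

# Median-of-k-nearest-neighbours placement only (what n_iter=0 would give you)
partial = embedding.prepare_partial(x_test, initialization="median", k=25)

# Optionally refine the placement afterwards
partial = partial.optimize(n_iter=100, learning_rate=100, momentum=0)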

As for your last question, I usually use cosine distances when dealing with sc data; it acts as some kind of "rank" normalization, and I find it usually gives more well-defined clusters than Euclidean distance. Actually, pynndescent is quite good in this regard and it supports a number of distance metrics (see here), including correlation.

For your specific example, correlation would probably be better, since full-length sequencing data tends to be quite different from UMI-based methods. That would be a cool thing to test out, actually. But typically when I do things, I usually do PCA first, then run t-SNE on that with cosine distance. How exactly does your case work? Do you run t-SNE on the entire data set? Or maybe do gene filtering? Or do you compute PCA on e.g. the smart-seq2 data and then project the 10x data onto that basis (I imagine you might fail to capture some information that way?).

And yes, you're right. Currently, we don't support changing the distance metric between the initial embedding and the "transformed" points in that way, because it could be very slow. I'd have to think about it, but I may be reluctant to just add a metric parameter to transform, because that would make it very easy to do this; and if something is very easy to do, people will do it more, and then I might get people constantly complaining that this is very slow. And also, I'm not entirely sure why this would make sense.

You can actually achieve this by manually recomputing the nearest neighbors with the new distance metric and then reassigning the embedding.affinities attribute. Like this:

import openTSNE
import openTSNE.affinity

tsne = openTSNE.TSNE(metric="euclidean")
embedding = tsne.fit(x_train)

# Set the new metric
embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train, metric="correlation")

# Do transform with the new metric
new_embedding = embedding.transform(x_new)

That's not too bad, but it does require being familiar with some of the internals. Maybe putting this somewhere as an example of more advanced use cases would suffice? I feel like this would rarely come up, and if it does, I'd expect the person to really know what they're doing.

dkobak commented 5 years ago

Thanks for the reply! There are two separate issues here.

First, I find exaggeration=2 confusing. I think it's okay to use early exaggeration for transform() and I also think it's fine to use different optimization parameter settings for transform() compared to standard t-SNE (learning_rate=100, n_iter=100, momentum=0, max_grad_norm=0.05). However, I would expect the final positioning to correspond to the t-SNE loss, which does not include exaggeration. So I would expect the early exaggeration to be turned off at some point, unless there is a good reason not to.

So my suggestion would be to have something like exaggeration=None, early_exaggeration=2, early_exaggeration_iter=50 as default parameters to transform().
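(To make this concrete — assuming prepare_partial and optimize accept these gradient descent parameters, and reusing embedding and x_test from the sketch above — the schedule I have in mind could be emulated by hand:)

partial = embedding.prepare_partial(x_test, initialization="median", k=25)

# "early exaggeration" phase for the new points...
partial = partial.optimize(n_iter=50, exaggeration=2, learning_rate=100, momentum=0)

# ...followed by the plain t-SNE loss, without exaggeration
partial = partial.optimize(n_iter=50, learning_rate=100, momentum=0)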

But typically when I do things, I usually do PCA first, then run t-SNE on that with cosine distance. How exactly does your case work? Do you run t-SNE on the entire data set? Or maybe do gene filtering? Or do you compute PCA on e.g. the smart-seq2 data then project the 10x data onto that basis (I imagine you might fail to capture some information that way?).

For concreteness, let's say I have 10x reference with n=100k cells and n=100 smart-seq2 points to map on top of that. For the 10x data I would usually do feature selection to 1000 genes, then PCA down to 50 components, then t-SNE with Euclidean distance (I don't mind cosine distance, but I haven't experimented with it myself, so I'm just going with Euclidean by default). Now for the smart-seq cells, I select the same 1000 genes and use correlation distance to find k=10 neighbours in the 10x data and use the median location to position the point. That's the procedure I describe in my preprint. (I don't use 10x reference in the preprint, but I use it in some parallel work and it works fine.)
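In code, the positioning step is roughly this (a minimal sketch with placeholder arrays; both matrices restricted to the same 1000 selected genes):

import numpy as np
from scipy.spatial.distance import cdist

# Placeholder data: a 10x reference and a handful of smart-seq2 cells
ref_genes = np.random.rand(5000, 1000)   # reference expression (log-counts)
ref_embed = np.random.randn(5000, 2)     # its t-SNE embedding
new_genes = np.random.rand(100, 1000)    # cells to be positioned

D = cdist(new_genes, ref_genes, metric="correlation")  # correlation distances to all reference cells
nn = np.argsort(D, axis=1)[:, :10]                     # indices of the k=10 nearest reference cells
new_positions = np.median(ref_embed[nn], axis=1)       # median of their 2D locations, shape (100, 2)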

To me it makes sense: I don't necessarily want to project the smart-seq points onto the reference PC space because I have no idea what the batch effect will look like... I have to admit that I did not really experiment with other ways of doing it, but I found the correlation distance to work reasonably well in this situation.

Now for the revision I wanted to use openTSNE to compare what happens if I let it optimize the location of points... Now thinking about it, this would actually mean to change the reference X itself, not only the metric... Because the tSNE on the original data was computed using 100000x50 matrix (after PCA) but here for the transform I want to use 100000x1000 matrix before the PCA, and correlation distance on that...

How would you do it instead?

pavlin-policar commented 5 years ago

So my suggestion would be to have something like exaggeration=None, early_exaggeration=2, early_exaggeration_iter=50 as default parameters to transform()

I think I actually had it like that at some point, but then changed it to keep things simpler. I'd have nothing against that. These parameters were chosen by running it on a few examples and seeing what worked and what didn't. It wasn't very systematic, so I'm open to changes. I wasn't entirely sure how to evaluate whether it was doing "well" (how do you even define "well"?), so I just looked at the plots and went with what seemed reasonable.

I'll do a little bit of tinkering, but I think just changing the exaggeration back to 1 and running it for the full 100 iterations should be fine. The reason being I want to avoid too many parameters. t-SNE already has a very large number of parameters, and cutting any of them out makes it a bit simpler to use. If you've done (or plan to do) any systematic testing, that would be even better.

... Now for the smart-seq cells, I select the same 1000 genes and use correlation distance to find k=10 neighbours in the 10x data and use the median location to position the point. ...

Oh yeah, I remember now. That probably makes the most sense of all the approaches I've tried. I've found Euclidean distance is useless when dealing with batch effects, so I never use that. Would it be reasonable to use correlation distance also for the initial embedding? Does PCA interfere with this?

Now for the revision I wanted to use openTSNE to compare what happens if I let it optimize the location of points... Now thinking about it, this would actually mean to change the reference X itself, not only the metric... Because the tSNE on the original data was computed using 100000x50 matrix (after PCA) but here for the transform I want to use 100000x1000 matrix before the PCA, and correlation distance on that...

This should still be possible with openTSNE, using almost the exact same snippet as before:

tsne = openTSNE.TSNE(metric="euclidean")
embedding = tsne.fit(x_train_pca)

# Recompute the affinities on the gene-level matrix with the new metric
embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train_genes, metric="correlation")

# Do transform with the new metric
new_embedding = embedding.transform(x_new_genes)

If this turns out to work well, then maybe adding a metric parameter to transform wouldn't be a bad idea at all.

dkobak commented 5 years ago

I'll do a little bit of tinkering, but I think just changing the exaggeration back to 1 and running it for the full 100 iterations should be fine. The reason being I want to avoid too many parameters.

That's fine, but I personally think that having early exaggeration in transform() would be okay and rather consistent with the standard t-SNE optimization. In fact, not having early exaggeration might be seen as an inconsistent choice.

Would it be reasonable to use correlation distance also for the initial embedding? Does PCA interfere with this?

I've never tried, but I would guess that yes, one could use correlation distance for the initial embedding. I don't see how one could combine it with PCA though: correlation is supposed to be across genes, not across PCs, and definitely not across only 50 PCs. So with correlation distance, I think one would need to use the 1000-dimensional (or so) gene space.

This should still be possible with openTSNE, using almost the exact same snippet as before:

Thanks! The snippet makes sense. I want to try it some time later this week or maybe next week, and will get back to you. Then we can discuss further.

dkobak commented 5 years ago

Apologies, I haven't gotten around to trying this out yet, but it's high on my to-do list :-)

pavlin-policar commented 5 years ago

No worries! I've been playing around with this a little bit, but it seems quite finicky.

dkobak commented 5 years ago

By the way, we have recently submitted the revision of my tSNE paper (https://www.biorxiv.org/content/10.1101/453449v2). I am still using FIt-SNE for all the analysis, but I explicitly mention openTSNE as an alternative. You might want to check if you think it's mentioned appropriately (suggestions welcome). Also, I am thinking of maybe adding some openTSNE examples to the demo notebook at https://github.com/berenslab/rna-seq-tsne/ -- what do you think?

As a side note, the manuscript has changed quite a bit in response to the critical comments we received. I had to streamline and "unify" the procedure, so now there is a specific choice of parameters etc. that we are suggesting (at least as a rule of thumb). Also, the visualization of the Macosko data set now looks a bit different (and unfortunately less striking), mostly because I realized that the "heart" shape was an artifact of my normalization procedure: I was normalizing to counts per million, and this does not make much sense for UMI data. I am now normalizing to counts per median library size (which for UMI datasets is a lot less than 1 million).
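(In code, that normalization is just something like this; counts is a placeholder for the cell-by-gene UMI matrix:)

import numpy as np

counts = np.random.poisson(0.1, size=(1000, 2000)).astype(float)  # placeholder UMI matrix

lib_size = counts.sum(axis=1, keepdims=True)        # per-cell library size
lib_size[lib_size == 0] = 1                         # guard against empty cells
norm = counts / lib_size * np.median(lib_size)      # counts per median library size
log_norm = np.log1p(norm)                           # followed by the usual log(1 + x)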

As always, any comments on the preprint are very welcome, especially while we can still improve it.

pavlin-policar commented 5 years ago

Oh, cool, I'll definitely give it a read! I glanced at it a little bit and the figures are really nice. Thanks for mentioning openTSNE - it should be fine for now. We'll be pushing out a short report on openTSNE soon, but I don't know how long that will take.

Also, I am thinking of maybe adding some openTSNE examples to the demo notebook

That would definitely be welcome.

I noticed that you recommend a learning rate n/12 for larger data sets. That seems high for data sets with millions of points, and I've had some issues with high learning rates where some points shot way off. I need to look into this...

Also, the visualization of the Macosko data set now looks a bit different

I had noticed that as well, but CPM is pretty standard, and normalizing to median library size when working with multiple data sets might not work.

I also noticed that you have three evaluation metrics for embeddings: that's neat. The three numbers each describe one aspect, but it might be easier to reason about a single number. KNC is easy to compute if we have the class labels. I suppose if there are no labels, you'd do some clustering to get them?

dkobak commented 5 years ago

I noticed that you recommend a learning rate n/12 for larger data sets. That seems high for data sets with millions of points, and I've had some issues with high learning rates where some points shot way off. I need to look into this...

I never had any problems with n/12, including the Cao dataset with n=2 million. The recommendation is taken from the Belkina et al. preprint, and they have n up to 20 million. It always seems to work fine.

I had noticed that as well, but CPM is pretty standard

Not really. It's standard for full length sequencing, but most UMI papers that I see normalize either to 10,000 or to median. I think median normalization has become pretty standard.

KNC is easy to compute if we have the class labels. I suppose if there are no labels, you'd do some clustering to get them?

Yeah... It's clearly not perfect. I plan to work on this in the future. These particular metrics were a rather quick fix to please the reviewers :-)

pavlin-policar commented 5 years ago

Not really. It's standard for full length sequencing, but most UMI papers that I see normalize either to 10,000 or to median.

Yeah, but what would you use to compare a full-length and a UMI data set?

Yeah... It's clearly not perfect. I plan to work on this in the future.

Following this line of thought, we're currently using Moran's I correlation to measure how "good" an embedding is. It's useful because we can also use it to evaluate the "transformed" data points. It has a lot of problems on its own, but it does tell us how well-clustered points of the same class are and it does assign a single numeric value to each embedding.
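Roughly, the computation looks like this (a minimal sketch over a kNN graph of the embedding; the function name and defaults are just illustrative, not our exact code):

import numpy as np
from sklearn.neighbors import kneighbors_graph

def morans_i(embedding, values, k=10):
    """Moran's I of `values` over a kNN graph built on the embedding coordinates."""
    W = kneighbors_graph(embedding, n_neighbors=k, mode="connectivity")
    z = values - values.mean()
    num = z @ (W @ z)                   # sum_ij w_ij * z_i * z_j
    den = (z ** 2).sum()
    return len(values) / W.sum() * num / den

# `values` could be e.g. the expression of a marker gene, or a 0/1 class indicator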

dkobak commented 5 years ago

Yeah, but what would you use to compare a full-length and a UMI data set?

In what sense "to compare"? Anyway, I'd probably normalize them separately, each to the median within the corresponding data set.

we're currently using Moran's I correlation to measure how "good" an embedding is

Interesting! I've never heard about it. Let me know if/when you have some write-ups about it.

pavlin-policar commented 5 years ago

In what sense "to compare"?

I'm working on an example where I've got a data set of full-length reads and a UMI count data set. Now if I want to make a t-SNE embedding of both these data sets together i.e. just concatenate them, I need to get them on a similar scale, so I use CPM. Or would you prefer median normalization here?

Interesting! I've never heard about it.

The idea actually came from Monocle 3 where they use it for detecting DE genes in developmental trajectories. It's actually really neat! But instead of doing a statistical test, we just compute the autocorrelation coefficient.

dkobak commented 5 years ago

Now if I want to make a t-SNE embedding of both these data sets together i.e. just concatenate them

I don't really have a lot of experience with that. But I'd expect to see a very strong batch effect, independent of what kind of normalization one does.

dkobak commented 5 years ago

I tried it out and it worked. Thanks! I managed to confirm that

pos1 = embedding.transform(logCountsCadwell, k=10, n_iter=0)
pos2 = embedding.transform(logCountsCadwell, k=10)

yield very similar results for my data :-)

It wasn't super practical though. I used the same datasets and the same preprocessing as in the preprint: a reference data set with 24k cells and 3000 selected genes, and a test dataset with 46 cells. When I ran openTSNE.affinity.PerplexityBasedNN(logCountsTasic, metric='correlation', n_jobs=-1), it took 10 minutes on my computer (interestingly, for most of the time only one core was busy, despite n_jobs being set to all cores; I guess the version of pynndescent that you are using here is not yet multithreaded?). In comparison, direct calculation of the full 24k x 46 matrix of all correlations takes less than a second... which is what I've been using so far.

pavlin-policar commented 5 years ago

Yes, unfortunately, I still rely on pynndescent to do this. Even so, that's surprisingly slow. Just yesterday, I was generating figures, and the reference embedding had 44k points, the new data 28k points. The entire process, including optimization, took about 5 minutes. Granted, I used 1000 genes instead of 3000. So it's very surprising that yours took so long.

The algorithm is actually pretty fast on a single core, but the implementation doesn't know how to take advantage of all available cores. Eventually, I plan to get rid of it, but I have yet to find a suitable replacement. I may just end up rolling out my own implementation...

dkobak commented 5 years ago

I've tried it with 1000 genes now and PerplexityBasedNN(logCountsTasic, metric='correlation', n_jobs=-1) took just a little over 1 minute. But with 3000 genes, it takes over 10 minutes. I agree, it is strange that the slowdown is so strong, given that I think UMAP has examples processing data with millions of features (directly in sparse format).

But still, direct calculation of all correlations takes 0.1 s. I guess when the test dataset is very small, brute force is faster than approximate nearest neighbours.

The algorithm is actually pretty fast on a single core, but the implementation doesn't know how to take advantage of all available cores. Eventually, I plan to get rid of it, but I have yet to find a suitable replacement.

Doesn't the current version of pynndescent have multi-core support? I thought it was added recently.

pavlin-policar commented 5 years ago

Doesn't the current version of pynndescent have multi-core support? I thought it was added recently.

I think they're working on it. The last time I checked (a couple weeks ago) it didn't seem to be fully integrated yet. Same for sparse.

The main issue is that we can't just depend on it i.e. throw it into the requirements.txt, because conda-forge doesn't allow that. If I want to add a package to the requirements, it needs to be on conda-forge, which pynndescent isn't. This means that I basically need to keep a copy of pynndescent inside this repo. Which is less than ideal.

dkobak commented 5 years ago

The last time I checked (a couple weeks ago) it didn't seem to be fully integrated yet. Same for sparse.

Hmm. My impression was different. Multi-threaded implementation seems to have been merged in March: https://github.com/lmcinnes/pynndescent/pull/12, and sparse matrix support in April https://github.com/lmcinnes/pynndescent/pull/16.

If I want to add a package to the requirements, it needs to be on conda-forge, which pynndescent isn't. This means that I basically need to keep a copy of pynndescent inside this repo. Which is less than ideal.

Yes, I understand that. However, you already have a frozen copy of pynndescent here :) Even though it's less than ideal, it's clearly better to have a frozen copy of a more recent version rather than of an outdated one.

I guess the question is how easy it is to update the copy of pynndescent. Does it require a lot of work?

pavlin-policar commented 5 years ago

Oh, yeah, you're right. You mentioned this in #28. I think I did try to port that code over here, but I seem to have dropped it for some reason. Maybe something wasn't working properly at the time. There seems to be quite a bit of activity there. From my experience, if something's on the master branch and under a lot of development, there are bound to be bugs here and there.

According to pypi, the latest release was 10 days ago, and it looks like they're working to get it merged into scanpy (https://github.com/theislab/scanpy/pull/659), so it must be fairly stable at this point.

You're entirely right and it's probably not difficult, but it needs to be tested. I have written some tests that will catch obvious errors, but "correctness" is harder to put into unit tests. You're welcome to take a crack at it; otherwise, I'll try to get to it in the coming weeks.

dkobak commented 5 years ago

I think this can be closed now, so I'm closing it.

pavlin-policar commented 5 years ago

@dkobak, just wanted to let you know we've put out a preprint on biorxiv, which basically lays out our idea of the transform functionality. It draws heavily on your manuscript "art of using t-SNE" and on our discussions, so I hope we've attributed you properly.

dkobak commented 5 years ago

That's great. Good luck with publication! I will insert the citation when we revise our paper next time (currently waiting to hear back from the journal after a major revision).

I read it now, so here are some very brief comments:

  1. Eq (1) denominator: k \ne 1 instead of "i"
  2. Eq (8) denominator: small $d$
  3. "by adapting the same polynomial interpolation based approximation, this is reduced to O(N)" -- why O(N)? Shouldn't it be O(M)? Not sure I understand.
  4. Momentum clipping that you mentioned here is not mentioned in the paper; I think it's totally worth listing all optimisation tricks that you use.
  5. As far as I understood, you use cosine distance on 50 PCs for the initial embedding and cosine distance on 1000 genes for the transform(). So does it mean that you are replacing the distance metric for transform(), as we previously discussed here? If yes, do you plan to implement it more natively in your code?
  6. BTW, it's not mentioned how exactly you select these 1000 genes.
  7. There is some analysis of what Shekhar bipolar clusters correspond to what Macosko bipolar clusters in the Shekhar paper. You could use it for additional "validation".
  8. Figures 5 and 6 look rather "supplementary" to me because they are not about transform(). I had a feeling they somewhat break the flow of the paper.
  9. Figure 7 was the most interesting for me personally. It would be great to see a more detailed comparison of panels A vs B... Just looking at the figure, it is not obvious that B is "better". I am not sure there is a straightforward way to argue that it really is, but it'd be great to see at least some analysis of this. E.g. some Shekhar cells in B are in the "white areas" between Macosko clusters. Can one say something about them and somehow argue that it really is more appropriate to place them there?..

pavlin-policar commented 5 years ago

Thanks for the awesome feedback, we'll be sure to consider this.

why O(N)? Shouldn't it be O(M)? Not sure I understand.

Actually, this should be O(max{N, M}), but in the cases I worked on, the reference is larger. The reason is that we're interpolating from the reference data (which contributes O(N)) to the secondary data (which contributes O(M)). The matrix multiplication is still done through the FFT, and the number of interpolation points doesn't depend on the number of samples. Hence O(max{N, M}).
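Roughly: spreading the N reference points onto the interpolation grid is O(N), the convolution on the grid is done via FFT and depends only on the (fixed) number of interpolation nodes, and evaluating the interpolated values at the M new points is O(M); so the total is O(N) + O(M) = O(max{N, M}).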

Momentum clipping that you mentioned here is not mentioned in the paper; I think it's totally worth listing all optimisation tricks that you use.

The reason I didn't mention it is that lowering the learning rate kind of removes the need for that. I still keep it on for safety, but it doesn't usually contribute much. I suppose it's worth mentioning.

As far as I understood, you use cosine distance on 50 PCs for the initial embedding and cosine distance on 1000 genes for the transform(). So does it mean that you are replacing the distance metric for transform(), as we previously discussed here?

Yes, so we use the cosine distance in both cases, and that's exactly what it means. Cosine distance is very similar to correlation and works better in higher-dimensional spaces. In the embedding space, we still use Euclidean distance to estimate the gradients. But that's in 2D, and Euclidean makes a lot of sense there.
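(For reference: cosine distance is 1 - (x . y) / (||x|| ||y||), and correlation distance is the same quantity computed after centring each vector by its mean, so the two coincide on mean-centred data.)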

If yes, do you plan to implement it more natively in your code?

I'm not sure I follow: to compute any similarities, we rely on pynndescent, which includes both Euclidean and cosine distances. So I guess they're both treated as equivalent.

I'd argue that the optimization is important. The reason being that median initialization will usually put points straight into the middle of some clusters. Running the optimization for a bit will consider more neighbors, and the distances between points will be more closely related to what we see in the reference embedding. Some points drift away from the clusters after optimization, but this could indicate that these points are just a bit different from what we see in the reference. Or, in Fig. 7, bipolar cells are placed kind of between bipolar cells and the light blue ones, and optimization does a good job of clustering them together.

Obviously, it's hard to argue which is better from a visual perspective, and I'll likely look into it further.

In any case, it seems more intuitive that if we make a reference t-SNE embedding and want to add points to it, we'd expect the new points to be optimized in the same way. And we did have to mention a few things fairly briefly because we had a hard 15-page limit.

Again, thanks for the feedback.

Also, just a couple things I noticed while going over your paper that you may want to fix in your next revision :)

dkobak commented 5 years ago

I'm not sure I follow

What I meant is that to implement your pipeline in openTSNE one needs to do something like this (as you showed me earlier in this thread):

embedding.affinities = openTSNE.affinity.PerplexityBasedNN(x_train_genes, metric="cosine")
new_embedding = embedding.transform(x_new_genes)

which is a pretty advanced usage because one needs to "manually" change the affinities object after constructing the embedding and before running transform(). So I was wondering if it could be worthwhile to add extra parameters to transform(). However, thinking about it now, I am not sure what's the best way to set it up: one would need to pass new training data (x_train_genes) into transform(), and this is arguably also cumbersome.

I'd argue that the optimization is important. [...] Some points drift away from the clusters after optimization, but this could indicate that these points are just a bit different from what we see in the reference.

I am not saying it isn't important, I'm just saying that ideally I would like to see a more explicit demonstration of this... But maybe the only way to do it convincingly is to generate some kind of synthetic toy dataset with known ground truth.

Also, just a couple things I noticed while going over your paper that you may want to fix in your next revision :)

Thanks a lot! All good catches.

pavlin-policar commented 5 years ago

Oh, I see! Yes, it's a bit annoying to do, and I've set up a notebook demonstrating how to do this (here)... I've given this quite a bit of thought, and I don't think there's an elegant way to incorporate this into transform. openTSNE is a general-purpose t-SNE library, and cluttering the API with single-cell-specific workflows seems like a really bad idea: first, because a lot of people don't do single-cell analysis, and second, because too many parameters can be a bit overwhelming. Having to write out your code snippet each time doesn't seem too harsh; it's two lines, it makes it very explicit what you're doing, and it makes it very easy to adjust the parameters if needed.

I think having a good example notebook that is easy to find should be sufficient.

But maybe the only way to do it convincingly is to generate some kind of synthetic toy dataset with known ground truth.

That's a great idea, but I'll have to put a lot of thought into how to design a compelling experiment. I like the figure in the paper quite a bit, but I realise that may be just because I've been looking at it for so long. A synthetic example might be more compelling.