pavlin-policar / openTSNE

Extensible, parallel implementations of t-SNE
https://opentsne.rtfd.io
BSD 3-Clause "New" or "Revised" License

A bunch of comments and questions #7

Closed dkobak closed 5 years ago

dkobak commented 5 years ago

Hi Pavlin! Great work. I did not know about Orange but I am working with scRNA-seq data myself (cf. your Zeisel2018 example) and I am using Python, so it's interesting to see developments in that direction.

I have a couple of scattered comments/questions that I will just dump here. This isn't a real "issue".

  1. You say that BH is much faster than FFT for smaller datasets. That's interesting; I did not notice this. What kind of numbers are you talking about here? I was under the impression that with n<10k both methods are so fast (I guess all 1000 iterations under 1 min?) that the exact time does not really matter...

  2. Any specific reason to use "Python/Numba implementation of nearest neighbor descent" for approximate nearest neighbours? There are some popular libraries, e.g. annoy. Is your implementation much faster than that? Because otherwise it could be easier to use a well-known established library... I think Leland McInnes is using something similar (Numba implementation of nearest neighbor descent) in his UMAP; did you follow him here?

  3. I did not look at the actual code, but from the description on the main page it sounds like you don't have a vanilla t-SNE implementation in here. Is that true? I think it would be nice to have vanilla t-SNE in here too. For datasets with n=1k-2k it's pretty fast and I guess many people would prefer to use vanilla t-SNE if possible.

  4. I noticed you writing this in one of the closed issues:

    we allow new data to be added into the existing embedding by direct optimization. To my knowledge, no other library does this. It's sometimes difficult to get nice embeddings like this, but it may have potential.

    That's interesting. How exactly are you doing this? You fix the existing embedding, compute all the affinities for the extended dataset (original data + new data) and then optimize the cost by allowing only the positions of the new points to change? Something like that?

  5. George sped up his code quite a bit by adding multithreading to the F_attr computations. He is now implementing multithreading for the repulsive forces too. See https://github.com/KlugerLab/FIt-SNE/pull/32, and the discussion there. This might be interesting for you too. Or are you already using multithreading during gradient descent?

  6. I am guessing that your Zeisel2018 plot is colored using the same 16 "megaclusters" that Zeisel et al. use in Figure 1B (https://www.cell.com/cms/attachment/f1754f20-890c-42f5-aa27-bbb243127883/gr1_lrg.jpg). If so, it would be great if you used the same colors as in their figure; this would ease the comparison. Of course you are not trying to make comparisons here, but this is something that would be interesting to me personally :)

pavlin-policar commented 5 years ago

These are all excellent questions, and I should include a lot of this in the readme, but I've found that writing good documentation is quite hard.

I did not know about Orange but I am working with scRNA-seq data myself

Orange actually provides a rich ecosystem of add-ons for text-mining and image embedding to name a few. Over the past year or so, we've been working on a single-cell version - scOrange, which adds a lot of popular techniques from the widely used R seurat package and others. It's not the greatest for dealing with really big data sets as of this moment, but we'll get there.

  1. So I should probably rephrase this since both are fast for smaller data sets, but BH is instantaneous while with FFT we have to wait a couple seconds. I'm sure they both go in under a minute, but since we are incorporating this into Orange, a tool based on interactivity, instant versus a couple seconds matters. Same goes for nearest neighbor search: the exact search is near instantaneous, while the approximate one has some overhead which takes a couple of seconds. On smaller data sets the difference will probably end up being seconds or minutes, but like I said, in an interactive environment, this is important for user experience.

  2. I think I tested a couple libraries for NN search. One criterion I had was easy packaging and ideally, it would have a conda package. I was told C++ makes things hard to package, so a lot of popular libraries (including annoy) fell through there. I don't remember exactly how I came to settle on pynndescent, but I did try several other implementations. I didn't benchmark this much, but it did orders of magnitude better than exact search, so I was satisfied (see MNIST benchmark in readme). And if it turns out there's a better method, we can always just switch it out. Looking through the code, a single line is missing to make this extensible so the user could potentially implement their own NN search method and pass that as an argument, so I'll probably add that. I did notice a few days ago that pynndescent is actually written by Leland McInnes, and he just copied the code over to UMAP, so there must be some merit to it.

  3. That is true, there is no exact tSNE implementation. I don't think this is much of an issue since both BH and FFT approximations produce results that are almost indistinguishable from exact tSNE. The gradient could be computed exactly by setting the BH angle parameter to 0, but there is currently no way to exactly compute the affinity matrix for neighbors. Once I add that one line this will be possible, but, again, I don't think this is really all that important. Have you ever found a case where the exact embedding was much different from an approximate embedding?

  4. Yes, so it's similar to what you describe. I fix the existing embedding, compute the affinities from the new points to the existing points and optimize the new points' positions. I do not consider interactions between the new points. One reason for this is that I believe I'd have to rebuild the KNN index if I did this, which could take a long time. The other reason is purely from a usability standpoint - I want to know how new data relates to my existing data - what existing clusters do new data points correspond with. If I have a representative sample in the existing embedding, this should work fine. There are problems with this if we have points that aren't similar to any of the existing ones, and thinking about it now, it may make sense to include those interactions as well. I'll have to play around with this some more...

  5. While benchmarking, I too found that computing the positive forces was the main bottleneck - even with multithreading. Everything in the cython code is multithreaded - where possible and reasonable (sometimes the overhead of running multiple threads makes it slower than single threaded). I would note that since I replaced FFTW with numpy's implementation of FFT (for ease of packaging) the negative forces part has been slower, and I haven't had the time to properly benchmark that. I am hoping to speed this back up by adding the option to use FFTW if it's available on the system, or by looking into Intel's MKL FFT implementation, which looks promising.

  6. I got the Zeisel data set from Linnarsson Lab and they provide several labellings for the cells. I took a lower level of the ontology than the one they show in the figure you provided, so there are actually 30 cell types on my plot. The colors overlap because there are too many. I didn't really compare the plots - I am guessing the plot in the paper uses a bigger perplexity than I did (I just used the default 30), so the telencephalon neurons cluster together more nicely than in my plot. The plot looked neat and I had it at hand, so I included it here. It serves no purpose other than being pretty :smile: and I'll probably replace this one with a bigger data set when I get around to running that.

I believe that several things can still be done to improve this code, and speed the FFT back up - I think I've pretty much optimized the other parts as much as possible. The goal of this package is to be easy to use and easy to install - I found running FIt-SNE quite daunting at first, and then I ran it through the R interface. I much prefer working with numpy arrays to passing data around through text files.

I haven't really done much with adding new points, but it is sometimes hard to get nice plots. Increasing exaggeration here can help a lot for a couple iterations.

Thanks for your questions and feedback, they're very welcome.

dkobak commented 5 years ago

Thanks a lot for the detailed reply! Everything that you said makes sense.

What you wrote about projecting new points is very interesting:

I fix the existing embedding, compute the affinities from the new points to the existing points and optimize the new points' positions. I do not consider interactions between the new points.

This is similar to what I have been doing, which is to position each new point at the median position of its e.g. 25 nearest neighbours among the old points. This is very simple because it does not require any optimization at all. What you do is similar but slightly more complicated: you find e.g. 90 nearest neighbours, compute affinities using perplexity 30, and then optimize the location of the new point based on the t-SNE loss. Right? So the new point is repulsed from all old points and attracted to these 90.
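In code, my simple placement amounts to something like this (a rough sketch with sklearn's NearestNeighbors, not my actual script):

# Place each new point at the median 2D position of its k nearest
# neighbours among the old (already embedded) points.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def place_new_points(X_old, Y_old, X_new, k=25):
    nn = NearestNeighbors(n_neighbors=k).fit(X_old)
    _, idx = nn.kneighbors(X_new)          # k nearest old points for each new point
    return np.median(Y_old[idx], axis=1)   # median of their embedding coordinates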

I am wondering how different this is in practice. How do you initialize the positions of the new points? If you use something like what I am doing as an initialization (which would be sensible, I guess), then did you look at how the result changes after the optimization?

There are problems with this if we have points that aren't similar to any of the existing ones, and thinking about it now, it may make sense to include those interactions as well. I'll have to play around with this some more...

Yes, but this becomes a very different setting then. I tried it some time ago, and if there is some "batch effect" present, i.e. the new points are substantially different from the old points (e.g. think different RNAseq protocol), then all new points will gather in a cluster on their own, outside of all old clusters. I did not find this very useful.

dkobak commented 5 years ago

P.S. Just noticed that you updated the Zeisel runtime to 2 minutes. This is very impressive for n=160k. Did you actually compare it to https://github.com/KlugerLab/FIt-SNE with the same parameters on the same machine?

pavlin-policar commented 5 years ago

So, regarding exact tSNE - I suppose some people may want to try it, but I still feel that this will very rarely come up. I believe scikit-learn provides exact tSNE, and I don't think that has any performance issues (their BH implementation is notoriously slow), so that's readily available. I may add it at some point, but it really isn't high on my list of priorities.

... So the new point is repulsed from all old points and attracted to these 90.

You're spot on. I tried three types of initialization:

- random initialization,
- PCA: sometimes PCA is used to find a good initial embedding for the training set. This sometimes works, but for sc data specifically, I found it gives terrible results. It smears similar points across the space and tSNE has to group them back together,
- a weighted mean of each new point's nearest neighbors' positions in the existing embedding (the subsequent optimization doesn't move the points much from this).

I haven't thoroughly compared any of these methods, apart from verifying I can get reasonable results. I think it's safe to say that using the weighted mean is probably the best of the three options.

Our idea was that we could likely get tSNE to run faster if we embed only a subset of points then add the rest of the data set in afterwards, using this direct optimization technique. This was before I ran across George Linderman's paper. With his method, things run so fast that there's no need for this trickery.

Yeah, I've been playing around with how to deal with batch effects, but it's a hard problem. Like you said, they cluster completely separately. I verified this with UMAP and get similar results.

Just noticed that you updated the Zeisel runtime to 2 minutes.

So this may be a little bit misleading, I suppose - it's meant as a wow factor and requires clarification :) I first preprocessed the data using SVD and took the first 100 components - this is fairly standard (typically people use PCA, but since sc data are sparse, PCA uses a lot more memory due to the centering step). I used the default perplexity 30, and the runtime is quite dependent on this parameter. Also, here I'm still using FFTW - using numpy FFT (as currently implemented on master) would likely be a bit slower. I am planning to include the option of using FFTW if it's available. So the entire pipeline took longer to run, but the tSNE part did run in exactly 2 minutes. I did run it on 8 cores on an Intel Xeon, but these have lower clock speeds than typical i5 or i7 processors. This is all a bit much to include at the top of the readme, but when I include proper benchmarks, I'll be sure to describe everything in detail.
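Roughly, that preprocessing step looks something like this (a sketch on toy data, not the exact pipeline):

# Truncated SVD on the sparse count matrix (no centering), keeping 100
# components; this avoids densifying the data the way centred PCA would.
from scipy import sparse
from scipy.sparse.linalg import svds

counts = sparse.random(10_000, 2_000, density=0.05, format="csr")  # toy stand-in for the counts
U, s, Vt = svds(counts, k=100)
X_reduced = U * s   # 100-dimensional representation passed on to tSNE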

I did not check how it compares to the C++ implementation - I have not had the time to run a proper set of benchmarks, but I will include them in October (if not sooner), when my schedule clears up a bit. They are likely comparable, since both use FFTW; maybe this one is a little bit faster because I use NN descent and George uses annoy, which the benchmark you linked to says is slower.

dkobak commented 5 years ago

I think it's safe to say that using the weighted mean is probably the best of the three options.

I agree. Random initialization should not work very well, as you say. PCA either. It's good to know that optimizing tSNE loss does not change the positions too much, compared to the weighted mean initialization. I guess using tSNE loss can be seen as more principled though.

sometimes PCA is used to find a good initial embedding for the training set. This sometimes works, but for sc data specifically, I found it gives terrible results. It smears similar points across the space and tSNE has to group them back together.

This is actually not at all what I observe. I always recommend PCA for initializing t-SNE. I briefly looked at your code and it seems that when you use PCA, you just take the raw PCA projections as the initialization? The standard initialization is random Gaussian with a very small variance, and this is important to get t-SNE to converge nicely. So when I use PCA initialization, I take PCA projection and scale both x and y to have variance=0.01. I am wondering if this is maybe why you had "terrible" results.
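In code, what I do is roughly this (a sketch, not my actual script):

# PCA initialisation rescaled so that each embedding dimension has
# variance 0.01 (std 0.1), as described above.
import numpy as np
from sklearn.decomposition import PCA

def pca_initialization(X, target_std=0.1):
    Y = PCA(n_components=2).fit_transform(X)
    return Y / Y.std(axis=0) * target_std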

(typically people use PCA, but since sc data are sparse, PCA uses a lot more memory due to the centering step)

Interesting remark! That's true. I usually use PCA after selecting ~1000 most variable genes, and I found that even for n=1.3mln dataset this works reasonably fast. But it's a good trick to use SVD without centering.

pavlin-policar commented 5 years ago

you just take the raw PCA projections as the initialization?

Yes, I believe I got this from scikit-learn. In that case, it's possible that they've got it wrong as well.

The standard initialization is random Gaussian with a very small variance, and this is important to get t-SNE to converge nicely.

Really? I was not aware of this at all. Why exactly is small variance so important?

So when I use PCA initialization, I take PCA projection and scale both x and y to have variance=0.01. I am wondering if this is maybe why you had "terrible" results.

I'll definitely have to try this. If small variance is really that important, this is very likely the reason for the bad results. Thanks for pointing this out!

I usually use PCA after selecting ~1000 most variable genes, and I found that even for n=1.3mln dataset this works reasonably fast.

Interesting, my approaches have been either straight up SVD, or some feature selection e.g. most variable genes - but I pick less of them e.g. 100-500 - and running tSNE on that directly. I'll have to try this method as well.

dkobak commented 5 years ago

Why exactly is small variance so important?

I am not entirely sure, and I did not experiment with it too much myself. My intuition is that if the "output" distances are large then the gradients will be small, and the optimization can get stuck.

But in any case, note that the standard initialization is Gaussian with std=0.0001, and if this value were irrelevant then people would just use std=1.
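Concretely, that standard initialisation is nothing more than:

import numpy as np

n = 10_000                             # number of points, for example
Y_init = 0.0001 * np.random.randn(n, 2)   # Gaussian with std 1e-4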

pavlin-policar commented 5 years ago

But in any case, note that the standard initialization is Gaussian with std=0.0001, and if this value were irrelevant then people would just use std=1.

Good point. I'll fix this ASAP. Thanks for pointing this out.

dkobak commented 5 years ago

I guess this can be closed now.

BTW, I am preparing a write-up on "How to use t-SNE effectively for scRNA-seq data". It will mainly focus on some useful tricks to preserve global geometry (I see in papers that everybody does this very suboptimally), but I am also discussing things like projecting new cells on an existing t-SNE atlas of a reference dataset (simple method - simply using knn), etc. I have all the code and figures almost ready, but still have to write the text. Should be done by the end of the month, I hope. If you like, I can send you the draft when it's ready; I'd very much appreciate your comments.

pavlin-policar commented 5 years ago

Sounds very interesting. I'd definitely be interested in giving it a read.

linqiaozhi commented 5 years ago

Great discussion here! Just one comment about the following:

(typically people use PCA, but since sc data are sparse, PCA uses a lot more memory due to the centering step)

Interesting remark! That's true. I usually use PCA after selecting ~1000 most variable genes, and I found that even for n=1.3mln dataset this works reasonably fast. But it's a good trick to use SVD without centering.

I presume you are using Lanczos methods for SVD of this large sparse matrix. You might consider using randomized PCA, instead (here is the standard review). There are advantages of randomized PCA over the Lanczos iteration methods, some of which are described here (let me know if the paywall gets in your way) in the context of our Matlab implementation.

I bring it up here though because you do not have to pre-center the data, but rather can keep it sparse throughout the algorithm. Randomized PCA is essentially just matrix multiplication, so you can just keep subtracting out the center after each matrix multiplication. You can see how we do this subtraction here.

The python package fbpca does this, so you might consider trying it out. Something like this is definitely possible with R, but I don't think the randomized SVD R package rsvd supports sparse matrices.

I mention this because SVD without first centering is not technically PCA (i.e. it's not the eigenvectors of the covariance matrix), but rather an (optimal) low rank approximation to the data. The resulting representations can be very different, but it's not clear to me if the distances between points would be very different (which is all t-SNE cares about). But using randomized methods you can get PCA without making the sparse matrix dense, so you might consider doing that to avoid any issues :) and not to mention that randomized methods are so fast, you don't even need to subselect your genes for computational reasons (although there might be other reasons to do that, of course).
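Roughly, with fbpca it would look something like this (toy data; raw=False gives PCA with the centering handled implicitly, raw=True gives the plain SVD of the sparse matrix):

import fbpca
from scipy import sparse

counts = sparse.random(10_000, 2_000, density=0.05, format="csr")  # toy sparse counts
U, s, Vt = fbpca.pca(counts, k=100, raw=False, n_iter=4)           # never densifies the matrix
X_reduced = U * s   # 100-dimensional PCA scores to feed into t-SNE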

pavlin-policar commented 5 years ago

I'm not familiar with the details of these methods, but Lanczos seems a lot like power iteration. I had heard of randomized methods for PCA, but never tried them. I guess I had always supposed that centring was a crucial step for PCA. I also just used the default implementations from scikit-learn, and I assume those are pretty well done. Thanks, I'll definitely take a look!

I used SVD because it was fast and didn't require me to form a dense matrix of my data. Since the data were also on the same scale i.e. in sc data the gene counts are comparable and meaningful to compare inside a single experiment, I assumed it wouldn't be too problematic. Of course, all of my assumptions could be wrong - I haven't been working with sc data very long and I am constantly being surprised. I'll definitely look at the randomized methods!

dkobak commented 5 years ago

Hey Pavlin. We have finally made our preprint available: https://www.biorxiv.org/content/early/2018/10/25/453449. I hope you don't mind that I thanked you in the acknowledgments for discussions! The format of the paper is a bit odd, due to the journal requirements where we decided to send it first. I'd obviously be very grateful for any comments.

pavlin-policar commented 5 years ago

Hi Dmitry, thanks!

I hope you don't mind that I thanked you in the acknowledgments for discussions!

Not at all.

I've given it a quick read and it's really interesting and well written.

I'd never heard of the multiscale approach you mention and I'll definitely give that paper a read. What a neat trick! You write that you found the global perplexity to be good around N/50 - why 50? I guess it's more of a guideline, but did you check what ranges are acceptable? Also, how sensitive is this to specific data sets?

In the section about adding new points to an embedding, you are using data from two different sources, e.g. Macosko and Shekhar. I found that between sources, the batch effect can be quite large, and it's interesting that you get good results. Did you address batch effects in any way or do you think that maybe just using the nearest neighbors was good enough? My guess is that if the batch effect is large enough, then if we optimize the new points with the t-SNE criterion the points would move away from the reference embedding and the resulting embedding would be meaningless.

I am wondering about the data set alignment. Point initializations don't really restrict the point positions very much, so if you had a much larger data set, and "aligned" it to a much smaller initial set, my guess would be that, if left to optimize for a large number of iterations, the new embedding could look quite different from the original embedding. To me, it seems this would work if 1. you don't have too many new points (relative to the initial data) and 2. if the new data come from the same distribution as the old data i.e. you don't have any new cell types. Otherwise I don't think we can assume what the result will look like. This doesn't at all take away from the use case, which I think is very helpful and the example you showed is cool, but I think the approach can sometimes fail. Any thoughts?

I'm also curious about the runtime of the 10x data set. I didn't think exaggeration had that much of an impact. I really like Figure 7, it's very pretty and really makes the impact of each step very apparent.

I'll give it a proper read ASAP; the multiscale approach seems particularly useful and I'll definitely try to incorporate it into my library. Thanks for sharing, even at a glance, there's a lot of useful stuff here :)

dkobak commented 5 years ago

You write that you found the global perplexity to be good around N/50 - why 50? I guess it's more of a guideline, but did you check what ranges are acceptable? Also, how sensitive is this to specific data sets?

I guess I shouldn't have written N/50, it makes it sound too definitive. I might change it to N/100 -- N/10 in the revision. I don't remember seeing useful results with perplexity larger than N/10; I guess p_j|i get too close to the uniform distribution then (it gets exactly uniform at N-1). With perplexity less than N/100 one quite clearly is not capturing any global geometry anymore. So this gives some range.

Did you address batch effects in any way or do you think that maybe just using the nearest neighbors was good enough? My guess is that if the batch effect is large enough, then if we optimize the new points with the t-SNE criteron the points would move away from the reference embedding and the resulting embedding would be meaningless.

I did not do anything about the batch effect (apart from using correlation distance which I think should be a bit more robust than Euclidean distance; but I haven't experimentally compared them, to be honest). I think if the nearest neighbours are meaningful, then the points will not move away from the reference embedding even if the batch effect is enormous. A new cell is still attracted to its nearest neighbours, right?

Point initializations don't really restrict the point positions very much, so if you had a much larger data set, and "aligned" it to a much smaller initial set, my guess would be that, if left to optimize for a large number of iterations, the new embedding could look quite different from the original embedding. To me, it seems this would work if 1. you don't have too many new points (relative to the initial data) and 2. if the new data come from the same distribution as the old data i.e. you don't have any new cell types.

"Point initialiasations don't really restrict the point positions very much" -- I don't think that's true. For large datasets they restrict the final positions quite a lot, and one needs to take special care to prevent this from happening (e.g. early exaggeration etc.). So one can switch early exaggeration off, or make it small, etc. Of course if the new dataset is very different from the old, then the embedding can get very distorted, you are right. So this procedure assumes that the data distributions are similar. The number of points is less of an issue, I think.

I'm also curious about the runtime of the 10x data set. I didn't think exaggeration had that much of an impact.

Are you asking how long it took? I have some approximate benchmarks for 1mln points in the paper; for 1.3 mln points it's a bit slower than that but not too much. Regarding exaggeration -- yes, this seems to be really essential once one is beyond N=1e+6.

the multiscale approach seems particularly useful and I'll definitely to incorporate it into my library

To be completely honest, multiscale worked really well for the Tasic et al. 2017 dataset, but wasn't super useful for other datasets. One can argue that Tasic 2017 has the clearest hierarchical structure though. Still, it's quite easy to implement, and it's been described in the literature, so why not! If you implement it here, then I'd be happy to include a link to your library into the revision, because then your library would be capable of everything that we are doing in the paper. (Ah, wait, do you have functionality for using fixed sigma? I'd suggest to add that too, it's even simpler to do.)

By the way, by a remarkable coincidence, there was a related preprint published on the same day: https://www.biorxiv.org/content/early/2018/10/24/451690. I think you might find it interesting too.

pavlin-policar commented 5 years ago

I did not do anything about the batch effect

Actually, I just remembered from some of my old experiments that dropout-based feature selection does kind of implicitly mitigate batch effects; not very well, but I guess it must have been good enough here.

I think if the nearest neighbours are meaningful, then the points will not move away from the reference embedding even if the batch effect is enormous. A new cell is still attracted to its nearest neighbours, right?

My reasoning against this is that if the batch effect is large, then the nearest neighbors will actually be very far away from the new point (probably less so for correlation distance, but it's pretty big for euclidean distance). This, in turn, means that the sigma for that point will have to be very large or equivalently, the Gaussian will have to be very flat to accommodate the nearest neighbors. This, in turn, means that the probabilities of each neighbor will be fairly low and the embedding will try to accommodate the low probability by positioning it farther away from the nearest neighbors. Have I made a mistake somewhere? Again, this only happens if we optimize, which you don't, and even still you might not see this because you kind of implicitly mitigate batch effects via feature selection.

For large datasets they restrict the final positions quite a lot, and one needs to take special care to prevent this from happening

Interesting, that's really very useful then, and this could explain the issue I was having with non-scaled PCA initialization. I suppose it's the scale of the embedding that restricts the points? On a large scale the repulsive forces are smaller, so the points don't move as much? I wonder what would happen if you took the initial embedding, scaled it down to have small variance, embedded the new points and, only then, ran the optimization.

Are you asking how long it took?

Yes, but I see I missed that part in the manuscript - sorry about that.

To be completely honest, multiscale worked really well for the Tasic et al. 2017 dataset, but wasn't super useful for other datasets.

But are the results ever worse than using the other method - perplexity annealing - which you use in almost all the other examples? The multiscale version is just a regular t-SNE run, we just consider more neighbors with different weights. Perplexity annealing requires two steps - if I got this correctly - first we run regular t-SNE (both early exaggeration and no exaggeration) with perplexity e.g. 500, then another run of t-SNE (again, both phases), but with a lower perplexity. In both cases we need to compute 1500 nearest neighbors for each point, but perplexity annealing requires twice as many iterations.

If you implement it here, then I'd be happy to include a link to your library into the revision

Sure thing! I'll add the missing pieces. You and George are doing fantastic work on the C++ implementation, but distributing a C++ package for all platforms is a nightmare, hence this package.

Ah, wait, do you have functionality for using fixed sigma? I'd suggest to add that too, it's even simpler to do.

Do you mean that instead of adapting each \sigma_i for each point and finding it using perplexity and binary search, we just use a fixed, predefined value? If so, yes - that's super easy. Is this for the case you describe in the text where you have a large data set but small groups of cells ~50 cells for a particular type? How difficult is it to find a good sigma value? It seems like it might require a lot of trial and error, or is there a principled way to find a good setting?

By the way, by a remarkable coincidence, there was a related preprint published on the same day

I'll definitely take a look, thanks!

dkobak commented 5 years ago

This, in turn, means that the sigma for that point will have to be very large [...] This, in turn, means that the probabilities of each neighbor will be fairly low and the embedding will try to accommodate the low probability by positioning it farther away from the nearest neighbors. Have I made a mistake somewhere?

Yes, I think it's wrong. Sigma might be large, but all p_{j|i}'s are normalized to sum to 1 anyway. Think about what happens with strongly outlying points in t-SNE: they don't stay outside of the embedding but usually get attracted to the other points quite strongly and rather get attached to the next cluster.
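A tiny numeric illustration of the normalization point (made-up distances d from one outlying point to a few of its neighbours):

import numpy as np

d = np.array([10.0, 11.0, 12.0, 30.0])       # large distances, as with a strong batch effect
for sigma in (1.0, 5.0, 50.0):
    p = np.exp(-d ** 2 / (2 * sigma ** 2))
    print(sigma, np.round(p / p.sum(), 3))   # p_{j|i} always sums to 1, whatever sigma is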

I suppose it's the scale of the embedding that restricts the points? On a large scale the repulsive forces are smaller, so the points don't move as much? I wonder what would happen if you took the initial embedding, scaled it down to have small variance, embedded the new points and, only then, ran the optimization.

To be honest I'm not exactly sure about the mechanics of this, but on the small scale the points are more "transparent" and can move through each other easily. On large scale they don't seem to be able to move through distant points that easily anymore. I'd like to understand this better.

I did have to scale the initial embedding down in some cases :) I have this remark in the methods, see "Aligning Tasic et al. 2016 and 2017" paragraph: "Note that this initialisation was not scaled (as we do for PCA initialisation, see above) because otherwise it gets distorted during optimisation (however, when applying the alignment procedure to larger data sets, we did have to scale the initialisation as otherwise t-SNE did not converge well)". Also, when doing the downsampled-based initialisation for N=1.3 mln, I scaled down the initialisation to std=0.0001.

But are the results ever worse than using the other method - perplexity annealing - which you use in almost all the other examples. [...]

But I don't use annealing much. Let's see: for Tasic 2017 I use perplexity combination and it works very nicely. For Macosko I did also use perplexity combination. It worked kind of okay with perplexity 50 on its own, but I did like the result with 50+500 more. For Shekhar I used the annealing from fixed sigma, that was a bit of a special case. Tasic 2016 and Harris 2018 (with 1.5k and 3k cells) worked fine with perplexity 50 and PCA initialization (possibly even with random init.). For the N=25000 subsample of the 10x dataset I used perplexity=500. If I remember correctly, 50+500 did not look better (but not worse either).

I think one would need to explore more datasets with N in the 10000--100000 range to say what works more often. In this manuscript I rather tried to show a variety of different approaches that can be used. I think perplexity combination is a nice tool to have.

You and George are doing fantasitc work on the C++ implementation, but distributing a C++ package for all platforms is a nightmare, hence this package.

I am using Python so I might very well switch to your implementation myself :) I have some ideas I want to experiment with and it's easier to experiment with the full Python code.

Do you mean that instead of adapting each \sigma_i for each point and finding it using perplexity and binary search, we just use a fixed, predefined value? If so, yes - that's super easy. Is this for the case you describe in the text where you have a large data set but small groups of cells ~50 cells for a particular type? How difficult is it to find a good sigma value? It seems like it might require a lot of trial and error, or is there a principled way to find a good setting?

Yes, that's what I mean. You need two parameters: the value of sigma and the K nearest neighbours that this sigma is applied to (for perplexity, K is chosen as 3*perplexity, but for fixed sigma one needs to supply K). Yes, I did use for Shekhar data, but I find it's a nice feature to have to play around. Re trial and error -- yes, it does require quite some. See also "Step 4" here https://github.com/berenslab/rna-seq-tsne/blob/master/demo.ipynb and my comments there.
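Schematically, fixed-sigma affinities are just the following (a sketch, not the actual FIt-SNE code): one global sigma applied to each point's K nearest neighbours, followed by the usual row-wise normalisation and symmetrisation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def fixed_sigma_affinities(X, sigma=1.0, K=30):
    n = X.shape[0]
    dist, idx = NearestNeighbors(n_neighbors=K + 1).fit(X).kneighbors(X)
    dist, idx = dist[:, 1:], idx[:, 1:]           # drop each point itself
    p = np.exp(-dist ** 2 / (2 * sigma ** 2))     # one sigma for every point
    p /= p.sum(axis=1, keepdims=True)             # row-wise normalisation of p_{j|i}
    P = np.zeros((n, n))
    P[np.arange(n)[:, None], idx] = p
    return (P + P.T) / (2 * n)                    # symmetrise; entries sum to 1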

pavlin-policar commented 5 years ago

Think about what happens with strongly outlying points in t-SNE: they don't stay outside of the embedding but usually get attracted to the other points quite strongly and rather get attached to the next cluster.

Yes, but I don't understand why this happens... Empirically, the outliers get pulled together, but why?

Sigma might be large, but all p_{j|i}'s are normalized to sum to 1 anyway.

True, but P is normalized as a whole, not row-wise, so all the probabilities will still be scaled the same way. I mean clearly I'm missing something, somewhere, I'll figure it out :)

In this manuscript I rather tried to show a variety of different approaches that can be used.

Like the title says, t-SNE is an art! I'm just thinking about what might be safe defaults. Both the extensions - multiscale and annealing - probably won't hurt the visualization quality, just the runtime. The trend from these 6 data sets seems to be that the more points we have, the more we need to do to introduce global structure. But 6 is very few data sets to draw definitive conclusions from.

I am using Python so I might very well switch to your implementation myself :) I have some ideas I want to experiment with and it's easier to experiment with the full Python code.

Glad to hear it! I completely agree, trying things out in Python is much simpler than C++.

dkobak commented 5 years ago

True, but P is normalized as a whole, not row-wise, so all the probabilities will still be scaled the same way.

But it is normalized row-wise! For each $i$, all $p_{j|i}$ are normalized to sum to 1. After that P is symmetrized and multiplied by 1/N to normalize "as a whole", but the important step is the row-wise normalisation in the beginning!
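To spell it out, the standard definitions are

$$p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N},$$

so each row of conditional probabilities sums to 1 before the symmetrisation step.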

pavlin-policar commented 5 years ago

Wow, you're absolutely right. How on earth did I manage to forget that - I wrote it in the code after all! It's been a while I suppose :) Yes. Then this makes perfect sense. Thanks for clearing this up.

dkobak commented 5 years ago

One thing about multiscale similarities that I want to mention: the approach of Lee et al. (and the one that I implemented in FIt-SNE) is to compute p_{j|i} with perplexity 50 and 500 and then average; i.e. each set of values is row-normalized, then averaged. This is sort of like using a kernel that is an average of two Gaussians, a narrow and a wide one (see page 16 of our preprint). However, I think it's NOT exactly the same; using such a kernel would mean computing p_{j|i} via this kernel and then row-normalizing (only once).

Somehow using such a "multi-scale" kernel makes more sense to me theoretically, but I only realized that there is a difference a couple of days ago, and did not have a chance to try it out yet.

pavlin-policar commented 5 years ago

I wrote a short script to check this (hopefully I got it right)

import numpy as np

normalize = lambda x: x / np.sum(x)
gauss = lambda x, std: np.exp(-x ** 2) / (2 * std ** 2) / std
x = np.array([.5, 1, 2, 4, 6])

# Option 1: compute GMM probabilities -> average -> normalize
result1 = normalize((gauss(x, 1) + gauss(x, 5)) / 2)
# [0.67 0.32 0.02 0.   0.  ]

# Option 2: compute probabilities -> normalize -> average -> normalize
result2 = normalize(normalize(gauss(x, 1)) + normalize(gauss(x, 5)))
# [0.67 0.32 0.02 0.   0.  ]

So it seems they actually are equivalent.

Also, I've just noticed you forgot the minus signs when defining the Gaussian kernel on page 16, i.e. it should read P ~ exp(-mu² / (2 sigma²))

dkobak commented 5 years ago

It should be gauss = lambda x, std: np.exp((-x ** 2) / (2 * std ** 2)) / std and then the results are not the same anymore...

Thanks for noticing the typo!

pavlin-policar commented 5 years ago

Oh, you're right. Well there we go, they aren't equivalent. In that case, I agree, the kernel seems more principled.

Shape-wise they are pretty similar and it seems that differences are fairly small. I wonder whether the difference would be noticeable on the embedding.

import matplotlib.pyplot as plt

# using normalize and the corrected gauss from above
plt.subplot(211)
plt.plot(x, normalize((gauss(x, 50) + gauss(x, 500)) / 2), label='Option 1')
plt.plot(x, normalize(normalize(gauss(x, 50)) + normalize(gauss(x, 500))), label='Option 2')
plt.legend()
plt.subplot(212)
plt.plot(x, normalize((gauss(x, 50) + gauss(x, 1000)) / 2), label='Option 1')
plt.plot(x, normalize(normalize(gauss(x, 50)) + normalize(gauss(x, 1000))), label='Option 2')
plt.legend()

image

dkobak commented 5 years ago

I wonder whether the difference would be noticeable on the embedding.

Exactly -- that's what I am wondering as well. Might try it out during next week.

pavlin-policar commented 5 years ago

@dkobak I suppose another difference between the two methods is that with the multiscale Gaussian mixture, we're evaluating the kernel on the same number of neighbors for all the scales. The way I've implemented it now is to find the largest perplexity, compute nearest neighbors then evaluate a multiscale kernel (like the one on p16 in your paper) on all the neighbors. Please, correct me if this is wrong. I believe this is what is described in the paper you cited.

Conversely, using the other method (so Option 2) computes the probabilities on a different number of neighbors each time. So for perplexity 50, we compute probabilities for 150 neighbors then normalize, and for perplexity 500, we compute probabilities for 1500 neighbors then normalize. Then average the two, then normalize again one final time.

If what I've written above is correct, then the differences will be even larger than the ones I plotted.

dkobak commented 5 years ago

To be honest, I don't think this matters.

The way I've implemented it now is to find the largest perplexity, compute nearest neighbors then evaluate a multiscale kernel (like the one on p16 in your paper) on all the neighbors. Please, correct me if this is wrong. I believe this is what is described in the paper you cited.

That's OK but are you sure this is described in Lee et al? Actually I was under the impression that they are describing Option 2!

Conversely, using the other method (so Option 2) computes the probabilities on a different number of neighors each time. So for perplexity 50, we compute probabilities for 150 neighbors then normalize, and for 500 neighbors we compute probabilities for 1500 neighbors then normalize.

Yes you can compute it like that, but you can as well compute probabilities for all 1500 neighbors with perplexity 50. The probabilities for the 1500-150=1350 non-nearest neighbors will be around 0 anyway. Usually with perplexity 50 we only use 150 neighbors to avoid computing near-zero numbers and save time. But here we have to deal with 1500 neighbors anyway. So we can as well use all 1500. That's how I implemented it in C++.

Then average the two, then normalize again one final time.

Note that you don't need to normalize again. Both sets sum to 1, so the average will sum to 1.
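In rough Python, that "Option 2 on a shared neighbour set" looks like this (a sketch, not the actual implementation; the per-point bandwidths are assumed to come from the usual perplexity binary search, which is omitted here):

import numpy as np

def multiscale_conditional_p(distances, sigmas):
    """distances: (n, k) distances to each point's k = 3 * max(perplexity) NNs;
    sigmas: list of (n,) bandwidth arrays, one per perplexity."""
    parts = []
    for sigma in sigmas:
        p = np.exp(-distances ** 2 / (2 * sigma[:, None] ** 2))
        p /= p.sum(axis=1, keepdims=True)    # row-normalise at this scale
        parts.append(p)
    return np.mean(parts, axis=0)            # rows still sum to 1, no renormalisation needed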

pavlin-policar commented 5 years ago

Actually I was under impression that they are describing Option 2!

Oops, you're right, they describe Option 2. I was only looking at what neighbors they consider, and it seems they consider the same number of neighbors for each perplexity value.

You're right, of course. Thanks for clearing that up!

dkobak commented 5 years ago

If I remember correctly, they are not very concerned with the implementation there. The largest dataset that they analyze is what, N=3000? They probably use exact algorithm and compute full NxN similarity matrix anyway.

pavlin-policar commented 5 years ago

TBH I haven't gone through the entire paper yet. I skimmed through the first couple of sections to get to the multiscale formulation to see what it's about. It seemed simple enough, so I didn't dig deeper yet. The backlog of things I need to read up on is quickly building up :)

dkobak commented 5 years ago

One thing I noticed now: when you change from early exaggeration to no exaggeration (etc.), you run optimize() several times. It seems that every time you re-start the adaptive gradient descent from scratch, is that right? The canonical implementation starts it only once and changes the optimization parameters (momentum and exaggeration) on the fly, without re-starting the adaptive adjustments of the learning rate ("gains").

pavlin-policar commented 5 years ago

Yes, that's right. I saw that scikit-learn does this and it made the code a little bit simpler, so I ran with it.

Resetting the gains every time means the convergence might be a bit slower, but it's not clear to me by how much, and whether or not it's noticeable at all. I am able to produce good embeddings, so this hasn't worried me too much but it has been in the back of my mind. I suppose it may be more noticeable with larger data sets where the number of iterations needed is already quite high, so this may exacerbate it further.

Changing this so the parameters are kept between optimize calls is trivial, so if you think this is a problem, I'll change it.

dkobak commented 5 years ago

I don't know how much it matters in practice. I suppose it might matter for huge datasets when the learning rate is too small - then 100s of iterations can be needed to increase the gains enough for the optimisation to start working, see the Belkin et al. preprint I linked above.

I would change it in any case for consistency with the reference implementation. It should be easy to store the gains array outside of the optimize function. You could have an optional resetGains=False parameter if you want. The scikit implementation has been notoriously problematic btw so I wouldn't take it as a reference.
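For reference, the update in the reference implementation is roughly the following (a sketch; the point is that update and gains live outside the function and are carried across all optimization phases rather than being reset):

import numpy as np

def gradient_step(Y, grad, update, gains, eta=200.0, momentum=0.8, min_gain=0.01):
    flipped = np.sign(grad) != np.sign(update)   # did this coordinate's gradient flip sign?
    gains[flipped] += 0.2                        # increase gain where it flipped
    gains[~flipped] *= 0.8                       # decrease gain where it didn't
    np.clip(gains, min_gain, None, out=gains)
    update = momentum * update - eta * gains * grad
    return Y + update, update, gains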

pavlin-policar commented 5 years ago

The scikit implementation has been notoriously problematic btw so I wouldn't take it as a reference.

I know their BH implementation is horribly slow, but I am not aware of any other problems.

pavlin-policar commented 5 years ago

@dkobak I've been playing around with this a little bit, specifically the Macosko data set.

Firstly, the procedure described in the paper isn't entirely right. This is a UMI data set and when doing feature selection, the threshold is 0 and not 32 (as described in the paper). You can find this in your umi-datasets notebook. If we use the 32 threshold, almost no genes are selected. Your code is correct, but the description in the paper is wrong.

Also, looking at the code, this line is confusing to me

U[:, np.sum(V,axis=1)<0] *= -1

This is done when doing PCA on the 3000 selected genes. What exactly is the point of this?

Secondly, I've been playing around with the multiscale approach we discussed a while ago in this thread. There are noticeable differences between the two methods. Here, I implemented the kernel method, which we agreed seemed more principled.

affinities = Multiscale(x_reduced, perplexities=[50, 500], method='approx', n_jobs=8)
init = initialization.pca(x_reduced)
embedding = TSNEEmbedding(
    init, affinities, negative_gradient_method='fft',
    learning_rate=1000, n_jobs=8, callbacks=ErrorLogger(),
)
embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5, inplace=True)
embedding.optimize(n_iter=750, exaggeration=1, momentum=0.8, inplace=True)

image

Notice that the dark blue bipolar cells don't form the nice star shape in your paper. It took a bit of tinkering to get that shape.

affinities = Multiscale(x_reduced, perplexities=[50, 500], method='approx', n_jobs=8)
init = initialization.pca(x_reduced)
embedding = TSNEEmbedding(
    init, affinities, negative_gradient_method='fft',
    learning_rate=1000, n_jobs=8, callbacks=ErrorLogger(),
)
embedding.optimize(n_iter=250, exaggeration=12, momentum=0.5, inplace=True)
embedding.optimize(n_iter=750, exaggeration=2, momentum=0.8, inplace=True)
embedding.optimize(n_iter=500, exaggeration=1, momentum=0.8, inplace=True)

image

I thought this may be of interest to you.

dkobak commented 5 years ago

Hey, thanks!

Of course the threshold should be 0, this is a mistake in the preprint. Thanks for noticing, I put it on my list of things I need to fix when I do a revision.

Regarding the sign of the components, this is briefly mentioned in the preprint as follows:

The sign of the principal components is arbitrary. To increase reproducibility of the figures, we always fix the sign of the first two PCs such that for each PCA eigenvector the sum of its values was positive (i.e. if it is negative, we flip the sign).

Does it make sense?

Regarding the kernels -- I am glad that the "Gaussian mixture kernel" works fine here. The bipolar star (butterfly?) shape is nice but isn't really important -- I think it's pretty coincidental. I don't think it reflects any meaningful bipolar geometry to be honest. Just looks pretty :) especially together with the rod heart :)

Do you actually have both Multiscale options implemented in here now?

pavlin-policar commented 5 years ago

Does it make sense?

Yes, thanks for clearing that up.

The star shape is nice, but more than that, it probably captures more global structure. Running standard t-SNE with 500 perplexity also exhibits this shape

image

so in addition to looking pretty, it might actually indicate that all the cell lineages have the same origin and the further we move from the center, the more differentiated the cells have become. I may be reading way too much into this little plot, but that seems like a neat interpretation :) Who knew mouse retina cells could be so romantic :)

Do you actually have both Multiscale options implemented in here now?

I've just implemented it now in #34. I haven't tested it much yet.