scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io

Reproducing clustering results for the 1.3 mln dataset #325

Closed dkobak closed 5 years ago

dkobak commented 5 years ago

Are Louvain clustering results supposed to be reproducible? I ran the clustering on the 1.3 mln dataset, exactly following the code provided at https://github.com/theislab/scanpy_usage/blob/master/170522_visualizing_one_million_cells/cluster.py:

import scanpy.api as sc
sc.settings.verbosity = 2
adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')  # load the 1.3 mln cells
sc.pp.recipe_zheng17(adata)  # preprocessing recipe from Zheng et al. (2017)
sc.pp.neighbors(adata)       # kNN graph (falls back to PCA, see warning below)
sc.tl.louvain(adata)         # Louvain clustering on the kNN graph
adata.obs['louvain'].to_csv('clustering-scanpy.csv')

This works fine and I get a clustering result that makes sense when visualized; however, it differs quite strongly from the clustering result that Alex sent me some time ago (as I understood him, those were exactly the clustering results used for the visualization in the Scanpy paper).

My questions:

  1. Is re-running the above snippet supposed to give me the exact same results every time, or are there some random seeds used along the way?
  2. If yes, how can one modify the code to ensure reproducibility?
  3. Is there any way to modify the snippet to get the exact same results as shown here https://github.com/theislab/scanpy_usage/tree/master/170522_visualizing_one_million_cells and in the published Scanpy paper?
  4. If the answer to the previous question is no, could you make those results publicly available somewhere?

I noticed the following warning from sc.pp.neighbors that might be relevant:

WARNING: You're trying to run this on 1000 dimensions of .X, if you really want this, set use_rep='X'. Falling back to preprocessing with sc.pp.pca and default params. Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose svd_solver='arpack'. This will likely become the Scanpy default in the future.

falexwolf commented 5 years ago
  1. It gives exactly the same results when rerunning on the same platform, but scikit-learn's randomized PCA sometimes fails to reproduce exactly the same results when run on different platforms. That's why the warning has been output for some time now.
  2. Setting svd_solver='arpack' resolves that problem (see the sketch after this list).
  3. That's probably hard: at the time, the results were produced with the randomized solver, in the expectation that setting the same seed would reproduce them on a different platform as well.
  4. What I uploaded for you at the time were the clustering results shown in the PAGA paper and the Scanpy paper; they are still linked in the issue in the PAGA repo: https://github.com/theislab/paga/issues/1#issuecomment-404263982. I can easily upload these clustering results to https://github.com/theislab/scanpy_usage/tree/master/170522_visualizing_one_million_cells as well, that's no problem at all. There was one issue with it, right? The vector was shifted by one cell or something? Can I change something so that the problem you had at the time doesn't come up again?
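A minimal sketch of how the snippet above could be adapted for this (assuming a reasonably recent scanpy; the explicit n_comps and random_state only pin the remaining stochastic steps and match the current defaults):

import scanpy.api as sc
sc.settings.verbosity = 2
adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
# deterministic PCA instead of the randomized fallback inside sc.pp.neighbors
sc.pp.pca(adata, n_comps=50, svd_solver='arpack')
sc.pp.neighbors(adata)  # uses the precomputed X_pca
sc.tl.louvain(adata, random_state=0)
adata.obs['louvain'].to_csv('clustering-scanpy-arpack.csv')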

Sorry about the whole issue!

dkobak commented 5 years ago

Starting from the end: I think it would be great if you could upload the clustering results from the Scanpy paper / PAGA preprint to the scanpy GitHub repo. I still have the Dropbox link, of course, but in the long run it's better if that file is hosted there and linked from the https://github.com/theislab/scanpy_usage/tree/master/170522_visualizing_one_million_cells page. The issue with one missing cell arose because I did not specify header=None when loading the file with pandas :) So my error, not yours. The file is correct as is.

That said, I am worried about the influence the random seed in the randomized PCA seems to have in this case. Let me show you how it looks:

[image: mln-tsne-clustering-comparison]

I would be fine with some cells ending up in other clusters depending on the random seed, and it would even be okay if small clusters changed their identities, but what we see here is a very drastic change of the overall cluster structure. Are you sure that the only difference is the randomized PCA outcome? Could it be that some of the default parameters in sc.pp.recipe_zheng17, sc.pp.neighbors, or sc.tl.louvain have changed since you ran the clustering? The scanpy code I posted above is the full code I used, and I ran it yesterday after updating scanpy via pip.
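For what it's worth, a minimal sketch of how I would put a number on the difference between the two labelings, using the adjusted Rand index (file names are hypothetical; both CSVs are assumed to map barcodes to cluster labels, as written by to_csv above):

import pandas as pd
from sklearn.metrics import adjusted_rand_score

# hypothetical file names; first column = barcode, second column = cluster label
old = pd.read_csv('clustering-paper.csv', header=None, index_col=0)[1]
new = pd.read_csv('clustering-scanpy.csv', header=None, index_col=0)[1]

# align on barcodes before comparing
common = old.index.intersection(new.index)
print('adjusted Rand index:', adjusted_rand_score(old.loc[common], new.loc[common]))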

BTW, the visualization above is taken from https://www.biorxiv.org/content/early/2018/10/25/453449, which we posted yesterday. Any comments are very welcome! I hope you don't mind being thanked in the acknowledgements!

falexwolf commented 5 years ago

Sorry for the slow response! Congrats on the very nice preprint and thanks for thanking me! :wink:

I'll upload the file from dropbox to scanpy_usage!

Regarding the result: the higher PCs can change drastically depending on the platform and the random seed. Since I became aware of it, I've seen clustering results change completely. I'm very sure that none of the preprocessing or other code changed at any point.

What do you think?

dkobak commented 5 years ago

Great. Please ping me here when you upload the file to scanpy_usage and feel free to close the issue then. I'll update my script to link directly to scanpy_usage.

Regarding the result: the higher PCs can change drastically depending on the platform and the random seed. Since I became aware of it, I've seen clustering results change completely.

I don't have much experience with randomized PCA, but this is very disturbing, no? Was your feeling that the PCs themselves changed strongly (as measured by, say, the % of total captured variance, or maybe the angle between subspaces, etc.), or is it rather that the clustering outcome is dangerously sensitive to small changes in the data? I think this is something worth investigating.
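For instance, a minimal sketch of the kind of check I have in mind (two randomized PCA runs that differ only in the seed, comparing the captured variance and the principal angles between the loading subspaces; the setup mirrors the snippet above):

import numpy as np
import scanpy.api as sc
from scipy.linalg import subspace_angles
from sklearn.decomposition import PCA

adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
X = adata.X  # dense cell-by-gene matrix after the recipe (it scales the data)

# two randomized PCA runs that differ only in the random seed
pca_a = PCA(n_components=50, svd_solver='randomized', random_state=0).fit(X)
pca_b = PCA(n_components=50, svd_solver='randomized', random_state=1).fit(X)

# fraction of total variance captured by each run
print(pca_a.explained_variance_ratio_.sum(), pca_b.explained_variance_ratio_.sum())
# largest principal angle (degrees) between the two 50-dimensional loading subspaces
print(np.degrees(subspace_angles(pca_a.components_.T, pca_b.components_.T)).max())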

LuckyMD commented 5 years ago

I have observed the latter. Converting between float64 and float32 changed my clustering quite a bit; the cluster boundaries fell into more densely sampled regions of transcriptome space. The PCs were similar to within a couple of decimal places, but the kNN graph connectivities changed more drastically.
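A minimal sketch of how one could isolate that effect (assuming adata is the preprocessed object; with the deterministic ARPACK solver, the dtype is the only thing that differs between the two runs; the .obsp access follows recent AnnData versions, older scanpy stored the connectivities under .uns['neighbors']):

import numpy as np
import scanpy as sc

def neighbors_for_dtype(adata, dtype):
    ad = adata.copy()
    ad.X = ad.X.astype(dtype)
    sc.pp.pca(ad, svd_solver='arpack')  # deterministic, so only the dtype differs
    sc.pp.neighbors(ad)
    return ad

adata = sc.read_10x_h5('1M_neurons_filtered_gene_bc_matrices_h5.h5')
sc.pp.recipe_zheng17(adata)
ad32 = neighbors_for_dtype(adata, np.float32)
ad64 = neighbors_for_dtype(adata, np.float64)

# largest absolute difference in the PCA coordinates vs. in the kNN connectivities
print(np.abs(ad32.obsm['X_pca'] - ad64.obsm['X_pca']).max())
print(abs(ad32.obsp['connectivities'] - ad64.obsp['connectivities']).max())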

dkobak commented 5 years ago

@LuckyMD Thanks. This agrees with what I suspected: that randomized PCA itself should be pretty stable, but downstream clustering procedures can be very unstable. I have little experience with clustering so I cannot really comment further, but this certainly should be a big red flag for taking a clustering result seriously. It's especially impressive that float32 vs float64 can cause such a difference.

Did you observe this influencing t-SNE/UMAP/etc equally drastically, or did it only affect clustering so strongly?

LuckyMD commented 5 years ago

The reason I actually noticed it in the first place was a difference in t-SNE and UMAP... the difference wasn't huge, but it was noticeable. Here's the issue where you can see the UMAP plots: #324. The clustering differences were actually rather small... the more worrying thing at the time was the UMAP, because it appeared to show a bridge between two clusters that I didn't really expect to be there and didn't see in the float64 data.

falexwolf commented 5 years ago

Great. Please ping me here when you upload the file to...

I uploaded the file.

falexwolf commented 5 years ago

This agrees with what I suspected: that randomized PCA itself should be pretty stable,

No, if you look into how the higher PCs vary, you see that they vary drastically depending on the seed or the computational platform. That also makes sense: the computation is done by a power method, which becomes unstable for the higher components.

PCs were similar to within a couple of decimal points,

I'm very sure that you only observed this for the first couple of PCs, which are robustly estimated. Going higher, you end up in some local minimum for the subspace; I believe that doesn't mean it fails to capture important variation, it just means it's a local minimum that the algorithm converges into... something we observe all the time when training models. In the context of Lanczos and other algorithms powering SVD, PCA, etc., it's a well-known nuisance that you get this instability when you go to higher components; but there, too, people just use the results even if they know they don't get the exact 50th eigenvector...
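A minimal sketch illustrating that point (a hypothetical helper: two randomized PCA runs that differ only in the seed, comparing matching loading vectors by their absolute cosine similarity; for the leading PCs this is typically close to 1, for the trailing ones it can drop off):

import numpy as np
from sklearn.decomposition import PCA

def component_agreement(X, n_comps=50, seed_a=0, seed_b=1):
    # two randomized runs that differ only in the random seed
    a = PCA(n_components=n_comps, svd_solver='randomized', random_state=seed_a).fit(X)
    b = PCA(n_components=n_comps, svd_solver='randomized', random_state=seed_b).fit(X)
    # components_ rows are unit-norm, so the row-wise dot product is the cosine;
    # sign flips are irrelevant, hence the absolute value
    return np.abs(np.sum(a.components_ * b.components_, axis=1))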

LuckyMD commented 5 years ago

True, I only checked the first couple of PCs. I mainly noticed that while the top PCs varied only in the 3rd or 4th decimal place, the connectivities varied in the first decimal place or even more. I can't make any statement about the later PCs.