scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Problem with reproducibility of UMAP / leiden #1009

Open · alexmascension opened this issue 4 years ago

alexmascension commented 4 years ago

Hi,

I am working on a project with a labmate and we are using the same dataset. We have found that, when running the same pipeline (neighbors / bbknn + UMAP + leiden) on the same adata, even with the same seed, the clustering solution and the UMAP are considerably different. This renders the analysis unreproducible and makes the downstream analysis far more difficult, since I have to map my clustering solutions and UMAP plots to hers using markers, which is quite impractical.

We have the same versions of scanpy, leiden, umap, and bbknn on the two computers.

To try to reproduce the issue, we created a random matrix with a fixed seed (10) and built one AnnData processed with sc.pp.neighbors and another processed with bbknn. We gave the adatas two batches so that we can use bbknn.

import numpy as np
import scanpy as sc
import scanpy.external as sce

seed = 10
np.random.seed(seed)
a = np.random.rand(100, 100)  # batch 'a'
b = np.random.rand(100, 100)  # batch 'b'
print(np.sum(a), np.sum(b))  # sanity check: these sums should match across machines

# Pipeline 1: PCA + batch-aware neighbors (bbknn) + UMAP + leiden
adata = sc.AnnData.concatenate(sc.AnnData(X=a), sc.AnnData(X=b), batch_categories=['a', 'b'])
sc.tl.pca(adata)
sce.pp.bbknn(adata, metric='angular')
sc.tl.umap(adata, random_state=seed)
sc.tl.leiden(adata, resolution=0.5, random_state=seed)
sc.pl.umap(adata, color=['batch', 'leiden'], alpha=0.3)
print(adata.uns['neighbors']['connectivities'].sum())

# Pipeline 2: standard neighbors (sc.pp.neighbors) + UMAP + leiden
adata_neigh = adata.copy()
sc.pp.neighbors(adata_neigh, metric='cosine', random_state=seed)
sc.tl.umap(adata_neigh, random_state=seed)
sc.tl.leiden(adata_neigh, resolution=0.6, random_state=seed)
sc.pl.umap(adata_neigh, color=['batch', 'leiden'], alpha=0.3)
print(adata_neigh.uns['neighbors']['connectivities'].sum())

Our matrices are the same on both machines (the sums are 4918.370372081173 and 5005.088472351332), so the random generation is reproducible, but the UMAPs and clustering solutions still differ.

For the adata run with sc.pp.neighbors (left: batches; right: leiden cluster labels):

Mine: [image]

Hers: [image]

For the adata run with sce.pp.bbknn:

Mine: [image]

Hers: [image]

Our PCA decompositions have the same coordinates, so we rule out the PCA as the source of variability. Also, since both the UMAP and leiden results look different, we think the source might be the neighbor calculation.

When running adata.uns['neighbors']['connectivities'].sum() I get 801.5580058219996 for adata and 1204.5274490986717 for adata_neigh. I don't have her values right now, but on the real dataset the values were on the order of 5000 and differed by less than 0.001; so we are confused that such a small difference in the sum can lead to such different results.
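In case it helps to localize the discrepancy, here is a minimal sketch of how we could compare the graphs elementwise rather than by their sums, assuming each of us exports the connectivities with scipy (the file names are hypothetical):

import scipy.sparse as spr

# Hypothetical files: each of us saves the graph beforehand with
# spr.save_npz('connectivities_mine.npz', adata.uns['neighbors']['connectivities'])
mine = spr.load_npz('connectivities_mine.npz')
hers = spr.load_npz('connectivities_hers.npz')

# Largest elementwise discrepancy between the two graphs
print(abs(mine - hers).max())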

I attach the adatas so you can inspect them if you need more info.

adatas.zip

Thanks for the help!

atarashansky commented 4 years ago

I ran your code snippet multiple times on my end and I got the same results each time. Is this true for you as well? If both you and your partner can generate the same results consistently each time, then it is strange that your results disagree with each other...

Some followup questions I have:

1. Are ALL packages the same version? (Packages like Numba, scipy, and sklearn should also be the same version, to rule them out as a potential source of variability.)
2. Are you both using the same operating system?
3. Can you run UMAP directly on the randomly generated matrix to see if your embeddings are the same? If they are, UMAP is likely not at fault (see the sketch after this list).
4. If you perturb your nearest-neighbor matrix by adding noise to the edges such that the total edge weight differs by ~0.001 between perturbations, can you recreate the big differences in the UMAP projection? Small differences in the edge weights of the nearest-neighbor graph CAN lead to huge differences in the UMAP projection if the graph has no inherent structure (which should be the case for randomly generated data).
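For point 3, a minimal sketch of what I mean, using umap-learn directly on the raw matrix (the seed and shape just mirror your snippet above):

import numpy as np
import umap  # umap-learn, the library scanpy wraps

np.random.seed(10)
a = np.random.rand(100, 100)

# Embed the raw matrix directly, bypassing scanpy's neighbor graph entirely
embedding = umap.UMAP(random_state=10).fit_transform(a)
print(embedding[:5])  # compare these coordinates between the two machines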

LuckyMD commented 4 years ago
> 4. Small differences in the edge weights of the nearest-neighbor graph CAN lead to huge differences in the UMAP projection if the graph has no inherent structure

Exactly this! We have also come across the difficulty of exactly reproducing the UMAP and clustering results in our single-cell tutorial/best practices; however, it was always due to the difficulty of defining boundaries in a continuous phenotype. Essentially, that means the biological interpretation should not rely on such a moving boundary anyway.

On another note, you may have more luck with reproducibility by setting PYTHONHASHSEED as well. Check out the discussion here: #313.
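For example, something like this sketch (the seed value "0" is arbitrary; note that the variable only takes effect at interpreter startup, so this works for scripts but not inside an already-running notebook):

import os
import sys

# PYTHONHASHSEED is read once, at interpreter startup, so setting it inside a
# running session has no effect; re-exec the process if it isn't set yet.
if os.environ.get("PYTHONHASHSEED") != "0":
    os.environ["PYTHONHASHSEED"] = "0"
    os.execv(sys.executable, [sys.executable] + sys.argv)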

alexmascension commented 4 years ago

Hi! I first tried the 4th point from @atarashansky and, indeed, multiplying the distances, the connectivities, or both by noise from a normal distribution with std 0.0001 (which amounts to roughly a 0.008% absolute difference in the matrix) changes the results.

Python code:

import numpy as np
import scipy.sparse as spr

# Perturb the connectivities with multiplicative Gaussian noise (std 0.0001)
c0 = adata.uns['neighbors']['connectivities']
connect = np.multiply(c0.todense(), np.random.normal(1, 0.0001, c0.shape))
adata.uns['neighbors']['connectivities'] = spr.csr_matrix(connect)
# Relative absolute difference introduced by the perturbation (in %)
print(100 * np.sum(np.abs(c0 - connect)) / min(np.sum(c0), np.sum(connect)))

# The same perturbation for the distances
d0 = adata.uns['neighbors']['distances']
dist = np.multiply(d0.todense(), np.random.normal(1, 0.0001, d0.shape))
adata.uns['neighbors']['distances'] = spr.csr_matrix(dist)
print(100 * np.sum(np.abs(d0 - dist)) / min(np.sum(d0), np.sum(dist)))

Before: [image]

After: [image]

The results are (thankfully) not drastically different, but there are minor rearrangements and changes in cluster assignment that could make the analysis come out differently depending on the computer.

I've also realized that the differences between machines are not associated with different random seeds: after setting the seed, the PCA coordinates look equal but differ by less than 0.001 in a pairwise comparison. The same happens with the neighbors graph, so I believe randomness is not really the issue. However, I am puzzled as to why there are such small differences in the PCA / neighbor graphs at all. Shouldn't two different computers perform the same calculations identically, at least to 3 or 4 decimals?
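For what it's worth, exact cross-machine agreement isn't guaranteed: floating-point addition is not associative, so summing in a different order (as different BLAS builds or thread counts do) can change the last digits. A toy illustration on a single machine:

import numpy as np

np.random.seed(0)
x = np.random.rand(1_000_000).astype(np.float32)

# Summing the same numbers in a different order can change the result slightly;
# this is why two machines with different BLAS / thread counts rarely agree exactly.
print(np.sum(x) - np.sum(x[::-1]))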