scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 599 forks source link

Inconsistent results from arpack pca implementation using VMs with different numbers of cpus #1187

Open dylkot opened 4 years ago

dylkot commented 4 years ago

I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially I run the same code either on a virtual machine with 8 CPUs or one with 16 CPUs and I get non-identical PCA results. It doesn't seem to matter if I use the arpack or the randomized solver even though using the randomized solver gives the warning:

Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choosesvd_solver='arpack'.This will likely become the Scanpy default in the future.

I'd like to just attach the jupyter notebook but it won't seem to let me do that so I'm copying the code below. ...

# First run on a machine with 8 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)

# Then run on a machine with 16 CPUs
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)

# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

# Running on a machine with 16 CPUs, evaluate the differences between the results from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

This outputs the following

0
134513
37696
0    659
1    605
2    398
3    352
4    342
5    174
6    118
7     41
8     11
Name: leiden, dtype: int64
0    527
1    484
2    398
3    324
4    320
5    301
6    174
7    109
8     52
9     11
Name: leiden, dtype: int64

0
134127
37278
0    646
1    617
2    382
3    362
4    334
5    173
6    129
7     46
8     11
Name: leiden, dtype: int64
0    646
1    631
2    408
3    349
4    334
5    170
6    106
7     45
8     11
Name: leiden, dtype: int64
...

Versions:

scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1

LuckyMD commented 4 years ago

Please check out this issue report here on the same topic: #313

dylkot commented 4 years ago

Yes, I noticed #313 but missed #1009. #313 seems to be related to the score_genes_cell_cycle function and seems to be fixed by setting the random seed ahead of time (at least for some people). #1009 also doesn't seem to be related to the random seed but it seems that they ruled out PCA as the source of their discrepancy whereas mine seems to stem from the discrepant PCA. Should I merge this issue with one of those?

LuckyMD commented 4 years ago

That issue report mentions setting the PYTHONHASHSEED environment variable to 0 (next to all the seed setting) worked to create a fully reproducible workflow. If that doesn't work for you, it might be good to continue the discussion there.

dylkot commented 4 years ago

I confirmed that setting the PYTHONHASHSEED environmental variable to 0 did not change the results. The code run below (in jupyter notebook) gave the same results as before while confirming that the PYTHONHASHSEED variable was set to 0 before running the pipeline.

# First run on a machine on with 8 CPUs
%env PYTHONHASHSEED=0
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)
! echo $PYTHONHASHSEED

# Then run on a machine on with 16 CPUs
%env PYTHONHASHSEED=0
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
    './data/filtered_gene_bc_matrices/hg19/', 
    var_names='gene_symbols',
    cache=True) 

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)
! echo $PYTHONHASHSEED

# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())

# Running on a machine with 16 CPUs, evaluate the differences between the results first from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
LuckyMD commented 4 years ago

That is interesting... do you know where the randomness is coming in? I think sc.pp.highly_variable_genes() can have some variability. These two VMs have the same operating system and hardware otherwise, right? I've had reproducibility issues moving between Fedora 25 and 28. In the end the libraries we use rely on underlying kernel numerics. There's a limit to how reproducible one can be. This only really becomes an issue if the biological interpretation is no longer consistent. Of course we'd like to be reproducible before then as well.

dylkot commented 4 years ago

I am getting the same highly variable genes between the two runs. The discrepancy is introduced at the PCA step which generates slightly different results between the two runs.

The biological interpretation ends up essentially the same in my case but the clusterings are subtly different, making it hard to automate my annotation. I would like the overall pipeline to be reproducible across platforms if possible. I can dig a bit into the PCA code... it seems like this might be an issue on the scikit-learn end.

LuckyMD commented 4 years ago

I recall that this issue occurred even in a docker container, so I'm not sure we can perfectly automate this. I'd love to know if you find out how this could be addressed.

dylkot commented 4 years ago

Right, I think it makes sense that this would happen in a docker container based on what I'm seeing. I'll let you know if I find a solution!

ivirshup commented 4 years ago

@dylkot, a few questions:

dylkot commented 4 years ago

I am actually booting using the exact same disks so identical OS (Ubuntu 16.04) and BLAS libraries. I am just loading them up with different virtual machines with different numbers of CPUs. In both cases the CPUs are Intel Xeon E5 v3 (Haswell).

Have not tried limiting the number of CPUs used by arpack. I didn't know that was something I could do! Do you have a tip on how to do so? I'll look this up and give it a shot.

ivirshup commented 4 years ago

IIRC, you can limit the number of CPUs used through blas. This works on my machine:

export OMP_NUM_THREADS=1

Different blas libraries use different environment variables for this, so I'd check to make sure it's actually restricting the number of threads used.