Open dylkot opened 4 years ago
Please check out this issue report here on the same topic: #313
Yes, I noticed #313 but missed #1009. #313 seems to be related to the score_genes_cell_cycle function and seems to be fixed by setting the random seed ahead of time (at least for some people). #1009 also doesn't seem to be related to the random seed but it seems that they ruled out PCA as the source of their discrepancy whereas mine seems to stem from the discrepant PCA. Should I merge this issue with one of those?
That issue report mentions setting the PYTHONHASHSEED
environment variable to 0
(next to all the seed setting) worked to create a fully reproducible workflow. If that doesn't work for you, it might be good to continue the discussion there.
I confirmed that setting the PYTHONHASHSEED environmental variable to 0 did not change the results. The code run below (in jupyter notebook) gave the same results as before while confirming that the PYTHONHASHSEED variable was set to 0 before running the pipeline.
# First run on a machine on with 8 CPUs
%env PYTHONHASHSEED=0
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
'./data/filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test8_randomized.h5ad', adata)
! echo $PYTHONHASHSEED
# Then run on a machine on with 16 CPUs
%env PYTHONHASHSEED=0
import numpy as np
import pandas as pd
import scanpy as sc
adata = sc.read_10x_mtx(
'./data/filtered_gene_bc_matrices/hg19/',
var_names='gene_symbols',
cache=True)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata = adata.copy()
sc.pp.scale(adata, max_value=10)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.tl.pca(adata, svd_solver='arpack', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16.h5ad', adata)
sc.tl.pca(adata, svd_solver='randomized', random_state=14)
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40, random_state=14)
sc.write('test16_randomized.h5ad', adata)
! echo $PYTHONHASHSEED
# Running on a machine with 16 CPUs, evaluate the differences between the results first from the arpack solver
adata8 = sc.read('test8.h5ad')
adata16 = sc.read('test16.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
# Running on a machine with 16 CPUs, evaluate the differences between the results first from the randomized solver
adata8 = sc.read('test8_randomized.h5ad')
adata16 = sc.read('test16_randomized.h5ad')
print((adata8.X != adata16.X).sum())
print((adata8.obsm['X_pca'] != adata16.obsm['X_pca']).sum())
print((adata8.uns['neighbors']['connectivities'] != adata16.uns['neighbors']['connectivities']).sum())
sc.tl.leiden(adata8, random_state=14)
sc.tl.leiden(adata16, random_state=14)
display(adata8.obs['leiden'].value_counts())
display(adata16.obs['leiden'].value_counts())
That is interesting... do you know where the randomness is coming in? I think sc.pp.highly_variable_genes()
can have some variability. These two VMs have the same operating system and hardware otherwise, right? I've had reproducibility issues moving between Fedora 25 and 28. In the end the libraries we use rely on underlying kernel numerics. There's a limit to how reproducible one can be. This only really becomes an issue if the biological interpretation is no longer consistent. Of course we'd like to be reproducible before then as well.
I am getting the same highly variable genes between the two runs. The discrepancy is introduced at the PCA step which generates slightly different results between the two runs.
The biological interpretation ends up essentially the same in my case but the clusterings are subtly different, making it hard to automate my annotation. I would like the overall pipeline to be reproducible across platforms if possible. I can dig a bit into the PCA code... it seems like this might be an issue on the scikit-learn end.
I recall that this issue occurred even in a docker container, so I'm not sure we can perfectly automate this. I'd love to know if you find out how this could be addressed.
Right, I think it makes sense that this would happen in a docker container based on what I'm seeing. I'll let you know if I find a solution!
@dylkot, a few questions:
I am actually booting using the exact same disks so identical OS (Ubuntu 16.04) and BLAS libraries. I am just loading them up with different virtual machines with different numbers of CPUs. In both cases the CPUs are Intel Xeon E5 v3 (Haswell).
Have not tried limiting the number of CPUs used by arpack. I didn't know that was something I could do! Do you have a tip on how to do so? I'll look this up and give it a shot.
IIRC, you can limit the number of CPUs used through blas. This works on my machine:
export OMP_NUM_THREADS=1
Different blas libraries use different environment variables for this, so I'd check to make sure it's actually restricting the number of threads used.
I am finding that my analysis is not perfectly reproducible across different computational platforms. I thought I was going crazy but I have since reproduced this finding using the minimal 3000 PBMC dataset clustering example. Essentially I run the same code either on a virtual machine with 8 CPUs or one with 16 CPUs and I get non-identical PCA results. It doesn't seem to matter if I use the arpack or the randomized solver even though using the randomized solver gives the warning:
Note that scikit-learn's randomized PCA might not be exactly reproducible across different computational platforms. For exact reproducibility, choose
svd_solver='arpack'.This will likely become the Scanpy default in the future.
I'd like to just attach the jupyter notebook but it won't seem to let me do that so I'm copying the code below. ...
This outputs the following
Versions:
scanpy==1.4.4.post1 anndata==0.7.1 umap==0.3.10 numpy==1.18.1 scipy==1.4.1 pandas==1.0.3 scikit-learn==0.22.2.post1 statsmodels==0.11.1 python-igraph==0.8.0 louvain==0.6.1