Closed wangmeijiao closed 1 year ago
Hi Meijiao, thanks for your questions! The multi-threading strategies used in snapatac2 vary from function to function. Most functions by default will use all available cpus, which may create a problem for certain computing environments. Would you please let me know for which functions/steps are you experiencing memory or cpu issues? This will help me optimize the multithreading strategies.
Hi Kai,
In fact there were multithreading problems from many steps (for example, spectral dimension reduction step, the umap embedding step) and many of the problem just gone if I turn to memory mode (just comment out the file parameter when creat the snap object). It is too greedy to use all cpus in many practical multi-user linux platform. I also happen to find that if you predefine os.environ four parameters before numpy and polars, multithreading seems to be able to sucessfully evoked and I do not know why. Here I attached a workable iterative snapatac2 code in my enviroment (python 3.10, r4.1.3 etc) here for your and others convience.
import snapatac2 as snap snap.version #2.2.0 #2.2.0.dev0 #2.1.3
import os os.environ['NUMEXPR_MAX_THREADS'] = '1' os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1' #important!!
os.environ['OMP_NUM_THREADS'] = '1'
import scanpy as sc #1.8.2
import pickle import dill
import numpy as np #1.23.4 import polars as pl #0.14.27
pl.Config.set_fmt_str_lengths(22)
#######check package version pkg_list = ['snapatac2','scanpy','scrublet','pandas','numpy','numba','loompy','anndata','umap','polars'] def print_header(pkg_list): for pkg in pkg_list: try: imp = import(pkg) print (pkg + '==' + imp.version, end = ", ") except (ImportError, AttributeError): pass
print_header(pkg_list)
snapatac2==2.2.0, scanpy==1.8.2, pandas==1.3.5, numpy==1.22.4, numba==0.55.2, loompy==3.0.7, anndata==0.8.0, umap==0.5.3, polars==0.14.12
datalist = {} h5ad_files = []
use_dims = 10 res = 0.9
for name, fl in files[1:]: print('sample: ' + name + " fragments: " + fl) h5ad_filename = name + ".multi.h5ad" data = snap.pp.import_data(fl, genome=snap.genome.hg38,
#chrom_size = "",
#min_tsse=7,
#min_num_fragments=1000,
#file=h5ad_filename, #not write to disk, do all in mem, or will cause multithreadding problem
sorted_by_barcode=False
)
h5ad_files.append((name, h5ad_filename))
##look at all barcode qulity
print('plotting tsse ')
snap.pl.tsse(data, interactive=False)
snap.pp.filter_cells(data, min_counts=5000, min_tsse=5, max_counts=50000)
snap.pl.tsse(data, interactive=False)
##prefer to add prefix to barcode within loop
print('modify cellid')
data.obs['sample'] = pl.Series('sample', [name] * data.shape[0] , dtype = 'string')
cellid = np.array(data.obs['sample']) + ":" + np.array(data.obs_names)
data.obs_names = cellid
print('add tile')
snap.pp.add_tile_matrix(data)
print('select feature (tile)')
#snap.pp.select_features(data)
snap.pp.select_features(data,blacklist="your_black_list.bed")
print('do scrublet')
snap.pp.scrublet(data)
snap.pp.call_doublets(data)
#h5ad_filename = name + ".h5ad"
#data.write(h5ad_filename)
#h5ad_files.append((name, h5ad_filename))
#snap.tl.spectral(data, sample_size = 10000) #mem big even sample down
print('do spectral dimension reduction')
snap.tl.spectral(data)
snap.pl.spectral_eigenvalues(data, interactive=False)
print('do umap')
snap.tl.umap(data, use_dims = use_dims)
snap.pl.umap(data, color="sample", interactive=False, width = 800)
print('do leiden clustering')
snap.pp.knn(data, use_rep="X_spectral", use_dims = use_dims)
snap.tl.leiden(data, resolution = res)
snap.pl.umap(data, color="leiden", interactive=False, width=500)
##add gmat
gene_matrix = snap.pp.make_gene_matrix(data, snap.genome.hg38)#, file="gene_matrix.h5ad")
#gene_matrix
#Imputation (MAGIC to perform imputation and data smoothing)
#gene_matrix.close()
#gene_matrix = sc.read("gene_matrix.h5ad")
sc.pp.filter_genes(gene_matrix, min_cells= 5)
sc.pp.normalize_total(gene_matrix)
sc.pp.log1p(gene_matrix)
#pip install magic-impute (use this)
#tmtoolkit 0.11.2 requires pandas>=1.4.0, but you have pandas 1.3.5 which is incompatible.
#Successfully installed Deprecated-1.2.13 graphtools-1.5.2 magic-impute-3.0.0 pandas-1.3.5 pygsp-0.5.1 scprep-1.2.1 tasklogger-1.2.0
sc.external.pp.magic(gene_matrix, solver="approximate")
# Copy umap embedding
gene_matrix.obsm["X_umap"] = data.obsm["X_umap"]
marker_genes = ['your gene list']
sc.pl.umap(gene_matrix, use_raw=False, color=["leiden"] + marker_genes)
#sc.pl.violin(gene_matrix, marker_genes, use_raw=False, groupby='leiden')
data.obsm['gmat'] = gene_matrix
##Peak calling at the cluster-level, add pmat
snap.tl.call_peaks(data, groupby="leiden")
peak_mat = snap.pp.make_peak_matrix(data)#, file="peak_matrix.h5ad")
data.obsm['pmat'] = peak_mat
##save to mem
datalist[name] = data
#h5ad_objs.append((name, data))
print('write to disk h5ad')
data.write(h5ad_filename)
print('done for sample' + name)
All these steps use numpy for matrix operations. Numpy by default use all CPUs. You can set the environment variables to change this behavior as you already did. I will incorporate the option for changing the number of CPUs directly to the function in the future. Thanks!
Happy to hear that!
This has been improved in the latest developmental version.
Hi Kai,
Here is my guess. I understand snapatac2 has great improvement on speed and memory efficiency. For speed, what I can guess are Rust rewritten code/openblas-multithreading/polars instead of pandas, (forgive me if I misunderstand); For low memory usage, the h5py local disk file format was used. These are pretty cool and new techniques and indeed great improve both speed and memory (at least for me).
Here is a quick question: what are the true multi-threading strategies behind snapatac 2.2.0 (github version) and what is the different if running with multiple threads in backed and memory mode? I asked because it seems to have less error message in memory mode. It will certainly be helpful for debugging if I must run on backed mode. By the way, I understand to set os.environ['NUMEXPR_MAX_THREADS'], os.environ['NUMEXPR_NUM_THREADS'], os.environ['OPENBLAS_NUM_THREADS'] os.environ['OMP_NUM_THREADS'] to low values will help, unfortunately I do not know the reason. I also understand to set ulimit -u some_big_value will help.
Best, Meijiao