kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/

Different multithreading strategies behind snapatac2 backed and memory mode? #90

Closed wangmeijiao closed 1 year ago

wangmeijiao commented 1 year ago

Hi Kai,

Here is my guess. I understand snapatac2 greatly improves speed and memory efficiency. For speed, my guesses are the Rust rewrite, OpenBLAS multithreading, and polars instead of pandas (forgive me if I misunderstand); for low memory usage, the on-disk h5py file format is used. These are cool new techniques, and they have indeed greatly improved both speed and memory usage (at least for me).

Here is a quick question: what are the actual multithreading strategies behind snapatac2 2.2.0 (GitHub version), and what is the difference when running with multiple threads in backed versus memory mode? I ask because there seem to be fewer error messages in memory mode; knowing this would certainly help with debugging if I must run in backed mode. By the way, I understand that setting os.environ['NUMEXPR_MAX_THREADS'], os.environ['NUMEXPR_NUM_THREADS'], os.environ['OPENBLAS_NUM_THREADS'], and os.environ['OMP_NUM_THREADS'] to low values helps, though unfortunately I do not know the reason. I also understand that setting ulimit -u some_big_value helps.
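For reference, a minimal sketch of these two workarounds together (the values are arbitrary examples, and the resource call is Linux-only):

```python
import os

# Thread caps must be set before numpy (or anything else that loads
# BLAS/OpenMP) is imported, because the runtimes read them once at load time.
os.environ['NUMEXPR_MAX_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
os.environ['OPENBLAS_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

# Python-side equivalent of `ulimit -u some_big_value`: raise the soft
# process/thread limit up to the hard limit (Linux only).
import resource
soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))

import numpy as np  # imported only after the limits are in place
```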

Best, Meijiao

kaizhang commented 1 year ago

Hi Meijiao, thanks for your questions! The multithreading strategies used in snapatac2 vary from function to function. Most functions will use all available CPUs by default, which can be a problem in certain computing environments. Could you please let me know which functions/steps are giving you memory or CPU issues? This will help me optimize the multithreading strategies.

wangmeijiao commented 1 year ago

Hi Kai,

In fact there were multithreading problems at many steps (for example, the spectral dimension reduction step and the UMAP embedding step), and many of the problems simply went away when I switched to memory mode (just comment out the file parameter when creating the snap object). Using all CPUs is too greedy on many practical multi-user Linux platforms. I also happened to find that if you set the four os.environ parameters before importing numpy and polars, multithreading seems to be invoked successfully, and I do not know why. I have attached workable iterative snapatac2 code from my environment (python 3.10, r4.1.3, etc.) for your and others' convenience.

Workable code:

```python
import os

# Set these BEFORE importing snapatac2/numpy/polars: the BLAS and OpenMP
# runtimes read them once, when the libraries are first loaded.
os.environ['NUMEXPR_MAX_THREADS'] = '1'
os.environ['NUMEXPR_NUM_THREADS'] = '1'
# solves the "BLAS : Program is Terminated. Because you tried to allocate
# too many memory regions." error
os.environ['OPENBLAS_NUM_THREADS'] = '1'  # important!!
os.environ['GOTO_NUM_THREADS'] = '1'
os.environ['OMP_NUM_THREADS'] = '1'

import snapatac2 as snap
snap.__version__  # 2.2.0 / 2.2.0.dev0 / 2.1.3

import scanpy as sc  # 1.8.2 (also tested with 1.6.1)
import pickle
import dill
import numpy as np   # 1.23.4
import polars as pl  # 0.14.27

pl.Config.set_fmt_str_lengths(22)

sc.logging.print_header()

# scanpy==1.8.2 anndata==0.8.0 umap==0.5.3 numpy==1.22.4 scipy==1.8.1
# pandas==1.3.5 scikit-learn==1.1.1 statsmodels==0.13.2 python-igraph==0.10.2
# louvain==0.7.1 leidenalg==0.7.0 pynndescent==0.5.7

####### check package versions
pkg_list = ['snapatac2', 'scanpy', 'scrublet', 'pandas', 'numpy',
            'numba', 'loompy', 'anndata', 'umap', 'polars']

def print_header(pkg_list):
    for pkg in pkg_list:
        try:
            imp = __import__(pkg)
            print(pkg + '==' + imp.__version__, end=", ")
        except (ImportError, AttributeError):
            pass

print_header(pkg_list)
# snapatac2==2.2.0, scanpy==1.8.2, pandas==1.3.5, numpy==1.22.4, numba==0.55.2,
#   loompy==3.0.7, anndata==0.8.0, umap==0.5.3, polars==0.14.12
# snapatac2==2.2.0.dev0, scanpy==1.8.2, pandas==1.3.5, numpy==1.22.4,
#   numba==0.55.2, loompy==3.0.7, anndata==0.8.0

## inspect each library's quality and plot
datalist = {}
h5ad_files = []
h5ad_objs = []

use_dims = 10
res = 0.9

# 'files' is assumed to be a list of (sample_name, fragments_file) tuples
#name, fl = files[3]  # ok
#name, fl = files[0]  # ok

for name, fl in files[1:]:
    print('sample: ' + name + " fragments: " + fl)
    h5ad_filename = name + ".multi.h5ad"
    data = snap.pp.import_data(fl,
                               genome=snap.genome.hg38,
                               gff_file="",  # path elided
                               #chrom_size="",
                               #min_tsse=7,
                               #min_num_fragments=1000,
                               #file=h5ad_filename,  # do NOT write to disk; keeping everything
                               #                     # in memory avoids the multithreading problems
                               sorted_by_barcode=False
                              )
    h5ad_files.append((name, h5ad_filename))

    ## look at the quality of all barcodes
    print('plotting tsse')
    snap.pl.tsse(data, interactive=False)

    snap.pp.filter_cells(data, min_counts=5000, min_tsse=5, max_counts=50000)

    snap.pl.tsse(data, interactive=False)

    ## prefer to add the sample prefix to barcodes within the loop
    print('modify cellid')
    data.obs['sample'] = pl.Series('sample', [name] * data.shape[0], dtype=pl.Utf8)
    cellid = np.array(data.obs['sample']) + ":" + np.array(data.obs_names)
    data.obs_names = cellid

    print('add tile')
    snap.pp.add_tile_matrix(data)

    print('select features (tiles)')
    #snap.pp.select_features(data)
    snap.pp.select_features(data, blacklist="your_black_list.bed")

    print('do scrublet')
    snap.pp.scrublet(data)
    snap.pp.call_doublets(data)

    #h5ad_filename = name + ".h5ad"
    #data.write(h5ad_filename)
    #h5ad_files.append((name, h5ad_filename))

    #snap.tl.spectral(data, sample_size=10000)  # memory usage is large even when sampled down
    print('do spectral dimension reduction')
    snap.tl.spectral(data)

    snap.pl.spectral_eigenvalues(data, interactive=False)

    print('do umap')
    snap.tl.umap(data, use_dims=use_dims)
    snap.pl.umap(data, color="sample", interactive=False, width=800)

    print('do leiden clustering')
    snap.pp.knn(data, use_rep="X_spectral", use_dims=use_dims)
    snap.tl.leiden(data, resolution=res)
    snap.pl.umap(data, color="leiden", interactive=False, width=500)

    ## add gmat
    gene_matrix = snap.pp.make_gene_matrix(data, snap.genome.hg38)  #, file="gene_matrix.h5ad")
    #gene_matrix

    # imputation (MAGIC, to perform imputation and data smoothing)
    #gene_matrix.close()
    #gene_matrix = sc.read("gene_matrix.h5ad")

    sc.pp.filter_genes(gene_matrix, min_cells=5)
    sc.pp.normalize_total(gene_matrix)
    sc.pp.log1p(gene_matrix)

    # pip install magic-impute  (use this)
    # tmtoolkit 0.11.2 requires pandas>=1.4.0, but you have pandas 1.3.5 which is incompatible.
    # Successfully installed Deprecated-1.2.13 graphtools-1.5.2 magic-impute-3.0.0
    #   pandas-1.3.5 pygsp-0.5.1 scprep-1.2.1 tasklogger-1.2.0
    sc.external.pp.magic(gene_matrix, solver="approximate")

    # copy the umap embedding
    gene_matrix.obsm["X_umap"] = data.obsm["X_umap"]

    marker_genes = ['your gene list']

    sc.pl.umap(gene_matrix, use_raw=False, color=["leiden"] + marker_genes)
    #sc.pl.violin(gene_matrix, marker_genes, use_raw=False, groupby='leiden')

    data.obsm['gmat'] = gene_matrix.X  # obsm expects a matrix, not an AnnData object

    ## peak calling at the cluster level, add pmat
    snap.tl.call_peaks(data, groupby="leiden")

    peak_mat = snap.pp.make_peak_matrix(data)  #, file="peak_matrix.h5ad")
    data.obsm['pmat'] = peak_mat.X

    ## keep the object in memory
    datalist[name] = data
    #h5ad_objs.append((name, data))

    print('write h5ad to disk')
    data.write(h5ad_filename)

    print('done for sample ' + name)
```
kaizhang commented 1 year ago

All these steps use numpy for matrix operations, and numpy by default uses all CPUs. You can set the environment variables to change this behavior, as you already did. In the future I will add an option to each function for setting the number of CPUs directly. Thanks!
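To add a bit of detail: these environment variables are read once, when the BLAS/OpenMP runtime is first loaded, which is why they only take effect if set before the first numpy import. If you need to cap the thread pools after numpy is already imported, the third-party threadpoolctl package (not something snapatac2 uses itself, just a common workaround) can do it at runtime:

```python
# pip install threadpoolctl
import numpy as np
from threadpoolctl import threadpool_limits

# Cap the BLAS/OpenMP thread pools to 4 for the enclosed calls only;
# any snapatac2 step that goes through numpy can be wrapped the same way.
with threadpool_limits(limits=4):
    u, s, vt = np.linalg.svd(np.random.rand(500, 500))
```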

wangmeijiao commented 1 year ago

Happy to hear that!

kaizhang commented 1 year ago

This has been improved in the latest development version.