kaizhang / SnapATAC2

Single-cell epigenomics analysis tools
https://kzhang.org/SnapATAC2/

Runtime question - atlas-scale analysis #223

Open nleroy917 opened 5 months ago

nleroy917 commented 5 months ago

Hello! I love this package, so thank you so much for a Python-native solution to scATAC-seq analysis. It's been indispensable.

I had a question about runtime for the atlas-level dataset. I was reading through the preprint and noted this:

This atlas comprises 92 scATAC-seq samples, roughly 650,000 cells, over 23 billion raw reads, and has a total size of 1.6 terabytes. As shown in Fig. 1e, SnapATAC2 completed the entire process in 5.22 hours, whereas ArchR took 14.27 hours. At this scale, SnapATAC2 is almost three times faster than ArchR, decreasing computational costs by approximately 63.4%.

I wanted to actually use this data for some analysis, so I grabbed it from GEO, which was linked in the original paper for that dataset. I've also got the metadata.

Using the tutorial for analyzing this dataset, I developed this code:

import os
import multiprocessing as mp
from pathlib import Path
from glob import glob

import snapatac2 as snap

print(snap.__version__)  # confirm the installed version

data_dir = os.path.expandvars("$DATA/geniml/datasets/scatlas/fragments")
output_dir = os.path.expandvars("$DATA/geniml/datasets/scatlas/h5ad")

os.makedirs(output_dir, exist_ok=True)

fragment_files = glob(os.path.join(data_dir, "*.bed.gz"))

# One backed .h5ad output path per fragment file
outputs = [
    os.path.join(output_dir, Path(f).name.replace(".bed.gz", ".h5ad"))
    for f in fragment_files
]

# Import each sample into its own backed AnnData object
adatas = snap.pp.import_data(
    fragment_files, 
    file=outputs, 
    chrom_sizes=snap.genome.hg38,
    min_num_fragments=1000, 
)

# Per-sample preprocessing: tile matrix, feature selection, doublet removal
snap.pp.add_tile_matrix(adatas, bin_size=5000, n_jobs=mp.cpu_count() - 2)
snap.pp.select_features(adatas, n_jobs=mp.cpu_count() - 2)
snap.pp.scrublet(adatas, n_jobs=mp.cpu_count() - 2)
snap.pp.filter_doublets(adatas, n_jobs=mp.cpu_count() - 2)

# Combine the per-sample AnnData objects into a single AnnDataSet
adataset = snap.AnnDataSet(
    adatas=[(Path(f.filename).stem, f) for f in adatas],
    filename="data.h5ads"
)

# Prefix each barcode with its sample name so obs_names are unique across samples
adataset.obs_names = [s + '+' + b for s, b in zip(adataset.obs['sample'], adataset.obs_names)]

# Feature selection and spectral embedding on the combined dataset
snap.pp.select_features(adataset, n_features=250000)
snap.tl.spectral(adataset)

# Store tissue types in .obs
adataset.obs['tissue'] = [x.split(':')[0] for x in adataset.obs['sample']]
snap.pp.mnc_correct(adataset, batch="sample", groupby='tissue', key_added='X_spectral')

Everything runs nicely, but the snap.pp.scrublet step predicts it won't finish for another 20 hours, which is quite long. I was wondering if there was something I was doing wrong: how important is the n_jobs parameter? I'm working with 40 cores + 64 GB of RAM.

Any help is appreciated! Thank you.

kaizhang commented 5 months ago

Firstly, the predicted total runtime may not be accurate, as there are many parallel jobs. Secondly, I do not recommend setting n_jobs too high: each job is created as an independent system-level thread, which is expensive, and inside each job we have implemented data-level parallelism, so the default value of n_jobs (8) may already max out the CPU. You can monitor CPU usage to confirm this. Thirdly, you can run scrublet on a single file to get an estimate of the overall runtime. If the runtime for a single file is too high (>10 min), then there may be a problem.
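
For example, a minimal sketch of that single-file timing test (reusing fragment_files and the thresholds from the snippet above, and leaving n_jobs at its default) might look like:

import time
import snapatac2 as snap

# Import just the first sample in-memory (no file= argument)
adata = snap.pp.import_data(
    fragment_files[0],
    chrom_sizes=snap.genome.hg38,
    min_num_fragments=1000,
)
snap.pp.add_tile_matrix(adata, bin_size=5000)
snap.pp.select_features(adata)

# Time scrublet with the default n_jobs to estimate the per-sample cost
start = time.time()
snap.pp.scrublet(adata)
print(f"scrublet on one sample took {time.time() - start:.1f} s")

Multiplying the per-sample time by the number of samples (and adjusting for parallel jobs) should indicate whether the 20-hour estimate is plausible.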

nleroy917 commented 5 months ago

@kaizhang Got it, thank you. I'll try some of these suggestions.

nleroy917 commented 5 months ago

@kaizhang Follow-up question to this... Once the AnnDataSet is created, does that represent a merged/unified AnnData object? That is, one matrix with one peak set representing all 650K cells?

I'm referring specifically to after batch correction:

# Store tissue types in .obs
adataset.obs['tissue'] = [x.split(':')[0] for x in adataset.obs['sample']]
snap.pp.mnc_correct(adataset, batch="sample", groupby='tissue', key_added='X_spectral')

Is it possible to export the final matrix as a single h5ad file?

kaizhang commented 5 months ago

Yes, you can convert an AnnDataSet to an AnnData using dataset.to_adata(). Read more here: https://kzhang.org/epigenomics-analysis/anndata.html. Note that to_adata() does not copy obsm, obs, etc. from the underlying AnnData objects. To do this, you need to manually copy them to the dataset first, for example: dataset.obsm['test'] = dataset.adatas.obsm['test'].
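
Putting this together, a minimal sketch (using the hypothetical 'test' obsm key from above; the output filename is illustrative, and to_adata() is assumed to return an in-memory anndata.AnnData per the linked docs):

# Copy per-sample annotations up to the dataset first, since
# to_adata() does not do this automatically
adataset.obsm['test'] = adataset.adatas.obsm['test']

# Merge everything into a single AnnData and write one h5ad file
adata = adataset.to_adata()
adata.write('atlas_merged.h5ad')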

nleroy917 commented 4 months ago

An update: bumping the RAM up to ~100 GB seemed to help things a lot. Not sure if it's a system-specific issue, but I was able to get through the tutorial without much issue, with the exception of batch correction.

This chunk gave me a bit of trouble:

# Store tissue types in .obs
adataset.obs['tissue'] = [x.split(':')[0] for x in adataset.obs['sample']]
snap.pp.mnc_correct(adataset, batch="sample", groupby='tissue', key_added='X_spectral')

It hangs indefinitely, and my job's allocation ran out before I could see how long it would take. What would it take to add a progress bar for this function?