Open nleroy917 opened 5 months ago
Firstly, the predicted total run time may not be accurate since there are many parallel jobs. Secondly, I do not recommend setting `n_jobs` too high, as each job is created as an independent system-level thread, which is expensive. Inside each job we have implemented data-level parallelism, so the default value of `n_jobs` (8) may already max out the CPU; you can monitor CPU usage to confirm this. Thirdly, you can run scrublet on a single file to get an estimate of overall runtime. If the run time of a single file is too long (>10 min), then there may be a problem.
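The single-file timing check suggested above can be sketched with a small stdlib helper; the `snap.pp.scrublet` call and the fragment file are placeholders for your own data and are only shown in comments:

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and report wall-clock time; useful for estimating
    per-file scrublet runtime before committing to the full dataset."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{fn.__name__} took {elapsed:.1f} s")
    return result, elapsed

# Hypothetical usage on one sample (requires snapatac2, not run here):
# import snapatac2 as snap
# adata = snap.pp.import_data(one_fragment_file, ...)
# _, secs = time_call(snap.pp.scrublet, adata)
# if secs > 600:  # >10 min per file suggests a problem, per the reply above
#     print("single-file runtime is suspiciously high")
```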
@kaizhang got it. thank you. I'll try some of these suggestions
@kaizhang Follow-up question to this... Once the AnnDataSet is created, does that represent a merged/unified AnnData object? That is, one matrix with one peak set representing all 650K cells?
I'm referring specifically to after batch correction:
```python
# Store tissue types in .obs
adataset.obs['tissue'] = [x.split(':')[0] for x in adataset.obs['sample']]
snap.pp.mnc_correct(adataset, batch="sample", groupby='tissue', key_added='X_spectral')
```
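The list comprehension above assumes sample names of the form `tissue:rest`; a quick pure-Python check of that split, with made-up sample names:

```python
# Made-up sample names in the "tissue:sample_id" format the comprehension assumes
samples = ["lung:sample_01", "heart:sample_02", "lung:sample_03"]
tissues = [x.split(':')[0] for x in samples]
print(tissues)  # -> ['lung', 'heart', 'lung']
```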
Is it possible to export the final matrix as one singular h5ad
file?
Yes, you can convert AnnDataSet to AnnData using `dataset.to_adata()`. Read more here: https://kzhang.org/epigenomics-analysis/anndata.html. Note `to_adata()` does not copy `obsm`, `obs`, etc. from the underlying AnnData objects. To do this, you need to manually copy them to the dataset first, for example: `dataset.obsm['test'] = dataset.adatas.obsm['test']`.
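The copy-then-convert pattern can be sketched with dict stand-ins for the real SnapATAC2 objects; only the mapping behavior of `.obsm` is relied on, and `X_spectral` is just an example key:

```python
from types import SimpleNamespace

# Stand-ins for an AnnDataSet and its backing AnnData objects: .obsm acts
# like a mapping on both, which is all this copy pattern relies on.
adatas = SimpleNamespace(obsm={'X_spectral': [[0.1, 0.2], [0.3, 0.4]]})
dataset = SimpleNamespace(obsm={}, adatas=adatas)

# Manually copy the embedding onto the dataset so that a later
# to_adata() (not available on this mock) would carry it over.
dataset.obsm['X_spectral'] = dataset.adatas.obsm['X_spectral']
print('X_spectral' in dataset.obsm)  # -> True
```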
An update: bumping the RAM up to ~100G seemed to help things a lot. Not sure if it's a system-specific issue, but I was able to get through the tutorial without much trouble, with the exception of batch correction:
This chunk gave me a bit of trouble:
```python
# Store tissue types in .obs
adataset.obs['tissue'] = [x.split(':')[0] for x in adataset.obs['sample']]
snap.pp.mnc_correct(adataset, batch="sample", groupby='tissue', key_added='X_spectral')
```
It hangs indefinitely, and my job ran out before I was able to see how long it went. What would it take for there to be a progress bar for this function?
Hello! I love this package, so thank you so much for a python-native solution to scATAC analysis. It's been indispensable.
I had a question about runtime for the atlas-level dataset. I was reading through the preprint and noted this:
I wanted to actually use this data for some analysis, so I grabbed it from GEO, which was linked in the original paper for that dataset. I've also got the metadata.
Using the tutorial for analyzing this dataset, I developed this code:
Everything runs nicely, but the `snap.pp.scrublet` snippet predicts it won't finish for another 20 hours, which is quite long. I was wondering if there was something I was doing wrong - how important is the `n_jobs` param? I'm working with 40 cores + 64G RAM. Any help is appreciated! Thank you.
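One way to sanity-check `n_jobs` against the available cores, taking the maintainer's reply above at face value (8 is the default, and each job also parallelizes internally, so going far above that can oversubscribe the CPU); the `snap.pp.scrublet` call is shown only as a hypothetical comment:

```python
import os

n_cores = os.cpu_count() or 1  # e.g. 40 on the machine described above
# Cap n_jobs at the default of 8; each job already uses data-level
# parallelism internally, so more system-level threads may just add overhead.
n_jobs = min(8, n_cores)
print(n_jobs)

# Hypothetical call (requires snapatac2):
# snap.pp.scrublet(adataset, n_jobs=n_jobs)
```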