pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

Improving speed when running `kb count` #55

Open reetm09 opened 1 year ago

reetm09 commented 1 year ago

When you pass multiple FASTQ files to `kb count`, does it process them sequentially, or is there a way to parallelize it? For me, the first step, `kallisto bus`, takes the longest (loading the index and mapping). Is there a way to parallelize this step, or are there any other tips to improve speed?

Thank you!

Yenaled commented 1 year ago

It should parallelize automatically (rather than reading the files sequentially) if you enable many threads. That's one reason that splitting FASTQ files into multiple chunks enables faster processing.
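For anyone who wants to try the chunking idea, here is a minimal sketch. It assumes gzipped FASTQ input with standard 4-line records (no wrapped sequence lines); the function name `split_fastq`, the prefix, and the chunk size are hypothetical and not part of kb-python.

```shell
# Split a gzipped FASTQ into chunks holding a fixed number of reads.
# Usage: split_fastq INPUT.fastq.gz OUTPUT_PREFIX READS_PER_CHUNK
# Assumption: every read occupies exactly 4 lines.
split_fastq() {
  input="$1"
  prefix="$2"
  reads_per_chunk="$3"
  # 4 lines per read, so each chunk gets a multiple of 4 lines
  # and no record is cut in half.
  gzip -dc "$input" | split -l "$((reads_per_chunk * 4))" - "$prefix"
}
```

The chunks are written uncompressed as `PREFIXaa`, `PREFIXab`, and so on. For paired-end data, run it on the R1 and R2 files with the same `READS_PER_CHUNK` so that mates stay in matching chunks.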

kallisto should be pretty fast unless you're doing single-nucleus RNA-seq or RNA velocity. With enough threads, it will only take 1-3 seconds to process a million reads.

Also, make sure you're using the current version of kb-python (version 0.27.3), since speed improvements have been made.

Finally, please post issues on the kallisto or kb-python GitHub pages; I'm usually more responsive there.

reetm09 commented 1 year ago

Hi,

Thank you so much for your quick response! This is the command I'm running for RNA velocity analysis. It currently takes 30-40 minutes; each FASTQ file has 1000 reads, and the index file is ~40 GB. Additionally, each of the files here is 119 MB. Is this expected?

`kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample1 --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz`

Additionally, just to clarify once again: if I run the following command, should it already be parallelizing?

`kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz subSample2_R1.fastq.gz subSample2_R2.fastq.gz subSample3_R1.fastq.gz subSample3_R2.fastq.gz`

Or do I need to do anything additional to split the FASTQ files into multiple chunks? And would the output folder (subSample) here contain the combined .h5ad file?

Thanks so much for your help!

Yenaled commented 1 year ago

OK, yes, RNA velocity is just slow with kallisto. This will change in our forthcoming release of kb-python (version 0.28; currently on the devel branch), which will be released in the next week or so.

I don't think there's much you can do in terms of speed with the current version of kb-python.

And yes, the command you supplied will parallelize automatically (and the output will be no different from combining the subsamples into a single FASTQ file).
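A small sketch of assembling that multi-sample call, using only the flags and file names that appear in the command earlier in this thread. The helper name `build_kb_count_cmd` is hypothetical, and it assumes each sample follows the `<prefix>_R1.fastq.gz` / `<prefix>_R2.fastq.gz` naming used above. It only prints the command so it can be inspected before running.

```shell
# Print a kb count command covering several paired-end samples.
# Usage: build_kb_count_cmd OUT_DIR THREADS SAMPLE_PREFIX...
# Index and t2g file names are copied from the thread, not defaults of kb.
build_kb_count_cmd() {
  out_dir="$1"
  threads="$2"
  shift 2
  fastqs=""
  for sample in "$@"; do
    # Keep each sample's R1 immediately followed by its R2.
    fastqs="$fastqs ${sample}_R1.fastq.gz ${sample}_R2.fastq.gz"
  done
  printf '%s %s %s\n' \
    "kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno" \
    "-c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o $out_dir --filter bustools" \
    "-t $threads$fastqs"
}
```

For example, `build_kb_count_cmd subSample 20 subSample1 subSample2 subSample3` prints a single `kb count` invocation that lists all six FASTQ files after `-t 20`, matching the form of the command asked about above.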