reetm09 opened 1 year ago (status: Open)
It should automatically parallelize (rather than sequential reading) if you enable many threads -- that's one reason that splitting FASTQ files into multiple chunks enables faster processing.
kallisto should be pretty fast unless you're doing single nucleus rnaseq or rna velocity -- with enough threads, it will only take 1-3 seconds to process a million reads.
Also, make sure you're using the current version of kb-python (version 0.27.3) since speed improvements have been made.
Finally, post issues on the kallisto or the kb-python github page -- I'm usually more responsive on those pages.
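For anyone who wants to try the chunking idea mentioned above, here is a minimal sketch in Python. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name `split_fastq_records` is made up for illustration; it is not part of kallisto or kb-python.

```python
from itertools import islice


def split_fastq_records(lines, reads_per_chunk):
    """Yield lists of FASTQ lines, each holding at most `reads_per_chunk` reads.

    Assumption: every record is exactly four lines (header, sequence, '+', quality).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be at least 1")
    it = iter(lines)
    lines_per_chunk = 4 * reads_per_chunk
    while True:
        chunk = list(islice(it, lines_per_chunk))
        if not chunk:
            break
        if len(chunk) % 4 != 0:
            # A partial record means the input was truncated or is not 4-line FASTQ.
            raise ValueError("input ends with an incomplete FASTQ record")
        yield chunk


if __name__ == "__main__":
    # Five tiny fake reads, split into chunks of two reads each.
    demo = []
    for i in range(5):
        demo += [f"@read{i}\n", "ACGT\n", "+\n", "IIII\n"]
    for n, chunk in enumerate(split_fastq_records(demo, 2)):
        print(n, len(chunk) // 4, "reads")
```

If you split paired files this way, use the same `reads_per_chunk` for R1 and R2 so the read pairs stay in the same order across chunks.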
Hi,
Thank you so much for your quick response! This is the command I'm running for RNA velocity analysis. Currently it's taking 30-40 minutes; each of the FASTQ files contains 1000 reads, and the index file is ~40 GB. Additionally, each of the files here is 119 MB. Is this expected?
kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample1 --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz
Additionally, just to clarify once again, if I specify the following command, it should already be parallelizing?
kb count --h5ad -i index.idx -g t2g.tsv -x 10xv2 --workflow lamanno -c1 cdna.t2g.tsv -c2 introns.t2g.tsv -o subSample --filter bustools -t 20 subSample1_R1.fastq.gz subSample1_R2.fastq.gz subSample2_R1.fastq.gz subSample2_R2.fastq.gz subSample3_R1.fastq.gz subSample3_R2.fastq.gz
Or do I need to do anything additional to split the FASTQ files into multiple chunks? And would the output folder (subSample) here contain the combined .h5ad file?
Thanks so much for your help!
OK, yes, rna velocity is just slow with kallisto. This will change in our forthcoming release of kb-python (version 0.28; currently on devel branch), which will be released in the next week or so.
I don't think there's much you can do in terms of speed with the current version of kb-python.
And yes, it will be parallelizing automatically with the command you supplied (and the output will be no different than combining the subsamples into a single fastq file).
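If you have many subsamples, a small script can assemble that multi-file command for you. This is only a sketch: the helper name `build_kb_count_args` is hypothetical, the flags and file names are copied from the command posted earlier in this thread, and the script just prints the command rather than running kb.

```python
import shlex


def build_kb_count_args(prefixes, out_dir, threads=20):
    """Build a `kb count` argument list with one R1/R2 pair per sample prefix.

    Assumption: FASTQ files are named <prefix>_R1.fastq.gz and <prefix>_R2.fastq.gz,
    and the index and t2g file names match the command posted in this thread.
    """
    args = [
        "kb", "count", "--h5ad",
        "-i", "index.idx",
        "-g", "t2g.tsv",
        "-x", "10xv2",
        "--workflow", "lamanno",
        "-c1", "cdna.t2g.tsv",
        "-c2", "introns.t2g.tsv",
        "-o", out_dir,
        "--filter", "bustools",
        "-t", str(threads),
    ]
    # Keep each R1 immediately followed by its R2, in sample order.
    for prefix in prefixes:
        args += [f"{prefix}_R1.fastq.gz", f"{prefix}_R2.fastq.gz"]
    return args


if __name__ == "__main__":
    cmd = build_kb_count_args(["subSample1", "subSample2", "subSample3"], "subSample")
    # Print the command for inspection; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Printing first makes it easy to check that the file pairs are in the intended order before launching a long run.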
When you input multiple FASTQ files into the kb count function, does it process them sequentially or is there a way to parallelize it? Especially because for me, the first step "kallisto bus" takes the longest (when loading the index and mapping). Is there a way to parallelize this process or any other tips to improve speed? Thank you!