pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

How to get isoform counts for single cell RNA-seq data? #48

Open biopzhang opened 1 year ago

biopzhang commented 1 year ago

Great tool that integrates lots of functions!

I was wondering if there is a way to get the isoform counts. I was trying to get the isoform counts following your Nature paper (specifically https://github.com/pachterlab/BYVSTZP_2020).

You mentioned that for the 10xv3 data, "gene-count matrices were made by using the -genecounts flag and TCC matrices were made by omitting it". It works great for the gene-count part with the following command:

$ kb count --h5ad -i index.idx -g t2g.txt -x 10xv3 -o XXX -m 64G --workflow standard --filter bustools -t 32

I got the cells x genes matrix both in the mtx and h5ad format.

My question is how to get a cells x transcripts matrix. Simply adding "--tcc" to the above command does not seem to work: I get a cells x tcc mtx, but not a cells x transcripts mtx. Moreover, I don't know how to apply or omit the "--genecounts" flag.

Thank you so much! P.

Yenaled commented 1 year ago

Currently, kb count only does transcript quantification for bulk/smart-seq data (where each sample or cell is in a separate FASTQ file).

For 10X type data, kb count stops at the cells x tcc mtx. However, you can run "kallisto quant-tcc" on the cells x tcc mtx to try to get transcript quantification.

biopzhang commented 1 year ago

Thank you for your quick reply, Yenaled!

I was testing this on the forebrain glutamatergic neuronal lineage data from the KBtools tutorial. The kb count tcc matrix (394,494 x 6,238,208) is huge for the kallisto quant-tcc step: it has been running for 12 hours on an HPC cluster node (64 cores, ~ TB memory) and is still going. I think I should probably keep only the cells retained in other studies, such as the RNA velocity study (only about 1800 cells are kept). Could you please comment on this?

Yenaled commented 1 year ago

Oh, with such a large matrix, it's computationally intractable. You will definitely need to filter cells.
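For illustration, here is a minimal Python sketch of one way to do that filtering using only the standard library: it keeps the rows (cells) of a MatrixMarket coordinate matrix whose barcodes are on a keep-list and renumbers them. The function name `subset_cells` and the toy data are hypothetical and not part of kb or kallisto. It assumes rows are cells (as in a cells x tcc mtx) and that the barcode list is in row order.

```python
def subset_cells(mtx_lines, barcodes, keep):
    """Keep only the rows of a MatrixMarket coordinate matrix whose barcode is in `keep`.

    mtx_lines: lines of the .mtx file (rows = cells, columns = TCCs)
    barcodes:  barcodes in row order (barcodes[0] labels row 1)
    keep:      set of barcodes to retain
    Returns (new_mtx_lines, kept_barcodes).
    """
    lines = [l.rstrip("\n") for l in mtx_lines]
    header = [l for l in lines if l.startswith("%")]
    body = [l for l in lines if l.strip() and not l.startswith("%")]

    # Map each retained old row index (1-based) to its new row index (1-based).
    new_index = {}
    for old, bc in enumerate(barcodes, start=1):
        if bc in keep:
            new_index[old] = len(new_index) + 1
    kept_barcodes = [b for b in barcodes if b in keep]

    # First non-comment line is the size line: rows cols nonzeros.
    n_rows, n_cols, _ = body[0].split()
    assert int(n_rows) == len(barcodes), "barcode list must match matrix rows"

    entries = []
    for line in body[1:]:
        row, col, val = line.split()
        if int(row) in new_index:
            entries.append(f"{new_index[int(row)]} {col} {val}")

    size = f"{len(kept_barcodes)} {n_cols} {len(entries)}"
    return header + [size] + entries, kept_barcodes


# Toy 3-cell x 4-TCC matrix (made-up values, hypothetical barcodes).
toy_mtx = [
    "%%MatrixMarket matrix coordinate real general",
    "3 4 4",
    "1 1 2",
    "2 3 5",
    "3 1 1",
    "3 4 7",
]
toy_barcodes = ["AAAC", "AAAG", "AAAT"]
filtered, kept = subset_cells(toy_mtx, toy_barcodes, keep={"AAAC", "AAAT"})
print("\n".join(filtered))
```

The columns are left untouched so that the equivalence-class indices still line up with whatever file maps them to transcripts.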

The EM algorithm (which gives you transcript counts) in quant-tcc only takes a few seconds to run, but if you multiply a few seconds by hundreds of thousands of cells, well, you do the math of how long it'll take to run.
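As a rough back-of-the-envelope version of that math, sketched in Python: the 3 seconds per cell is an assumed stand-in for "a few seconds", and the helper `em_runtime_hours` is hypothetical. The `n_threads` parameter assumes the work divides evenly across threads, which is an idealization and not something stated in this thread.

```python
def em_runtime_hours(n_cells, seconds_per_cell, n_threads=1):
    """Rough wall-clock estimate in hours, assuming each cell is processed
    independently and the work splits evenly across threads."""
    return n_cells * seconds_per_cell / n_threads / 3600


SECONDS_PER_CELL = 3  # assumption: stand-in for "a few seconds"

full = em_runtime_hours(394_494, SECONDS_PER_CELL)    # every barcode in the matrix
filtered = em_runtime_hours(1_800, SECONDS_PER_CELL)  # after filtering to ~1800 cells

print(f"all barcodes: {full:.0f} h (~{full / 24:.0f} days)")  # all barcodes: 329 h (~14 days)
print(f"1,800 cells:  {filtered:.1f} h")                      # 1,800 cells:  1.5 h
```

Under these assumptions, filtering to about 1800 cells turns a roughly two-week single-threaded job into one of about an hour and a half.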