pachterlab / kb_python

A wrapper for the kallisto | bustools workflow for single-cell RNA-seq pre-processing
https://www.kallistobus.tools/
BSD 2-Clause "Simplified" License

How to get isoform specific UMI count matrix #164

Closed by lzx325 2 years ago

lzx325 commented 2 years ago

Hi developers, by default kb count generates a gene-level UMI count matrix. Is it possible to get UMI counts for each specific isoform using kb? I would appreciate it if you could point me to a tutorial or some examples. Thank you!

Yenaled commented 2 years ago

For UMI-labeled data, we do not think it's possible because only the 5' end or 3' end is sequenced. You can get transcript compatibility counts by using the --tcc option though.

For smartseq3 and for smartseq2 data, running kb count with the --tcc option will actually get you transcript-level expression.
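
As a rough illustration of the invocation described above, here is a minimal Python sketch that assembles a kb count command with --tcc. The index path, transcript-to-gene map, technology string, output directory, and FASTQ names are all placeholders (assumptions, not taken from this thread), and the command is only printed, not executed.

```python
import shlex

# All paths and the technology string below are hypothetical placeholders.
cmd = [
    "kb", "count",
    "-i", "index.idx",   # kallisto index (placeholder path)
    "-g", "t2g.txt",     # transcript-to-gene map (placeholder path)
    "-x", "10xv3",       # technology string; adjust to your assay
    "-o", "kb_out",      # output directory (placeholder)
    "--tcc",             # request transcript compatibility counts
    "R1.fastq.gz", "R2.fastq.gz",
]

# Print the shell-quoted command for inspection.
print(shlex.join(cmd))
```

To actually run it, the list could be passed to subprocess.run(cmd, check=True) on a machine where kb is installed.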

lzx325 commented 2 years ago

Dear Yenaled, Thank you for the clarification. For 10x data, even though only the 3' end is sequenced, is it still possible to perform differential usage analysis of some 3' alternative splicing events using the transcript compatibility counts? Could you please point me to some resources or publications on this? Thanks!

Yenaled commented 2 years ago

Yes, it's possible and it's exactly what was done here:

https://pubmed.ncbi.nlm.nih.gov/30664774/

lzx325 commented 2 years ago

Dear Yenaled, As the transcript compatibility counts are difficult to integrate with other downstream analysis toolkits, do you think it is still OK to use some simple post-processing (e.g., uniformly distributing the UMI count of each equivalence class among its isoforms) to convert transcript compatibility counts to per-isoform UMI counts?
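
To make the proposal concrete, here is a minimal sketch of the uniform split for a single cell. The function name, the dict-based inputs, and the toy numbers are hypothetical and only illustrate the idea; they are not part of kb or bustools.

```python
from collections import defaultdict


def uniform_split(tcc_counts, ec_to_transcripts):
    """Split each equivalence class's UMI count evenly among its transcripts.

    tcc_counts: dict mapping equivalence class id -> UMI count (one cell).
    ec_to_transcripts: dict mapping equivalence class id -> list of transcripts.
    Both inputs are hypothetical in-memory structures for illustration.
    """
    per_transcript = defaultdict(float)
    for ec, count in tcc_counts.items():
        transcripts = ec_to_transcripts[ec]
        share = count / len(transcripts)  # equal share for every member
        for t in transcripts:
            per_transcript[t] += share
    return dict(per_transcript)


# Toy example: class 0 is unique to isoform A, class 1 is shared by A and B.
ec_to_transcripts = {0: ["A"], 1: ["A", "B"]}
tcc_counts = {0: 4, 1: 6}
uniform_result = uniform_split(tcc_counts, ec_to_transcripts)
```

In this toy case isoform A receives 4 + 3 = 7 UMIs and isoform B receives 3, regardless of how abundant each isoform really is.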

Yenaled commented 2 years ago

No, because of the nature of the reads, there's an identifiability problem which is why we don't support such practices.

That said, you can still try running the EM algorithm to distribute counts among isoforms (akin to what is done when running "kallisto quant" on bulk RNA-seq samples). I just don't recommend it for the reason above, and we do not currently endorse such practices. (The EM algorithm can be run by using "kallisto quant-tcc" on the cells_x_tcc.mtx file, i.e. the transcript compatibility count matrix produced with the --tcc option.)
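
For intuition only, a toy EM over equivalence class counts for one cell might look like the sketch below. This is not kallisto's implementation: it ignores details such as effective transcript lengths, and the function name, inputs, and numbers are hypothetical.

```python
def em_tcc(tcc_counts, ec_to_transcripts, n_iter=100):
    """Toy EM: distribute equivalence class counts among transcripts.

    tcc_counts: dict mapping equivalence class id -> count (one cell).
    ec_to_transcripts: dict mapping equivalence class id -> list of transcripts.
    Returns a dict of expected counts per transcript.
    """
    transcripts = sorted({t for ts in ec_to_transcripts.values() for t in ts})
    total = sum(tcc_counts.values())
    # Start from uniform relative abundances.
    alpha = {t: 1.0 / len(transcripts) for t in transcripts}
    assigned = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        assigned = {t: 0.0 for t in transcripts}
        # E-step: split each class's count in proportion to current abundances.
        for ec, count in tcc_counts.items():
            members = ec_to_transcripts[ec]
            denom = sum(alpha[t] for t in members)
            if denom == 0.0:
                continue  # no member has support; skip this class
            for t in members:
                assigned[t] += count * alpha[t] / denom
        # M-step: re-estimate abundances from the expected counts.
        alpha = {t: assigned[t] / total for t in transcripts}
    return assigned


# Toy example: classes 0 and 2 are unique to A and B; class 1 is shared.
toy_ecs = {0: ["A"], 1: ["A", "B"], 2: ["B"]}
toy_counts = {0: 4, 1: 6, 2: 2}
em_result = em_tcc(toy_counts, toy_ecs)
```

Here the shared class is divided according to the evidence from the unique classes, so A ends up with about 8 counts and B with about 4. If no class were unique to A or B, the split would simply stay at the starting values, which is one way to see the identifiability problem mentioned above.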

lzx325 commented 2 years ago

Dear Yenaled, I am using kallisto 0.46.1, but I cannot find a tcc option for kallisto quant that runs only the EM algorithm. Everything seems to start from FASTQ files. What am I missing here?

Yenaled commented 2 years ago

You need to upgrade to the latest version (0.48.0).

lzx325 commented 2 years ago

It is working. Thank you very much!