pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

UMI count per transcript #418

Closed MengjunWu closed 10 months ago

MengjunWu commented 10 months ago

Hi,

I have a question about UMI reads that map to several transcripts of the same gene. For example, tx1 and tx2 belong to the same gene A, and there are two sets of UMI reads mapping to both tx1 and tx2, as follows:

UMI set 1: UMI-ACGA_read1 (tx1, tx2), UMI-ACGA_read2 (tx1, tx2), UMI-ACGA_read3 (tx1, tx2)

UMI set 2: UMI-AGAC_read1 (tx1, tx2), UMI-AGAC_read2 (tx1, tx2), UMI-AGAC_read3 (tx1, tx2)

As I understand it, kallisto collapses UMIs at the gene level, so after collapsing, gene A will have only two UMIs, UMI-ACGA and UMI-AGAC. To which transcript (tx1 or tx2) will the two collapsed UMIs then be assigned or counted?

Many thanks, Mengjun

Yenaled commented 10 months ago

Correct, gene A will get two counts. By default, there is no transcript-level counting so you simply get a cell-by-gene count matrix. Thus, the information about tx1/tx2 is irrelevant and not taken into account after gene A gets its two counts.
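As a rough illustration of the gene-level counting described above, here is a minimal Python sketch. It is not kallisto or bustools code: the `reads` list, the `tx2gene` map, and the `gene_umi_counts` helper are hypothetical, a single cell is assumed (barcodes are ignored), and UMIs whose reads point to more than one gene are simply skipped, which is a simplification.

```python
from collections import defaultdict

# Hypothetical toy data: (UMI, transcripts the read is compatible with).
reads = [
    ("ACGA", {"tx1", "tx2"}),
    ("ACGA", {"tx1", "tx2"}),
    ("ACGA", {"tx1", "tx2"}),
    ("AGAC", {"tx1", "tx2"}),
    ("AGAC", {"tx1", "tx2"}),
    ("AGAC", {"tx1", "tx2"}),
]
# Hypothetical transcript-to-gene map.
tx2gene = {"tx1": "geneA", "tx2": "geneA"}


def gene_umi_counts(reads, tx2gene):
    """Count distinct UMIs per gene (toy version, one cell)."""
    genes_per_umi = defaultdict(set)
    for umi, transcripts in reads:
        genes_per_umi[umi].update(tx2gene[t] for t in transcripts)
    counts = defaultdict(int)
    for umi, genes in genes_per_umi.items():
        # Simplification: only UMIs that resolve to exactly one gene are counted.
        if len(genes) == 1:
            counts[next(iter(genes))] += 1
    return dict(counts)


gene_counts = gene_umi_counts(reads, tx2gene)
print(gene_counts)  # {'geneA': 2}
```

Six reads collapse to two molecules for gene A, and the transcript labels play no role in the result.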

In order to get transcript-level estimates from kallisto, you have to count equivalence classes and then run an EM algorithm. This can be done in bustools (don't supply the --genecounts option, and then run kallisto quant-tcc on the resulting matrix).

MengjunWu commented 10 months ago

Many thanks for the reply! If I want transcript-level quantification, is the estimation performed after UMI deduplication? In that case, which UMI read exactly is kept for the subsequent estimation?

I also observed another situation in our data: tx1 and tx2 are from the same gene and do not overlap (this happens when tx1 and tx2 are small subregions of a gene that lie close together without overlapping). Among reads with the same UMI, some are assigned to tx1 and some to tx2, and each read is unambiguously assigned to only one transcript.

UMI-ACGA_read1 (tx1), UMI-ACGA_read2 (tx1), UMI-ACGA_read3 (tx1)

UMI-ACGA_read4 (tx2), UMI-ACGA_read5 (tx2), UMI-ACGA_read6 (tx2)

In this case, will the choice of which read to keep after UMI collapsing affect the subsequent quantification of tx1 and tx2? Does kallisto choose a random read to represent that UMI, with quant-tcc then based on this chosen read?

Yenaled commented 10 months ago

Hello again! In that case, that gene will get one count, since all reads with that UMI map to that gene (albeit to different transcripts of that gene). For transcript equivalence class counting, a new equivalence class {tx1,tx2} will form, which will get one count. UMI collapsing is always done at gene-level. (The concept of "reads" disappears after UMI deduplication, since now we're talking about UMIs [molecules], not individual reads.)
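To make the equivalence class step concrete, here is a toy Python sketch of how one UMI could be reduced to a single class. It is not the bustools implementation. The fallback to the union of transcripts within one gene follows the description above; trying the intersection of the per-read transcript sets first, and dropping UMIs that span several genes, are assumptions of this sketch. The names `reads`, `tx2gene`, and `ec_counts` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: one UMI, reads split between two non-overlapping transcripts.
reads = [
    ("ACGA", {"tx1"}),
    ("ACGA", {"tx1"}),
    ("ACGA", {"tx1"}),
    ("ACGA", {"tx2"}),
    ("ACGA", {"tx2"}),
    ("ACGA", {"tx2"}),
]
tx2gene = {"tx1": "geneA", "tx2": "geneA"}


def ec_counts(reads, tx2gene):
    """Assign each UMI to one equivalence class and count UMIs per class."""
    sets_per_umi = defaultdict(list)
    for umi, transcripts in reads:
        sets_per_umi[umi].append(frozenset(transcripts))
    counts = Counter()
    for umi, tx_sets in sets_per_umi.items():
        # Assumption: use transcripts shared by all reads of the UMI if any exist.
        ec = frozenset.intersection(*tx_sets)
        if not ec:
            # No shared transcript: fall back to the union, but only within one gene.
            union = frozenset.union(*tx_sets)
            if len({tx2gene[t] for t in union}) != 1:
                continue  # spans several genes: dropped in this toy version
            ec = union
        counts[ec] += 1
    return counts


ecs = ec_counts(reads, tx2gene)
print(ecs)
```

For the six reads above, the single UMI ends up as one count in the class {tx1, tx2}, regardless of how many reads supported each transcript.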

In the example above, you can imagine tx1 getting 0.5 and tx2 getting 0.5 as the transcript-level estimated count after running quant-tcc.
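The 0.5/0.5 split can be reproduced with a bare-bones EM over equivalence class counts. This is only a sketch of the general idea, not what quant-tcc actually runs: it assumes a uniform starting point and ignores transcript effective lengths, and `em_transcript_counts` is a hypothetical helper.

```python
def em_transcript_counts(ec_counts, n_iter=100):
    """Split equivalence class counts among transcripts with a simple EM."""
    transcripts = sorted({t for ec in ec_counts for t in ec})
    total = sum(ec_counts.values())
    # Start from uniform relative abundances (assumption of this sketch).
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    est = {}
    for _ in range(n_iter):
        # E-step: share each class count in proportion to current abundances.
        est = {t: 0.0 for t in transcripts}
        for ec, count in ec_counts.items():
            denom = sum(abundance[t] for t in ec)
            for t in ec:
                est[t] += count * abundance[t] / denom
        # M-step: renormalise estimated counts into abundances.
        abundance = {t: est[t] / total for t in transcripts}
    return est


# One UMI in the class {tx1, tx2}, as in the example above.
est = em_transcript_counts({frozenset({"tx1", "tx2"}): 1})
print(est)  # {'tx1': 0.5, 'tx2': 0.5}
```

With no other evidence, nothing favours either transcript, so the count is split evenly. If other UMIs mapped uniquely to tx1, the same loop would shift more of the shared count toward tx1.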

MengjunWu commented 10 months ago

Ah, now I get it! Thanks a lot for the explanation, it is crystal clear :)