pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
649 stars 171 forks source link

UMI count per gene #408

Open tsofiya opened 11 months ago

tsofiya commented 11 months ago

Hi, I have toy sample with only one barcode, and I used kallisto bus to create gene matrix. I want to create a plot of reads vs UMIs for each gene. The out put is, obviously, already collapsed and so I basically get the number of UMIs I had. How can I get the number of reads? Thank you, Tsofiya

Yenaled commented 11 months ago

bustools count, by default, collapses the UMIs. If you want to ignore the UMIs when counting reads, use bustools count with the --cm option.

tsofiya commented 11 months ago

Thank you. It looks quite good, though sometimes I get more UMIs then reads which is weird. I use: bustools count -o no_collapsing/cells_x_genes -e matrix.ec -t transcripts.txt --genecounts output.unfiltered.bus -g kallisto/busIndex/kalisto_t2g.txt -cm with and without the -cm and compare

Yenaled commented 11 months ago

That can definitely happen.

Let's say you have 3 reads with the exact same UMI.

When doing UMI collapsing, you'll get that UMI counted because it will all be collapsed to gene B. However, when you do --cm, none of those reads will be counted because they all map to multiple genes. By default, things that map to multiple genes are always discarded.

MengjunWu commented 10 months ago

Hi,

Following this question, I want to ask how do you collapse UMI to generate cell x transcript equivalence class count table if pseudoaligned reads to transcripts? Do you still collapse umi on the gene level first and then count UMI in individual transcript, or you collapse UMI on each transcript independently.

Many thanks, Mengjun

Yenaled commented 10 months ago

UMIs are always collapsed at gene-level regardless. The final "collapsed" UMI should belong to a single gene (and the equivalence class would contain multiple transcripts associated with that gene).

For example, if you have: UMI sequence ATCG: tx 0, tx 1 UMI sequence ATCG: tx 1, tx 2

If tx 0 and tx1 belong to gene A while tx 2 belong to gene B, the equivalence class of the collapse UMI sequence will simply have one item in it: tx 1.

If you have: UMI sequence ATCG: tx 0, tx 1 UMI sequence ATCG: tx 2

Then we don't count that UMI because it maps to two different genes (i.e. the {tx0,tx1,tx2} equivalence class is NOT counted).

MengjunWu commented 10 months ago

Many thanks! To confirm if I understood correctly:

In your first example: the two UMI sequences are two different reads, i.e. UMI sequence ATCG: tx 0, tx 1 (read1) UMI sequence ATCG: tx 1, tx 2 (read2)

While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 1)

If in the second example the two UMI sequences are from different reads e.g. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 2) After collapsing, both UMI should be kept: read1 UMI will have {tx1, tx0} associated with geneA while read2 UMI will have tx2 associated with geneB.

Is this correct? Thanks!

mschilli87 commented 10 months ago

@MengjunWu:

While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 1)

I don't thinks that necessarily true: Even in the case of

UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 2)

the data would still suggest that there was a molecule (identified by the UMI), that on the one hand contains a subsequence that's compatible with gene A only, but on the other hand also features a subsequence that is exclusively found in gene B. Excluding fusion transcripts, this is incompatible with either gene resulting in both reads getting dropped

@Yenaled: Please correct me if I am wrong.

MengjunWu commented 10 months ago

@mschilli87 Thanks a lot for the alternative scenario! Got it and I think it makes sense.