Open tsofiya opened 11 months ago
bustools count, by default, collapses the UMIs. If you want to ignore the UMIs when counting reads, use bustools count with the --cm option.
Thank you. It looks quite good, though sometimes I get more UMIs then reads which is weird. I use: bustools count -o no_collapsing/cells_x_genes -e matrix.ec -t transcripts.txt --genecounts output.unfiltered.bus -g kallisto/busIndex/kalisto_t2g.txt -cm with and without the -cm and compare
That can definitely happen.
Let's say you have 3 reads with the exact same UMI.
When doing UMI collapsing, you'll get that UMI counted because it will all be collapsed to gene B. However, when you do --cm, none of those reads will be counted because they all map to multiple genes. By default, things that map to multiple genes are always discarded.
Hi,
Following this question, I want to ask how do you collapse UMI to generate cell x transcript equivalence class count table if pseudoaligned reads to transcripts? Do you still collapse umi on the gene level first and then count UMI in individual transcript, or you collapse UMI on each transcript independently.
Many thanks, Mengjun
UMIs are always collapsed at gene-level regardless. The final "collapsed" UMI should belong to a single gene (and the equivalence class would contain multiple transcripts associated with that gene).
For example, if you have: UMI sequence ATCG: tx 0, tx 1 UMI sequence ATCG: tx 1, tx 2
If tx 0 and tx1 belong to gene A while tx 2 belong to gene B, the equivalence class of the collapse UMI sequence will simply have one item in it: tx 1.
If you have: UMI sequence ATCG: tx 0, tx 1 UMI sequence ATCG: tx 2
Then we don't count that UMI because it maps to two different genes (i.e. the {tx0,tx1,tx2} equivalence class is NOT counted).
Many thanks! To confirm if I understood correctly:
In your first example: the two UMI sequences are two different reads, i.e. UMI sequence ATCG: tx 0, tx 1 (read1) UMI sequence ATCG: tx 1, tx 2 (read2)
While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 1)
If in the second example the two UMI sequences are from different reads e.g. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 2) After collapsing, both UMI should be kept: read1 UMI will have {tx1, tx0} associated with geneA while read2 UMI will have tx2 associated with geneB.
Is this correct? Thanks!
@MengjunWu:
While in the second example, it is a multimapping problem and the two UMI sequences are the same read just mapped to different loci, i.e. UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 1)
I don't thinks that necessarily true: Even in the case of
UMI sequence ATCG: tx 0, tx 1 (read 1) UMI sequence ATCG: tx 2 (read 2)
the data would still suggest that there was a molecule (identified by the UMI), that on the one hand contains a subsequence that's compatible with gene A only, but on the other hand also features a subsequence that is exclusively found in gene B. Excluding fusion transcripts, this is incompatible with either gene resulting in both reads getting dropped
@Yenaled: Please correct me if I am wrong.
@mschilli87 Thanks a lot for the alternative scenario! Got it and I think it makes sense.
Hi, I have toy sample with only one barcode, and I used kallisto bus to create gene matrix. I want to create a plot of reads vs UMIs for each gene. The out put is, obviously, already collapsed and so I basically get the number of UMIs I had. How can I get the number of reads? Thank you, Tsofiya