Naive collapsing of UMIs in SS3xpress data?

biobenkj commented 2 months ago

Describe the issue I'm curious as to the result of supp. figure 1.1 in "Modular, efficient and constant-memory single-cell RNA-seq preprocessing". In the presence of a sufficiently long UMI (e.g. >= 10 bp), is this suggesting that one can collapse UMIs "naively" in the sense of not having --umi-gene as part of the bustools count call within kb python? I recognize that the default assumes --umi-gene. I ask because I'm working through some ss3xpress data that has a 10 bp UMI and am wondering how best to collapse these.

Thank you!

What is the exact command that was run?

# within kb count
#count
bustools count \
-o counts_unfiltered/cells_x_tcc \
-g ens101_ercc92_t2g.txt \
-e matrix.ec \
-t transcripts.txt \
--multimapping \
output.s.bus

biobenkj commented 2 months ago

Nevermind - I misunderstood something and upon re-reading, it makes sense now what --umi-gene is achieving in bustools count versus if that option is not thrown. Apologies for the spam!

biobenkj commented 2 months ago

Okay - I actually need to reopen this issue. Intuitively, if I leave --umi-gene off I would expect to have greater numbers of transcripts since we are considering the cell barcode:umi:transcript triplet for UMI collapsing. The converse being true if --umi-gene is turned on. But what I observe is actually the opposite - with --umi-gene I have far greater numbers of genes detected and transcripts vs when the option is not given.

Code broken out instead of using the kb_python wrapper for testing

Without `--umi-gene`

#launch kallisto bus
kallisto bus \
-t 8 \
-i ref/ens101_ercc92.idx \
-o ${SAMPLE[i]}_kb_tcc \
-x 0,6,8:0,0,6:0,14,0 \
--fr-stranded \
${SAMPLE[i]}_cbc_trimmed_homoATCG.fixed.100k.fastq.gz

#launch bustools and quant-tcc
cd ${SAMPLE[i]}_kb_tcc

#sort
bustools sort -o output.s.bus output.bus

#inspect
bustools inspect -o inspect.json output.s.bus

# mkdir
mkdir counts_unfiltered

#count
bustools count \
-o counts_unfiltered/cells_x_tcc \
-g ref/ens101_ercc92_t2g.txt \
-e matrix.ec \
-t transcripts.txt \
--multimapping \
output.s.bus

kallisto quant-tcc \
-o quant_tcc \
-i ref/ens101_ercc92.idx \
-e counts_unfiltered/cells_x_tcc.ec.txt \
-t 8 \
-l 550 \
-s 300 \
-g ref/ens101_ercc92_t2g.txt \
-b 50 \
--matrix-to-files \
counts_unfiltered/cells_x_tcc.mtx

With `--umi-gene`

#launch kallisto bus
kallisto bus \
-t 8 \
-i ref/ens101_ercc92.idx \
-o ${SAMPLE[i]}_kb_tcc \
-x 0,6,8:0,0,6:0,14,0 \
--fr-stranded \
${SAMPLE[i]}_cbc_trimmed_homoATCG.fixed.100k.fastq.gz

#launch bustools and quant-tcc
cd ${SAMPLE[i]}_kb_tcc

#sort
bustools sort -o output.s.bus output.bus

#inspect
bustools inspect -o inspect.json output.s.bus

# mkdir
mkdir counts_unfiltered

#count
bustools count \
-o counts_unfiltered/cells_x_tcc \
-g ref/ens101_ercc92_t2g.txt \
-e matrix.ec \
-t transcripts.txt \
--multimapping \
--umi-gene \
output.s.bus

kallisto quant-tcc \
-o quant_tcc \
-i ref/ens101_ercc92.idx \
-e counts_unfiltered/cells_x_tcc.ec.txt \
-t 8 \
-l 550 \
-s 300 \
-g ref/ens101_ercc92_t2g.txt \
-b 50 \
--matrix-to-files \
counts_unfiltered/cells_x_tcc.mtx

Do you have any insight into what is happening here? I expect that it's independent of UMI length (given at least a length of 6, based on https://github.com/pachterlab/MBLGLMBHGP_2021/blob/master/Supplementary_Figure_12/supp_figure_12.ipynb)

Thank you!!

Yenaled commented 2 months ago

When --umi-gene is supplied, you will indeed have more genes detected.

Let's say, for a given cell, you have a 3-bp UMI sequence AAA. You encounter that "AAA" sequence twice. The first time you encounter it, the corresponding read maps to gene A. The second time you encounter it, the corresponding read maps to gene B. Naively, one could say "ok, let's simply discard AAA" (which is what happens when you don't supply --umi-gene). However, the --umi-gene algorithm will say: "OK, AAA is actually two distinct molecules that coincidentally got the same identifier attached: We should give gene A a +1 count and gene B a +1 count". Thus, --umi-gene will cause more genes to be detected.

The reason --umi-gene exists, is that, for example, if you have a 4-bp UMI, there are only 4^4 (256) possible UMI sequences. If a cell has 15000 reads, then we're already oversaturating the UMI combinatorial space and it will often be the case that distinct molecules got the same UMI sequence during the library prep.

biobenkj commented 2 months ago

I see - thank you! I guess I was under the impression that if AAA was encountered for transcript X and also encountered for transcript Y, that it would consider the CB:UMI:Transcript/ec triplet, independent of gene annotation. This was based on the supplemental figure 5 from the kallisto bustools paper. But upon rereading the figure legend, if I wanted that to be the case, I would instead want to turn on the multiplicity counting (--cm) instead of counting UMIs by gene in bustools count? Thank you again and I apologize for the naivety.

Yenaled commented 2 months ago

If AAA is encountered twice for two different transcripts (or even two identical transcripts) of the same gene, that gene would get a count of 1 because that’s how UMI collapsing works. If you want that gene to get a count of 2, then, yes, that’s what the —cm option is for (UMI collapsing is not performed).

biobenkj commented 2 months ago

Thank you for clarifying. I'm more interested in keeping things at the transcript level instead of collapsing to the gene level.

pachterlab / kb_python