pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
658 stars 172 forks source link

does kallisto require unsorted reads? #396

Open anoronh4 opened 1 year ago

anoronh4 commented 1 year ago

We have a situation in which many of our samples have UMI and we'd like to deduplicate before running Kallisto. The question is, does kallisto require unsorted reads? Our STAR-aligned bams are position sorted, so when we convert the bam to fastq the reads will also be sorted. Can these sorted reads serve as input as-is for kallisto?

Yenaled commented 1 year ago

kallisto doesn't require unsorted reads unless you're doing paired-end (R1+R2) alignment to get TPMs (in which case the first few thousand reads are used to estimate fragment length). In most cases, you can use your reads as-is.

But why not just deduplicate UMIs using kallisto bustools?

anoronh4 commented 1 year ago

We have UMI in bulk RNA-seq, not scRNAseq. i guess we went for vanilla kallisto first because bustools seemed to be specifically for scRNAseq with cell and molecular barcodes, but i think my colleague found your post on running it on bulk RNA with UMI: https://www.biostars.org/p/9554392/#9559079 . ours is a little different because we don't have an R3 fastq (just R1 and R2), and we have been using UMI tools to put the UMI in the header, which obviously won't work for kallisto-D and bustools. do you have any guidance on how to prepare our bulk reads for kallisto-D/bustools? (getting the R3 read at the demultiplexing step is currently not an option).

As for our current application, it probably does matter that it's position-sorted, because we are running kallisto quant on paired end data. if we were to run in paired end mode. i guess we can use -l and -s to bypass this method of estimation.

Yenaled commented 1 year ago

A couple of things:

  1. In the current stable version (0.50.0 -- I already merged kallisto-D into master), you can use bulk+UMIs with kallisto bus; you just have to supply the correct technology string to kallisto bus --paired -x; e.g. -1,0,0:0,0,10:0,10,0,1,0,0. -1,0,0 means ignore barcodes (every read gets the same arbitrary barcode), 0,0,10 means extract first 10 bp's in R1 as the UMI; 0,10,0,1,0,0 means extract position 10 onward in R1 as the first read in the pair (0,10,0)and extract everything in R2 as the second read in the pair (1,0,0).

  2. You can extract UMIs from the read headers directly (if the read headers are formatted as @readname RX:Z:TACGAGATCA), by formatting the technology string as -1,0,0:RX:0,10,0,1,0,0. (aka put RX into the UMI field of the technology string).

Then you can use quant-tcc (as I did in my biostars post) to get the abundance estimates (e.g. TPMs) like you would do in standard bulk RNA-seq.

These things are undocumented and I'm still writing documentation for these things.

anoronh4 commented 1 year ago

oh ok, so it looks like we don't need an R3 read file in that case, if i'm reading that right, which is great! i think we would prefer the first method for now since UMI-tools doesn't really follow that convention, for example:

@IID:212:FID:2:1101:1072:1000_TGCTTA 1:N:0:TAACCGGT+ATCGTCTC

where TGCTTA is the UMI. so if we went with the first option, is there also support for dual umis, for example if there's 3 bp in R1 and 3 bp in R2? In the example above, TGC comes from the beginning of R1 and TTA comes from the beginning of R2.

Yenaled commented 1 year ago

Yes, you would use this:

-1,0,0:0,0,3,1,0,3:0,3,0,1,3,0

Means you take the first 3-bp's of R1 and R2, and align your biological read from position 3 onward in R1 and R2.

anoronh4 commented 1 year ago

that's super helpful, thanks so much! i'm impressed at how flexible kallisto and its related tools are. it's going to save us a lot of storage and processing time, i think.