pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Read-level pseudoalignment data that includes cellbarcode and UMI? #430

Closed bbimber closed 3 months ago

bbimber commented 3 months ago

Hello,

I'm very interested in using kallisto as the alignment engine for some non-standard analyses of 10x data. I'd like to use kallisto to perform the pseudoalignments against custom reference libraries, then apply custom logic to those results to control how we generate count data. From what I read, kallisto can produce a pseudoalignment BAM file, which is pretty close to what I need; however, this does not appear to retain the cell barcode/UMI. Is there an existing way to export read-level alignments (not necessarily as a BAM) that includes the barcodes?

If there isn't, I assume I could generate the barcode information myself and join on read name, but thought I would check first. Thanks in advance for any help.
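For reference, the do-it-yourself fallback (deriving barcodes separately and joining on read name) might look roughly like the sketch below. It assumes 10x v3 chemistry, where R1 starts with a 16 bp cell barcode followed by a 12 bp UMI. The function names and default offsets are illustrative and not part of kallisto.

```python
import gzip
from typing import Dict, Iterator, TextIO, Tuple


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def iter_fastq(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file (or a blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        # Read name: header without the leading '@', up to the first whitespace.
        yield header[1:].split()[0], seq


def barcodes_by_read_name(
    handle: TextIO, bc_len: int = 16, umi_len: int = 12
) -> Dict[str, Tuple[str, str]]:
    """Map read name -> (cell barcode, UMI), sliced from the start of R1.

    ASSUMPTION: barcode then UMI at the start of R1 (10x v3 layout: 16 + 12 bp).
    """
    return {
        name: (seq[:bc_len], seq[bc_len:bc_len + umi_len])
        for name, seq in iter_fastq(handle)
    }
```

The resulting dictionary can then be joined against any read-level output keyed by read name. For a full run it would probably be better to stream the records than to hold them all in memory.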

Yenaled commented 3 months ago

I believe kallisto version 0.48.0 can produce pseudobams with barcodes and UMIs included (it’s been a while and I forgot whether I implemented it).

The BAM options have been removed in the latest kallisto update because they’re mostly useful for genome browser visualization (alignment inspection is best done with other programs, such as a splice-aware genome aligner). Read-level information (including the compatible transcripts) can be preserved in BUS files with the --num option and then extracted with bustools, if you’re interested.
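A minimal sketch of that two-step workflow, written as a small Python helper that only builds the command lines. The file names are hypothetical placeholders, and `10xv3` is assumed as the technology string; adjust both to your data.

```python
import shlex
from typing import List, Sequence


def kallisto_bus_cmd(
    index: str, out_dir: str, technology: str, fastqs: Sequence[str]
) -> List[str]:
    """Build a `kallisto bus` call; --num stores the read number in the flag column."""
    return [
        "kallisto", "bus",
        "-i", index,        # kallisto index built from the custom reference
        "-o", out_dir,      # output directory (will contain output.bus)
        "-x", technology,   # single-cell technology, e.g. 10xv3 (assumed here)
        "--num",
        *fastqs,
    ]


def bustools_text_cmd(out_dir: str) -> List[str]:
    """Build a `bustools text` call: -p writes to stdout, -f includes the flag column."""
    return ["bustools", "text", "-p", "-f", f"{out_dir}/output.bus"]


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    fastqs = ["sample_R1.fastq.gz", "sample_R2.fastq.gz"]
    print(shlex.join(kallisto_bus_cmd("custom.idx", "bus_out", "10xv3", fastqs)))
    print(shlex.join(bustools_text_cmd("bus_out")))
```

The helper prints the commands instead of running them, so they can be checked before being passed to `subprocess.run` or pasted into a shell.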

bbimber commented 3 months ago

OK, I will give the --num option a look.

bbimber commented 3 months ago

@Yenaled: After reading I have two additional questions:

1) For kallisto bus, can you clarify what the docs intend for --num? It says "Output number of read in flag column (incompatible with --bam)" - does that mean the index of the read? Read name?

2) "kallisto quant" has the option to produce a pseudoalignment BAM, and it seems like this is one of the few tools that will do this. Nonetheless, for our application I don't actually care about the count quantification (we plan to run custom code to generate the count matrix from alignment data). Is there a more direct tool that just does pseudoalignment? There are posts about a 'kallisto pseudo' tool but this appears to have been removed some time ago.

I am running version 0.50.1

Yenaled commented 3 months ago

  1. It means the read number (zero-indexed; the first read has read number 0). You can run bustools text -pf on the resulting BUS file; the final column has the read number. The first and second columns have the barcodes and UMIs. The third column has the equivalence class (which corresponds to the line number, zero-indexed, in the output EC file). I’ll try to write up formal documentation for this at some point.
  2. kallisto bus should be able to produce BAM files in version 0.48.0. kallisto bus just does pseudoalignment (with the main pseudoalignment results stored in a BUS file).
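
The column layout described in point 1 can be read with a few lines of Python. This is only a sketch under stated assumptions: the fourth column is taken to be the BUS count field, and each line of the EC file is assumed to end with a comma-separated list of transcript indices. `BusRecord`, `parse_bus_text` and `load_ec_map` are illustrative names, not part of kallisto or bustools.

```python
from typing import Iterable, Iterator, List, NamedTuple


class BusRecord(NamedTuple):
    barcode: str
    umi: str
    ec: int           # equivalence class id
    count: int        # ASSUMPTION: fourth column is the BUS count
    read_number: int  # final column when kallisto bus was run with --num


def parse_bus_text(lines: Iterable[str]) -> Iterator[BusRecord]:
    """Parse lines produced by `bustools text -pf` into records."""
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        yield BusRecord(
            fields[0], fields[1], int(fields[2]), int(fields[3]), int(fields[-1])
        )


def load_ec_map(lines: Iterable[str]) -> List[List[int]]:
    """Return transcript indices per equivalence class.

    The list position is the zero-indexed line number of the EC file, matching
    the description above. ASSUMPTION: the last tab-separated field of each
    line is a comma-separated list of transcript indices.
    """
    ecs: List[List[int]] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ecs.append([int(t) for t in line.split("\t")[-1].split(",")])
    return ecs
```

With both loaded, `load_ec_map(...)[record.ec]` gives the transcripts compatible with a record, and `record.read_number` can be used to look the read up again, assuming it follows the order of the input FASTQ.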

bbimber commented 3 months ago

@Yenaled, thanks for the quick reply. To point 2 you wrote "kallisto bus should be able to produce BAM files in version 0.48.0". It does not accept a "--pseudobam" flag like quant. Is there a different option I should use to get it to export the BAM?

I am trying the option you suggested for --num, but unless there is a different way to export from BUS format (I was trying bustools text with -a), the BAM seems to contain far more read-level information than BUS, which seems to already be rolled up to UMI level.

bbimber commented 3 months ago

I misunderstood your comment above about versions. According to what you wrote here: https://github.com/pachterlab/kallisto/issues/432, only 0.48.0 supports pseudobams, rather than >0.48.0. I'm trying this using that version.