pachterlab / kallistobustools

kallisto | bustools workflow for pre-processing single-cell RNA-seq data
https://kallistobus.tools/
MIT License

Verifying status of "uniqueness" of alignments for single cell analysis #54

Open sknaack opened 1 year ago

sknaack commented 1 year ago

I have a question regarding the "uniqueness" of pseudoalignments and how it is handled in Kallisto/Bustools. I prepared a mouse GRCm39 transcriptome reference from the ENSEMBL R109 mouse transcriptome and genome .fa files, together with the .gtf file, using the recommended kb ref usage of "ref -i index.idx -g t2g.txt -f1 transcriptome.fa". This produced a working index for GRCm39, which I have used successfully on a set of single-cell data. My questions are as follows:

  1. My results report very low percentages of uniquely pseudoaligned reads: only 12-33% per sample across 6 samples. How does Kallisto handle non-uniquely mapped reads? Are they simply excluded from the output count matrix? I'm concerned that a substantial amount of data is being thrown out because of this. I've copied an example run_info.json and inspect.json below.

    cat run_info.json
    {
        "n_targets": 219393688,
        "n_bootstraps": 0,
        "n_processed": 340480788,
        "n_pseudoaligned": 90125537,
        "n_unique": 46079947,
        "p_pseudoaligned": 26.5,
        "p_unique": 13.5,
        "kallisto_version": "0.48.0",
        "index_version": -1293124848,
        "start_time": "Sun Jul 2 21:08:02 2023",
        "call": "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/kb_python/bins/darwin/kallisto/kallisto bus -i KIndex.Standard.GRCm39 -o cDNA9337WT_fr_10xMulti -x 10XV3 -t 16 --fr-stranded cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R1_001.fastq.gz cDNA9337WT_CKDL230013957-1A_22572CLT3_S4_L002_R2_001.fastq.gz"
    }

    cat inspect.json
    {
        "numRecords": 37168977,
        "numReads": 92983952,
        "numBarcodes": 1937400,
        "medianReadsPerBarcode": 3.000000,
        "meanReadsPerBarcode": 47.994194,
        "numUMIs": 12884443,
        "numBarcodeUMIs": 33830912,
        "medianUMIsPerBarcode": 1.000000,
        "meanUMIsPerBarcode": 17.462017,
        "gtRecords": 11411219,
        "numBarcodesOnWhitelist": 469183,
        "percentageBarcodesOnWhitelist": 24.217147,
        "numReadsOnWhitelist": 85704414,
        "percentageReadsOnWhitelist": 92.171189
    }

  2. Is the "p_unique" value reported in run_info.json as concerning as I suspect, or should it not be over-interpreted? Are there alternative or additional options I can pass to any of the Kallisto/Bustools components to control how uniqueness of pseudoalignments is handled, or to generate alternative statistics that are better to use?

  3. A previous analysis I performed with bulk data produced ~90% unique pseudoalignments, but that was for a different genome with only gene-level annotations. Would it make sense to prepare an index at the gene level only? I'm mainly interested in tabulation by gene at this point.
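For reference, the percentages in the run_info.json from question 1 can be recomputed from the raw counts. The following is a minimal Python sketch, not part of kallisto or bustools; the helper `pct` and the variable names are hypothetical. The reported values are consistent with both "p_pseudoaligned" and "p_unique" being taken relative to "n_processed" (all reads), which is an inference from the numbers rather than something confirmed by the documentation. The last figure, unique reads as a share of pseudoaligned reads, is an extra ratio that does not appear in run_info.json.

```python
import json

# Counts copied from the run_info.json shown in question 1.
run_info = json.loads("""
{
  "n_processed": 340480788,
  "n_pseudoaligned": 90125537,
  "n_unique": 46079947
}
""")


def pct(numerator, denominator):
    """Return numerator/denominator as a percentage rounded to one decimal."""
    return round(100.0 * numerator / denominator, 1)


# Assumption: both reported percentages use n_processed as the denominator.
p_pseudoaligned = pct(run_info["n_pseudoaligned"], run_info["n_processed"])
p_unique = pct(run_info["n_unique"], run_info["n_processed"])

# Extra ratio (not in run_info.json): unique reads among pseudoaligned reads.
p_unique_of_pseudoaligned = pct(run_info["n_unique"], run_info["n_pseudoaligned"])

print("p_pseudoaligned:", p_pseudoaligned)
print("p_unique:", p_unique)
print("unique among pseudoaligned:", p_unique_of_pseudoaligned)
```

Under that assumption, the 13.5% unique figure is bounded above by the 26.5% pseudoalignment rate, so the two numbers are worth reading together: roughly half of the reads that pseudoaligned at all did so uniquely.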

Thank you in advance for any input and advice! I like Kallisto/Bustools a lot and am finding it easy to use, but I need to ensure I'm applying it intelligently to these single-cell data.

Sara Knaack