pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License

Does kallisto bus discard reads with low-quality barcodes? #256

Open willtownes opened 4 years ago

willtownes commented 4 years ago

I am trying to re-process the data from Grun et al. 2016 (a CEL-Seq paper), and I am getting very low counts from kallisto and bustools, which are inconsistent with both the Bioconductor scRNASeq object and the conquer repository. I am wondering whether kallisto discards reads where the cell barcode is of low quality (i.e., if it contains a masked base "N"). After running fasterq-dump, I see that the first read begins with NAGTCTCG, which I think is the cell barcode (first 8 bases). My question is: will this entire read be discarded by kallisto bus (with -x CELSeq)? It appears so, since there are no barcodes containing "N" in the resulting busfile. If that is the case, do you recommend that I manually correct the barcodes to match the whitelist before running kallisto? For example, CAGTCTCG is on the whitelist and is edit distance 1 from NAGTCTCG. Here are the commands I'm running:

fasterq-dump SRR3472983 -O debug -t /tmp/scratch
cd debug
kallisto bus -x CELSeq -t 8 -i ~/sc-rna-seq/resources/Homo_sapiens_GRCh38.idx -o kallisto SRR3472983_1.fastq SRR3472983_2.fastq

willtownes commented 4 years ago

I should note that kallisto does not throw any errors or warnings, and the output shows that it processed 13,004,694 reads but only 2,307,233 reads (about 17.7%) pseudoaligned.

chrarnold commented 3 years ago

I am running into the same issue. Can someone comment on how kallisto treats "N" in the barcode? In my case, the "N" is in either the first or second base, and I am not sure what the best strategy is to proceed. The problem is kallisto index: it replaces any "N" characters with pseudorandom bases, so real "N"s in the barcode sequence do not match anything when using kallisto bus with a list of FASTA sequences (compiled manually or through KITE, for example) that may contain N characters (i.e., the barcodes with 0 or 1 mismatch).
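
For illustration, a list of barcodes with 0 or 1 mismatch can be written out using only A/C/G/T, so that no "N" ever reaches kallisto index. This is a minimal sketch, not part of kallisto or KITE: the function names `mismatch_variants` and `to_fasta` and the record naming scheme are hypothetical, and it assumes substitutions only (no insertions or deletions).

```python
from typing import Dict

BASES = "ACGT"


def mismatch_variants(barcode: str) -> Dict[str, str]:
    """Map record names to the barcode itself and every single-base substitution.

    The naming scheme (<barcode>_exact, <barcode>_pos<i><base>) is a made-up
    convention for this sketch.
    """
    records = {f"{barcode}_exact": barcode}
    for i, original in enumerate(barcode):
        for base in BASES:
            if base != original:
                variant = barcode[:i] + base + barcode[i + 1:]
                records[f"{barcode}_pos{i + 1}{base}"] = variant
    return records


def to_fasta(records: Dict[str, str]) -> str:
    """Render the records as FASTA text, one sequence per record."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())


variants = mismatch_variants("CAGTCTCG")
fasta_text = to_fasta(variants)
```

An 8-base barcode yields 25 records (the exact sequence plus 3 substitutions at each of 8 positions). Variants of two different barcodes can coincide when the barcodes are within Hamming distance 2 of each other, so such collisions would need to be resolved or dropped before indexing.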

sbooeshaghi commented 3 years ago

@willtownes the barcodes containing "N"s are not discarded; "N"s in the barcode are replaced with a pseudorandom nucleotide. I have not delved into CEL-Seq data and am not familiar with the specifics of the technology and the read structure. Usually a whitelist helps with correcting barcodes, but if a whitelist does not exist, one can use bustools whitelist to infer one. A low pseudoalignment rate is the result of the biological reads not mapping. What does the read quality look like?

@chrarnold If you are trying to make an index of the barcodes so that you can use kite, then you could in principle index all barcodes with a k-mer length of 7 via the kite method. This could possibly help with "quantification" of barcodes that contain "N" nucleotides.
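
The manual correction asked about in the original question (mapping NAGTCTCG to the whitelisted CAGTCTCG) could be sketched roughly as follows. This is an illustration only and not how kallisto or bustools implement correction. The helper name `correct_barcode` is hypothetical, the second whitelist entry is made up, and the sketch assumes substitutions only and a correction only when exactly one whitelist barcode lies within Hamming distance 1.

```python
from typing import Optional, Set

BASES = "ACGT"


def correct_barcode(barcode: str, whitelist: Set[str]) -> Optional[str]:
    """Return the unique whitelist barcode within Hamming distance 1, else None."""
    if barcode in whitelist:
        return barcode
    candidates = set()
    for i, original in enumerate(barcode):
        for base in BASES:
            if base == original:
                continue
            candidate = barcode[:i] + base + barcode[i + 1:]
            if candidate in whitelist:
                candidates.add(candidate)
    # Refuse to guess when zero or several whitelist barcodes are equally close.
    return candidates.pop() if len(candidates) == 1 else None


# CAGTCTCG comes from the question above; GATTACAG is a made-up entry.
whitelist = {"CAGTCTCG", "GATTACAG"}
print(correct_barcode("NAGTCTCG", whitelist))  # CAGTCTCG
```

Returning None for ambiguous or uncorrectable barcodes leaves it to the caller to decide whether to drop the read or keep it unchanged.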