pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
656 stars 172 forks source link

Behavior of kallisto quant with multiple lanes #446

Closed rsiani closed 4 months ago

rsiani commented 4 months ago

Hi, I was looking for a clarification on kallisto quant. I have sample spanning over multiple lanes (L1, L2, L3, L4) each with its paired read (R1, R2).

From the manual and the terminal read-outs it seems clear that the files are expected as L1-R1 L1-R2 L2-R1 L2-R2 etc etc, however (making a mistake) I noticed I have noticed higher mapping rates when I erroneously gave L1 L2 (R1) L1 L2 (R2)

`(base) bash-4.4$ kallisto quant -i ~/0723_extras/LR140_31.idx -o test -t 2 --plaintext R10_S49 _L001_R1_001.fastq.gz R10_S49_L001_R2_001.fastq.gz

[quant] fragment length distribution will be estimated from the data [index] k-mer length: 31 [index] number of targets: 4,598 [index] number of k-mers: 4,381,241 [quant] running in paired-end mode [quant] will process pair 1: R10_S49_L001_R1_001.fastq.gz R10_S49_L001_R2_001.fastq.gz [progress] 1M reads processed (58.3% mapped)

(base) bash-4.4$ kallisto quant -i ~/0723_extras/LR140_31.idx -o test -t 2 --plaintext R10_S49 _L001_R1_001.fastq.gz R10_S49_L002_R1_001.fastq.gz

[quant] fragment length distribution will be estimated from the data [index] k-mer length: 31 [index] number of targets: 4,598 [index] number of k-mers: 4,381,241 [quant] running in paired-end mode [quant] will process pair 1: R10_S49_L001_R1_001.fastq.gz R10_S49_L002_R1_001.fastq.gz [progress] 2M reads processed (70.6% mapped) done [quant] processed 2,539,758 reads, 1,793,073 reads pseudoaligned [quant] estimated average fragment length: 434.1 [ em] quantifying the abundances ... done [ em] the Expectation-Maximization algorithm ran for 52 rounds`

I assume this is because it's trying to merge L1 L2 as if they were paired reads, but why would that lead to higher mapping? I expected the opposite would happen....

p.s. how low would you go with kmer-sizes for ~ 150 bp fragments? I have low mapping rates at 31 and tried till 17, but of course I have to balance sensitivity and specificity. Is 21 definetely too low or nothing necessarily wrong with that?

Yenaled commented 4 months ago

Do L001 and L002 have the same number of reads? Maybe after one file runs out of reads, the other file is treated as single-end reads (I haven't verified this behavior).

For k-mer sizes, I don't really have a good intuition for how short is ok. I always stick with 31, especially for 150 bp reads which are super long (so the chances of finding a 31-bp exact match between my reads and the index is extremely high if the read is "real"). I personally wouldn't want to keep a read if I can't even find one 31-bp exact match...

rsiani commented 4 months ago

Ah that might be the case then! The files are all different sizes more or less.

Yeah, I agree with you on principle, but I am trying to salvage a low-quality run of bacterial bulk RNA-seq and it seems I have to relax thresholds a bit. Anyhow, thanks!