pachterlab / kallisto

Near-optimal RNA-Seq quantification
https://pachterlab.github.io/kallisto
BSD 2-Clause "Simplified" License
649 stars 171 forks source link

Kallisto specificity #108

Open shrukane opened 8 years ago

shrukane commented 8 years ago

Hi, I have been using Kallisto-sleuth for my RNA seq analysis and I had a discrepancy in my analysis. I had mentioned it as an issue on Git hub #100(reverting back as I couldn't find an explanation). I have used a simulator called Polyester (https://github.com/alyssafrazee/polyester) to simulate RNA-seq reads. I found that although the sensitivity of Kallisto is quite high the specificity is low. I tried the latest version of Sleuth using the likelihood ratio test. The numbers are better than the Wald test but not as great as expected (false positives vs true positives). In comparison, tophat-cufflinks looks v good. would love to get your feedback. I do not mind sharing the preliminary results or ROC curves by mail. -Shruti

pimentel commented 8 years ago

Hello @shrukane

I am sorry for the delay. I noticed you sent me a private email a while back. I was hoping to have a paper to send to you, and that should realistically happen within a week or so.

A few questions:

My experience has been that polyester is realistically too slow to simulate from the entire transcriptome.

We have been doing extensive benchmarks in several different conditions and this is simply not what we see given the types of simulations we have done. In our simulations, we are simulating from a negative binomial the transcript level (similar to polyester) but using the RSEM simulator. We have simulated a number of different types of effect sizes across the entire transcriptome and when comparing the true false discovery rate versus the ranking by estimated fdr (qval) sleuth comes ahead in pretty much every type of simulation that we have thrown at it.

shrukane commented 8 years ago

@pimentel we simulated varying numbers (0.1M, 0.2M, 0.3M, 0.5M, 0.75M, 1M, 3M, 5M, 7M and 10M) of paired ­end 100bp Illumina ­like RNA­Seq reads with a fragment length distribution of 250±25 bases, each with 1.2­, 1.5­, 2­, 3­, 4­ and 5 ­fold differentially expressed genes between two conditions. For each combination, reads were generated for 2, 3, 4, 5 and "no" replicates. The transcriptome is the human chromosome 14 of the hg_19 build, and was downloaded with only one transcript/gene. We have differentially expressed 20% of the genes in the simulation. This is my simulation code - library(polyester) library(Biostrings) set.seed(142) # random number generator for reproducibility fasta_file = 'chr14_transcript.fa' fasta = readDNAStringSet(fasta_file) deInds = sample(1:length(fasta),size= 408, replace = TRUE) fold_changes = rep(1,length(fasta)) fold_changes[deInds] = rep(c(4,1/4), length(deInds)/2) outdir = '0.1M/' num_reps = 5 readspertx = round(100000 / width(fasta)) simulate_experiment(fasta_file, reads_per_transcript=readspertx, readlen= 100, fold_changes=fold_changes, num_reps= num_reps, outdir=outdir)

I hope this helps. -Shruti