Kallisto specificity - Githubissues

shrukane commented 8 years ago

Hi, I have been using Kallisto-sleuth for my RNA seq analysis and I had a discrepancy in my analysis. I had mentioned it as an issue on Git hub #100(reverting back as I couldn't find an explanation). I have used a simulator called Polyester (https://github.com/alyssafrazee/polyester) to simulate RNA-seq reads. I found that although the sensitivity of Kallisto is quite high the specificity is low. I tried the latest version of Sleuth using the likelihood ratio test. The numbers are better than the Wald test but not as great as expected (false positives vs true positives). In comparison, tophat-cufflinks looks v good. would love to get your feedback. I do not mind sharing the preliminary results or ROC curves by mail. -Shruti

pimentel commented 8 years ago

Hello @shrukane

I am sorry for the delay. I noticed you sent me a private email a while back. I was hoping to have a paper to send to you, and that should realistically happen within a week or so.

A few questions:

how are you simulating with polyester?
- how many samples?
- how are you stimulating effect sizes?
- how many reads?
- what transcriptome?
- what fraction of things are differentially expressed and how many of these things have mean estimated counts more than 5?

My experience has been that polyester is realistically too slow to simulate from the entire transcriptome.

We have been doing extensive benchmarks in several different conditions and this is simply not what we see given the types of simulations we have done. In our simulations, we are simulating from a negative binomial the transcript level (similar to polyester) but using the RSEM simulator. We have simulated a number of different types of effect sizes across the entire transcriptome and when comparing the true false discovery rate versus the ranking by estimated fdr (qval) sleuth comes ahead in pretty much every type of simulation that we have thrown at it.

shrukane commented 8 years ago

@pimentel we simulated varying numbers (0.1M, 0.2M, 0.3M, 0.5M, 0.75M, 1M, 3M, 5M, 7M and 10M) of paired end 100bp Illumina like RNASeq reads with a fragment length distribution of 250±25 bases, each with 1.2, 1.5, 2, 3, 4 and 5 fold differentially expressed genes between two conditions. For each combination, reads were generated for 2, 3, 4, 5 and "no" replicates. The transcriptome is the human chromosome 14 of the hg_19 build, and was downloaded with only one transcript/gene. We have differentially expressed 20% of the genes in the simulation. This is my simulation code - library(polyester) library(Biostrings) set.seed(142) # random number generator for reproducibility fasta_file = 'chr14_transcript.fa' fasta = readDNAStringSet(fasta_file) deInds = sample(1:length(fasta),size= 408, replace = TRUE) fold_changes = rep(1,length(fasta)) fold_changes[deInds] = rep(c(4,1/4), length(deInds)/2) outdir = '0.1M/' num_reps = 5 readspertx = round(100000 / width(fasta)) simulate_experiment(fasta_file, reads_per_transcript=readspertx, readlen= 100, fold_changes=fold_changes, num_reps= num_reps, outdir=outdir)

I hope this helps. -Shruti

pachterlab / kallisto

Kallisto specificity #108