hartleys / QoRTs

Quality of RNA-Seq Toolset
52 stars 14 forks source link

Issue with the strandness test #8

Closed lxhuang7 closed 9 years ago

lxhuang7 commented 9 years ago

Hi Stephen, I ran into another issue with QoRTs and would like to hear your opinion. This time is the test of strandness for ~50 strand-specific paired-end Illumina RNA-Seq samples. From the sequencing provider I learn the libraries are fr-firststrand. QSeqc infer_experiment tool tells me the reads from each sample are at least >92% correctly mapped to the underneath gene model, which is expected. Picard.CollectRnaMetrics tool also shows all but one samples have >90% correct strand reads. The one left out has ~84% correct strand reads. However, QoRTs strandness test produces mixed results where most samples are either un-stranded or unknown-strand. I used the same gene model and with --stranded option enabled in QoRTs. Thanks!

hartleys commented 9 years ago

QoRTs uses a (rather arbitrary) cutoff to determine the "inferred" strandedness, and allows up to 10% divergence from the ideal. So any sample that is 90% stranded will be inferred as that strandedness rule. Any sample between 40-60% will be marked unstranded. All the rest will be marked "UNKNOWN_STRANDEDNESS". So it's basically just a guesstimate.

I chose these cutoffs based on my own (limited) experience. But this type of thing will probably vary A LOT depending on the protocol, sample quality, and species that you're using. If that is the degree of strandedness you usually expect to find, then it's probably fine.

The "inferred" strandedness isn't used for anything in QoRTs, it's just a warning that it prints at the end telling you to maybe check the strandedness.

You can see the QoRTs estimates for strandedness in the summary table, or in the QC.summary.txt file. There are a bunch of rows there, all starting with "StrandTest"

StrandTest_frFirstStrand: Read/read-pair count matching the fr_firstStrand rule.
StrandTest_frSecondStrand: Read/read-pair count matching the fr_secondStrand rule.
StrandTest_ambig_genesFountOnBothStrands: Read/read-pair counts for reads covering genes on both strands
StrandTest_ambig_noGenes: Read/read-pair counts excluded from strand test because they don't cover any known gene.
StrandTest_ambig_other: Read/read-pair counts excluded from strand test for other reasons.
StrandTest_STRANDEDNESS_MATCHES_INFERRED: best guess, does the strandedness match the expected?

If I recall correctly, the estimates provided by QoRTs and RSeQC differ slightly because of their treatment of "ambiguous" reads and of paired-end reads.

The one thing you might want to watch out for is extreme outliers. Sometimes if a few of your samples have a lower strand-specificity than the others then it can cause false differentials to appear when there are genes on opposite strands.

lxhuang7 commented 9 years ago

Thanks for the explanation in details.