kingsfordgroup / sailfish

Rapid Mapping-based Isoform Quantification from RNA-Seq Reads
http://www.cs.cmu.edu/~ckingsf/software/sailfish
GNU General Public License v3.0

Library depth with alignment-free methods #84

Closed roryk closed 8 years ago

roryk commented 8 years ago

Hi Rob,

We get a lot of questions about how much to sequence and whether to do paired-end sequencing. We recommend that people sequence around 50-60 million reads, and use paired-end sequencing, if the goal of their experiment is to look at splicing. This makes experiments designed to look at splicing 2-3x more expensive than a DGE experiment. Do you have any thoughts on whether alignment-free methods let people sequence more shallowly, or at least drop paired-end sequencing? That doesn't help with biological variability, of course, but some component of the noise in assigning reads to transcripts looks like it might be minimized using Sailfish.

blahah commented 8 years ago

(I'm not Rob, but he pointed me at this)

Alignment-free methods are not likely to improve your ability to do DGE with lower depth when compared to alignment methods, nor do they make inference without paired-ends easier.

The main advantage of alignment-free methods is that they are faster - sometimes vastly so. This allows you to do things like bootstrapping, and therefore have some idea of the uncertainty in your mapping, which is theoretically useful in DGE but is only now starting to be explored (e.g. in sleuth). But these methods are not inherently more accurate, nor can they do more with less information.
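To make the bootstrapping idea concrete, here is a toy sketch. It is not how sailfish or sleuth implement it: the equivalence-class counts, the simple EM (which ignores transcript length), and the function names `em_abundance` and `bootstrap_abundance` are all assumptions made for illustration. Reads are resampled with replacement, abundances are re-estimated each time, and the spread across resamples gives a rough idea of the uncertainty in the assignment.

```python
import random
import statistics
from collections import Counter


def em_abundance(eq_counts, n_transcripts, n_iter=200):
    """Toy EM: estimate transcript proportions from equivalence-class counts.

    eq_counts maps a tuple of compatible transcript indices to a read count.
    Transcript length is ignored here (an assumption, to keep the sketch short).
    """
    theta = [1.0 / n_transcripts] * n_transcripts
    total = sum(eq_counts.values())
    for _ in range(n_iter):
        new = [0.0] * n_transcripts
        for txs, count in eq_counts.items():
            denom = sum(theta[t] for t in txs)
            for t in txs:
                # Split each class's reads in proportion to current estimates.
                new[t] += count * theta[t] / denom
        theta = [x / total for x in new]
    return theta


def bootstrap_abundance(eq_counts, n_transcripts, n_boot=50, seed=0):
    """Resample reads with replacement and re-run the EM on each resample."""
    rng = random.Random(seed)
    classes = list(eq_counts)
    weights = [eq_counts[c] for c in classes]
    total = sum(weights)
    estimates = []
    for _ in range(n_boot):
        resampled = Counter(rng.choices(classes, weights=weights, k=total))
        estimates.append(em_abundance(resampled, n_transcripts))
    return estimates


# Hypothetical data: 300 reads unique to transcript 0, 100 unique to
# transcript 1, and 600 reads compatible with both.
eq_counts = {(0,): 300, (1,): 100, (0, 1): 600}
point = em_abundance(eq_counts, 2)
boots = bootstrap_abundance(eq_counts, 2)
spread = [statistics.stdev([b[t] for b in boots]) for t in range(2)]
print("point estimate:", point)
print("bootstrap standard deviation:", spread)
```

Each resample needs a full re-estimation, which is why this only becomes practical when a single quantification run is fast.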

That said, sailfish, for example, uses the read-pairing information to improve assignment. Ultimately, read pairing provides more information. So does higher depth (until you start saturating).

Also, I should point out that any RNA-seq experiment needs to account for splicing, even if its goal isn't specifically to look at it. This is because, in order to estimate gene expression, you must first estimate the expression of all isoforms of a gene. If you don't, you are introducing error in an unknown direction and of an unknown magnitude for every gene that has multiple isoforms.
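A small worked example of that point, as a sketch under stated assumptions: a hypothetical gene has two isoforms of 1000 and 3000 bases, the number of reads from an isoform is taken to be proportional to molecules times length, and the gene is assumed to switch completely from the short isoform to the long one while the number of transcript molecules stays the same. All names and numbers here are made up for illustration.

```python
# Hypothetical isoform lengths in bases.
lengths = {"short": 1000, "long": 3000}
# Length used by the naive gene-level estimate (assumed: the longest isoform).
gene_length = 3000


def expected_reads(molecules):
    """Reads per isoform, assuming reads scale with molecules * length."""
    return {iso: molecules[iso] * lengths[iso] / 100 for iso in molecules}


def gene_abundance_isoform_aware(reads):
    """Normalize each isoform by its own length, then sum over isoforms."""
    return sum(reads[iso] / lengths[iso] for iso in reads)


def gene_abundance_naive(reads):
    """Pool all reads for the gene and divide by a single gene length."""
    return sum(reads.values()) / gene_length


# Same number of transcript molecules in both conditions; only the isoform changes.
reads_a = expected_reads({"short": 100, "long": 0})
reads_b = expected_reads({"short": 0, "long": 100})

aware_a = gene_abundance_isoform_aware(reads_a)
aware_b = gene_abundance_isoform_aware(reads_b)
naive_a = gene_abundance_naive(reads_a)
naive_b = gene_abundance_naive(reads_b)

print("isoform-aware:", aware_a, aware_b)
print("naive:", naive_a, naive_b)
```

Under these assumptions the isoform-aware estimate is identical in the two conditions, while the naive gene-level estimate differs threefold even though the number of transcripts has not changed.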

So, in conclusion: doing the higher-depth, paired-end experiment is the right thing to do in every situation, no matter which tools you use for analysis.

roryk commented 8 years ago

Thanks @Blahah for your thoughts, that is super helpful.

rob-p commented 8 years ago

@roryk: @Blahah's answer is, of course, spot on. There's only a small addendum I'd like to make here: "alignment-free" methods tend to be able to map (seemingly accurately) a somewhat larger fraction of the data than traditional alignment-based methods. That is, "alignment-free" methods seem to be somewhat more robust to noise in the data and variation in the reference. The corollary is that, with the same-sized library, you may be able to pull a little more signal out with an alignment-free method than with one based on alignments. Despite this, I still endorse everything @Blahah has said: you're better off sequencing paired-end data, and a higher-depth library will generally give you better quantification results (though the returns are diminishing, especially once you begin to saturate the sample).

Best, Rob