Estimating coverage per sequencing effort

ottentim commented 8 years ago

When running nonpareil from the command line and viewing the .npo output file in R, where does the "empty circle" value for the estimated coverage depth vs sequencing effort com from?? Presumably it is from the end of the .npo file (see below for output), which in the case of the attached output suggests that the coverage was about 72.8% for my sequencing effort: ... 1732486 0.35225 0.05063 0.31818 0.35354 0.38462 2474980 0.40542 0.04240 0.37692 0.40458 0.43333 3535685 0.46346 0.03631 0.43787 0.46512 0.48837 5050979 0.51771 0.02825 0.49793 0.51894 0.53689 7215684 0.57511 0.02255 0.56012 0.57550 0.59053 10308120 0.63122 0.01838 0.61943 0.63125 0.64300 14725885 0.68184 0.01314 0.67288 0.68182 0.69086 21036979 0.72835 0.00807 0.72249 0.72833 0.73418

Yet when I plot the data in R, the empty circle appears about 8-10% higher than this, and this is the same for all six of the datasets I'm trying this on.

Am I interpreting this correctly, Is the last line, second column of the .npo output file the estimate of coverage depth for my sequencing effort?? If not, how do I determine this? Thanks for your help, I think your method has a lot of potential!!

lmrodriguezr commented 8 years ago

Hello @ottentim. Your interpretation of the plot is correct, but note that the values in the .npo file are not measuring the same thing. The values in the .npo file correspond to redundancy, which is a value specific to the parameters used, in particular to -L. On the other hand, the plot's values are the estimated coverage depth, which is (on average) independent of the -L parameter. Both measurements correlate very well in log-log space, but the exact values differ (as you saw in your examples).

Using the notation of the original publication, the .npo values are κ (redundancy), while the values in the plot are C-hat (estimated abundance-weighted average coverage). Assuming that you used -L 0.5 (default), a value of redundancy of 72.8% would correspond to an average coverage of about 78.6% (0.728^0.757, from Suppl. Table 1), which looks about right in your plot.

To get the exact value of coverage, see the output of Nonpareil.curve in the R console, which should be a list (or a data.frame, if you used Nonpareil.curve.batch). The value labeled C corresponds to C-hat.

ottentim commented 8 years ago

Thank you Luis,

Your response was very helpful. This is a really great tool! Happy New Year.

Cheers,

Tim Otten

On Wed, Dec 30, 2015 at 11:28 AM, Luis M Rodriguez-R < notifications@github.com> wrote:

Hello @ottentim https://github.com/ottentim. Your interpretation of the plot is correct, but note that the values in the .npo file are not measuring the same thing. The values in the .npo file correspond to redundancy http://nonpareil.readthedocs.org/en/latest/redundancy.html#output, which is a value specific to the parameters used, in particular to -L. On the other hand, the plot's values are the estimated coverage depth, which is (on average) independent of the -L parameter. Both measurements correlate very well in log-log space, but the exact values differ (as you saw in your examples).

Using the notation of the original publication http://bioinformatics.oxfordjournals.org/content/30/5/629.long, the .npo values are κ (redundancy), while the values in the plot are C-hat (estimated abundance-weighted average coverage). Assuming that you used -L 0.5 (default), a value of redundancy of 72.8% would correspond to an average coverage of about 78.6% (0.7280.757, from Suppl. Table 1 http://bioinformatics.oxfordjournals.org/content/30/5/629/suppl/DC1), which looks about right in your plot.

To get the exact value of coverage, see the output of Nonpareil.curve in the R console, which should be a list (or a data.frame, if you used Nonpareil.curve.batch). The value labeled C corresponds to C-hat.

— Reply to this email directly or view it on GitHub https://github.com/lmrodriguezr/nonpareil/issues/12#issuecomment-168060261 .

Timothy G. Otten, PhD, MPH Department of Microbiology Oregon State University 226 Nash Hall Corvallis, OR 97331 T: 541-737-1796

lmrodriguezr / nonpareil

Estimating coverage per sequencing effort #12