lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/

Diversity index formula #27

koopkaup closed 2 years ago

koopkaup commented 6 years ago

In the latest version a new diversity index is calculated. What formula and index value does it use?

The most recent R package version (3.3) lost all the explanations of the returned values. Maybe that information should be put back there.

lmrodriguezr commented 6 years ago

We're working on refactoring the R code to make it more stable and easier to maintain and understand, but unfortunately that comes with a little delay on the documentation. This will be included when we launch the release accompanying this v3.3 of the R code. We're currently working on a formal description of the diversity index that should be out soon (we'll release it while in review), but for now I can give a summary (from the text in preparation):

The Nonpareil Index of Sequence Diversity (Nd) has units of natural logarithm of base pairs and summarizes the community diversity in sequence space, i.e., how redundant the sequences of a dataset are among themselves. This metric depends on the joint distribution of genome size and abundance as well as intra-genome gene duplication. Therefore, given a small variation in genome sizes and a small impact of genomic duplications, e.g., for prokaryotic-only communities, Nd can be used as a database-independent metric of alpha-diversity. Since the shapes of the Nonpareil curves from replicates and subsamples closely resemble each other regardless of coverage, we propose Nd as a coverage-independent measurement of diversity for the sampled community.

The actual formula is derived from the fitted model. As the sigmoidal model, we use the CDF of a gamma distribution (in log-base-pairs space), so the index (Nd) is the mode of the fitted distribution. Given a fitted model with parameters α and β:

Nd = (α-1)/β
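To make the formula concrete, here is a minimal Python sketch (not code from the Nonpareil package) that evaluates the mode and cross-checks it numerically against the peak of the gamma density. It assumes α is the shape and β the rate parameter, which is the parameterization under which the mode is (α-1)/β, and it assumes α ≥ 1 so that the mode is defined by this expression. The function names and the parameter values are hypothetical, chosen only for illustration.

```python
import math


def nonpareil_diversity(alpha: float, beta: float) -> float:
    """Mode of a gamma distribution, assuming shape `alpha` and rate `beta`."""
    if alpha < 1 or beta <= 0:
        raise ValueError("(alpha - 1) / beta requires alpha >= 1 and beta > 0")
    return (alpha - 1.0) / beta


def gamma_log_pdf(x: float, alpha: float, beta: float) -> float:
    """Log-density of a gamma distribution with shape `alpha` and rate `beta`."""
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + (alpha - 1.0) * math.log(x) - beta * x)


# Hypothetical fitted parameters, for illustration only.
alpha, beta = 20.0, 1.0
nd = nonpareil_diversity(alpha, beta)

# Cross-check: locate the density peak on a fine grid over (0, 60].
grid = [i / 1000.0 for i in range(1, 60001)]
peak = max(grid, key=lambda x: gamma_log_pdf(x, alpha, beta))

print("Nd (closed form):", nd)
print("Density peak on grid:", peak)
print("Equivalent in base pairs:", math.exp(nd))
```

Because Nd is expressed in natural-log base pairs, `math.exp(nd)` converts it back to base pairs. If a fit reports a scale parameter instead of a rate, the mode would be (α-1) times the scale.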

I'll leave this issue open until we port the remainder of the documentation. Once that's done, it'll be available with ?Nonpareil.Curve (with a capital C, for the class, not the function).