Diversity index formula

We're refactoring the R code to make it more stable and easier to maintain and understand, but unfortunately that means a short delay for the documentation. It will be included in the release that accompanies v3.3 of the R code. We're also working on a formal description of the diversity index that should be out soon (we'll release it while it's in review), but for now I can give a summary (from the text in preparation):

The Nonpareil Index of Sequence Diversity (N_d) has units of natural logarithm of base pairs and summarizes the community diversity in sequence space, i.e., how redundant the sequences of a dataset are among themselves. This metric depends on the joint distribution of genome size and abundance as well as intra-genome gene duplication. Therefore, given a small variation in genome sizes and a small impact of genomic duplications, e.g., for prokaryotic-only communities, N_d can be used as a database-independent metric of alpha-diversity. Since the shapes of the Nonpareil curves from replicates and subsamples closely resemble each other regardless of coverage, we propose N_d as a coverage-independent measurement of diversity for the sampled community.

The actual formula is derived from the fitted model. As the sigmoidal model, we use the CDF of a gamma distribution (in log-base-pairs space), so the index (N_d) is the mode of the fitted distribution. Given a fitted model with parameters α and β:

N_d = (α-1)/β
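
As a rough illustration (this is not the package's R implementation), the formula can be checked numerically with a short Python sketch. It assumes that α is the shape and β the rate of the gamma distribution, which is the parameterization under which the mode equals (α-1)/β, and that α ≥ 1 so the mode is well defined. The function names and the parameter values below are hypothetical and chosen only for the example.

```python
import math


def gamma_pdf(x, alpha, beta):
    """Density of a gamma distribution with shape alpha and rate beta."""
    if x <= 0:
        return 0.0
    # Work in log space to avoid overflow for large shape values.
    log_pdf = (
        alpha * math.log(beta)
        - math.lgamma(alpha)
        + (alpha - 1) * math.log(x)
        - beta * x
    )
    return math.exp(log_pdf)


def nonpareil_diversity(alpha, beta):
    """Return N_d = (alpha - 1) / beta, the mode of the gamma distribution.

    Assumes alpha is the shape and beta the rate (an assumption of this
    sketch); the closed-form mode only holds for alpha >= 1.
    """
    if alpha < 1 or beta <= 0:
        raise ValueError("expected alpha >= 1 and beta > 0")
    return (alpha - 1) / beta


# Hypothetical fitted parameters, for illustration only.
alpha, beta = 40.0, 2.0
n_d = nonpareil_diversity(alpha, beta)

# Sanity check: locate the peak of the density on a fine grid and
# compare it with the closed-form value.
grid = [i / 1000 for i in range(1, 40001)]
numeric_mode = max(grid, key=lambda x: gamma_pdf(x, alpha, beta))

print(f"N_d (closed form): {n_d}")
print(f"mode found on grid: {numeric_mode}")
```

The grid search is only there to confirm that the closed-form expression matches the peak of the density; in practice the one-line formula is all that is needed once α and β are known.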

I'll leave this issue open until we port the remainder of the documentation. Once that's done, it'll be available with ?Nonpareil.Curve (with a capital C, for the class, not the function).

lmrodriguezr / nonpareil

Diversity index formula #27