koopkaup closed this issue 2 years ago
We're refactoring the R code to make it more stable and easier to maintain and understand, but unfortunately that has delayed the documentation. It will be included in the release that accompanies v3.3 of the R code. We're also preparing a formal description of the diversity index that should be out soon (we'll release it while it is in review), but for now I can give a summary (from the text in preparation):
The Nonpareil Index of Sequence Diversity (Nd) has units of natural logarithm of base pairs and summarizes the community diversity in sequence space, i.e., how redundant the sequences of a dataset are among themselves. This metric depends on the joint distribution of genome size and abundance as well as on intra-genome gene duplication. Therefore, given a small variation in genome sizes and a small impact of genomic duplications, e.g., for prokaryotic-only communities, Nd can be used as a database-independent metric of alpha-diversity. Since the shapes of the Nonpareil curves from replicates and subsamples closely resemble each other regardless of coverage, we propose Nd as a coverage-independent measurement of diversity for the sampled community.
The actual formula is derived from the fitted model. As the sigmoidal model, we use the CDF of a gamma distribution (in log-base-pairs space), and the index (Nd) is the mode of the fitted distribution. Given a fitted model with parameters α and β:
Nd = (α - 1) / β
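As a quick sanity check of that formula, here is a minimal Python sketch (it is not taken from the Nonpareil R code). It assumes that α is the shape and β the rate of the gamma distribution, which is the parameterization under which the mode equals (α - 1) / β, and that α ≥ 1. The function names and the example values α = 100, β = 5 are hypothetical, chosen only for illustration. The sketch compares the closed-form mode with a brute-force search over the log-density.

```python
import math


def gamma_log_pdf(x, shape, rate):
    """Log-density of a gamma distribution (shape/rate form) at x > 0."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(x) - rate * x)


def nd_from_fit(shape, rate):
    """Closed-form mode (shape - 1) / rate; assumes shape >= 1 and rate > 0."""
    if shape < 1.0 or rate <= 0.0:
        raise ValueError("the mode formula assumes shape >= 1 and rate > 0")
    return (shape - 1.0) / rate


def numeric_mode(shape, rate, lo, hi, steps=60000):
    """Grid search for the x in [lo, hi] that maximizes the log-density."""
    best_x = lo
    best_val = gamma_log_pdf(lo, shape, rate)
    for i in range(1, steps + 1):
        x = lo + (hi - lo) * i / steps
        val = gamma_log_pdf(x, shape, rate)
        if val > best_val:
            best_x, best_val = x, val
    return best_x


# Hypothetical fitted parameters, for illustration only.
alpha, beta = 100.0, 5.0
nd = nd_from_fit(alpha, beta)
approx = numeric_mode(alpha, beta, lo=0.001, hi=60.0)
print(f"closed-form Nd = {nd:.3f}, grid-search mode = {approx:.3f}")
```

Because the model is fitted in log-base-pairs space, math.exp(nd) should give the corresponding value in base pairs, if that scale is easier to interpret.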
I'll leave this issue open until we port the remainder of the documentation. Once that is done, it'll be available with ?Nonpareil.Curve (with a capital C, for the class, not the function).
In the latest version a new diversity index is calculated. What formula and index value does it use?
The most recent R package version (3.3) lost all explanations of the values. Maybe this information should be added there.