vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

Suggestion: rank transformation in decostand()? #225

Closed: elaliberte closed this issue 7 years ago

elaliberte commented 7 years ago

High-throughput sequencing methods for characterizing e.g. soil microbial communities generate data matrices with 'sites' (or samples) as rows and operational taxonomic units (OTUs) as columns. Cell values are typically numbers of 'reads' (i.e. the number of sequenced amplicons belonging to a particular OTU). Many studies treat these read counts as raw abundance data (and use vegan functions for community analyses).

However, many people criticize this usage because the number of reads is not always clearly linked to abundance, due to PCR biases and all sorts of other problems. Some people therefore transform these data to presence/absence, but this is quite an extreme transformation and a lot of potentially valuable information is lost. Everyone seems to agree that these types of data could probably be treated at least semi-quantitatively. My suggestion would therefore be to add another method to decostand() that rank-transforms the data. One important issue, however, is that zeroes should stay zeroes during the rank transformation; see the example below.

Is this an option that could be worth adding to decostand()?

It is worth noting that absolute numbers of reads (i.e. row totals) are meaningless in these studies because they are a direct result of PCR efficiency and a number of other factors, so everyone standardises these data to look at differences in relative 'abundances' rather than absolute differences. I realize that standardizing ranks is unorthodox, but to me it is less bad than doing nothing and treating the number of reads as quantitative data (i.e. as if 10 times more reads meant that an OTU was 10 times more abundant).

# example: different number of amplicons ('reads') of different OTUs in a sample
seqs <- c(0, 0, 2, 4, 19, 23)

# rank transformation to treat these data semi-quantitatively
seqs.rank <- rank(seqs)
seqs.rank # not ok: the 0's were converted to (tied) ranks
# [1] 1.5 1.5 3.0 4.0 5.0 6.0

# solution: first replace 0's with NA's
seqs.na <- seqs
seqs.na[seqs.na == 0] <- NA
seqs.na
# [1] NA NA  2  4 19 23

# rank again, keeping NA's
seqs.na.rank <- rank(seqs.na, na.last = 'keep')
seqs.na.rank # ok now, but need to replace NA's with 0's
# [1] NA NA  1  2  3  4

# final step
seqs.na.final <- seqs.na.rank
seqs.na.final[is.na(seqs.na.final)] <- 0
seqs.na.final # now ok: non-zero values, ranked, but 0's (absences) preserved
# [1] 0 0 1 2 3 4
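
For reference, the three steps can be collapsed into one small helper (the name rank_keep_zeros is just illustrative, not an existing vegan function):

# compact version: rank the non-zero values, keep zeros as zeros
rank_keep_zeros <- function(v) {
  r <- rank(replace(v, v == 0, NA), na.last = "keep")
  replace(r, is.na(r), 0)
}
rank_keep_zeros(seqs)
# [1] 0 0 1 2 3 4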
jarioksa commented 7 years ago

It looks like the following code would do:

rank = {
      is.na(x) <- x == 0                            ## mark zeros as missing so they get no rank
      x <- apply(x, MARGIN, rank, na.last = "keep") ## rank within each margin, keeping NAs
      x[is.na(x)] <- 0                              ## restore zeros for absences
      x
}

Comments are welcome, both on the idea and on the implementation.

What worries me here is that the maximum rank is related to the frequency of species: is it really so that more common (= higher number of occurrences) species should get higher abundance values?

elaliberte commented 7 years ago

Thanks, yes, this implementation is exactly what I had in mind, although I was only thinking of the situation where MARGIN = 1 (rows). I'm not sure in what situation one would want to rank by columns (species), although I guess it doesn't hurt to provide the option; perhaps make MARGIN = 1 the default? You're right that the maximum rank is a function of species richness, but I initially didn't see this as a big problem since people standardise by row afterwards (though it is probably best not to assume they will). In that case, perhaps one option would be to standardize by the maximum rank in each row to obtain "relative ranks"? Something like:

rank = {
     if (missing(MARGIN)) MARGIN <- 1
     is.na(x) <- x == 0                            ## mark zeros as missing
     x <- apply(x, MARGIN, rank, na.last = "keep") ## rank within each margin
     tmp <- apply(x, MARGIN, max, na.rm = TRUE)    ## maximum observed rank
     x <- sweep(x, MARGIN, tmp, "/")               ## relative ranks
     x[is.na(x)] <- 0                              ## restore zeros
     x                                             ## return the matrix, not the last assignment
}
elaliberte commented 7 years ago

Actually, it is probably better to standardise by the margin total?

rank = {
     if (missing(MARGIN)) MARGIN <- 1
     is.na(x) <- x == 0                            ## mark zeros as missing
     x <- apply(x, MARGIN, rank, na.last = "keep") ## rank within each margin
     tmp <- apply(x, MARGIN, sum, na.rm = TRUE)    ## total of the ranks
     x <- sweep(x, MARGIN, tmp, "/")               ## ranks as proportions of the total
     x[is.na(x)] <- 0                              ## restore zeros
     x
}
eduardszoecs commented 7 years ago

Using

x <- replace(x, x == 0, NA)

instead of

 is.na(x) <- x == 0

is slightly faster.

jarioksa commented 7 years ago

My suggestion is indeed a bit slow, but microbenchmark did not find any clear or consistent differences with normal data sizes. Here are the test results:

library(vegan)
library(microbenchmark)
data(BCI)
bci <- as.matrix(BCI) ## as.matrix will be used in decostand
fun1 <- function(x) is.na(x) <- x == 0           ## my original
fun2 <- function(x) x[x==0] <- NA                ## the classic
fun3 <- function(x) x <- replace(x, x == 0, NA)  ## @EDiLD
microbenchmark(fun1(bci), fun2(bci), fun3(bci))
#Unit: microseconds
#      expr     min       lq     mean   median       uq      max neval
# fun1(bci) 101.587 118.7785 204.8792 125.1175 208.7510 2885.952   100
# fun2(bci)  89.953 105.0370 179.6405 108.0380 182.4740 3350.545   100
# fun3(bci)  91.386 105.5060 189.9827 107.1530 183.2845 2994.159   100
eduardszoecs commented 7 years ago

Yep, no consistent differences (though fun1() seems to be slightly slower). The classic one is IMO more readable...

jarioksa commented 7 years ago

@elaliberte: it seems that apply(x, 1, ...) will transpose x, so we need to back-transpose.

About standardization: If the number of above-zero values (number of occurrences or number of species depending on the MARGIN) is f, then the sum of ranks will be (f^2 + f)/2, and the maximum rank is f (and lower when there are ties at highest rank). I think a different standardization is needed unless we want to down-weight OTU-rich sampling units. Divide by f to give proportional ranks in 0..1?
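
A minimal standalone sketch of that for MARGIN = 1, with the back-transpose and division by the maximum possible rank f (the name rrank_rows is illustrative, not the final decostand code):

rrank_rows <- function(x) {
  x <- as.matrix(x)
  is.na(x) <- x == 0                           ## zeros get no rank
  x <- t(apply(x, 1, rank, na.last = "keep"))  ## apply() transposes, so transpose back
  f <- rowSums(!is.na(x))                      ## number of non-zero items per row
  x <- sweep(x, 1, f, "/")                     ## proportional ranks in 0..1
  x[is.na(x)] <- 0                             ## restore zeros for absences
  x
}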

jarioksa commented 7 years ago

I have now implemented the rank option for decostand in a separate branch. At the moment it performs an unadjusted rank transformation: no adjustment is made to the ranks, so they vary with the number of non-zero items. I was originally considering adjusting them by the maximum possible rank (not the maximum observed rank, which can be lower if the highest values are tied), and @elaliberte suggested adjustment by the total. At the moment I am not sure what to do, and I have also considered leaving the rank standardization as it is.

To compare the different options, let us consider the default case of MARGIN = 1, which standardizes by sampling units, and let S be the number of species / OTUs / taxa in each unit. Then -- in the absence of ties -- the different adjustments give us:

| Adjustment | Ranks                          | Total      |
| ---------- | ------------------------------ | ---------- |
| None       | 0, 1, ..., S                   | S(S + 1)/2 |
| Max rank   | 0, 1/S, ..., 1                 | (S + 1)/2  |
| Total      | 0, 2/(S^2 + S), ..., 2/(S + 1) | 1          |
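
Concretely, for the seqs example from the original post (S = 4 non-zero values), the three adjustments give:

r <- c(0, 0, 1, 2, 3, 4)  ## unadjusted ranks, zeros kept
r / max(r)                ## max rank: 0.00 0.00 0.25 0.50 0.75 1.00
r / sum(r)                ## total:    0.0  0.0  0.1  0.2  0.3  0.4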

It is difficult to say which of these is generally most useful. Two important points are:

  1. What is the meaning of S? Is the number of species a meaningful value ("alpha diversity") or a consequence of sampling intensity, sampling success or capture rate? The original request quoted the example of high-throughput sequencing, where S is hardly meaningful. Other typical examples are samples of benthic invertebrates with variable and unknown capture rates. Even if S is meaningful, do we want to turn a superabundant monodominant species into 1 and a scarce species in a diverse community into S?

  2. How are these transformed abundances used? Several later analytic tools perform their own adjustments of the raw values, and these can overturn any adjustment we made previously. Primarily I see this rank transformation as a way of fixing strongly skewed abundance values in data where ranks may be meaningful and ties are rare. This is the case for sequence data, counts of individuals, etc. It does not serve any other useful purpose; for instance, any real differences in total abundance will vanish. Apart from skewness, many analytic tools include a semi-automatic correction for unequal and unknown capture rates. For instance, correspondence analysis includes a double standardization that is designed for exactly this kind of data.

Currently I am not sure how to standardize the results. I am mainly hovering between two alternatives: do nothing, or use relative ranks (max rank).

BErfanian commented 7 years ago

Hi, I had the problem of "superabundant monodominant species" when calculating rank-abundance plots: the two dominant species change when I compute the rank-abundance plot for presence/absence versus abundance data (and neither of them is the "real" dominant I observed in the field, because of a superabundant but occasional species). After some thinking, I used the frequency of a species (number of presences in SUs / total number of SUs) as a downweighting coefficient for the abundance of that species. After using this, the results look better.
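
A minimal sketch of that downweighting, assuming a sites-by-species matrix x (illustrative code, not a vegan function):

freq <- colSums(x > 0) / nrow(x)  ## frequency: presences per species / number of SUs
x.dw <- sweep(x, 2, freq, "*")    ## downweight each species' abundances by its frequency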

elaliberte commented 7 years ago

Good points @jarioksa, and I agree that it is not obvious what is generally most useful. In this case, perhaps the best strategy would be to just provide ranks (i.e. do nothing).

jarioksa commented 7 years ago

I could not make up my mind, so I implemented both raw ranks (rank) and relative ranks (rrank). OK?
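
In use this would be, e.g.:

library(vegan)
data(BCI)
r  <- decostand(BCI, "rank")   ## raw within-row ranks, zeros stay zero
rr <- decostand(BCI, "rrank")  ## ranks relative to the maximum rank, in 0..1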

elaliberte commented 7 years ago

This sounds good to me. Thanks again!

jarioksa commented 7 years ago

Closed with merge commit 0fe6a5d