vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0
436 stars 95 forks source link

number of rda axes #598

Closed aguilart closed 3 weeks ago

aguilart commented 9 months ago

Hi, I am plotting some RDA of morphological data against some experimental treatments in vegan. The morphological data is continous and the explanatory variables are continuous and categorical. I got confused about why the number of RDA axes is usually less than the amount of response variables in the dataset.

From what I understood, the algorithm behind RDA fits first a linear (multiple) regression of every response variable against the explanatory variables, returning a new matrix of expected values. Then, it makes a PCA on this matrix of expected values. The result of this process are the so called RDA axes. Thus, I would expect to have as many RDA axes as response variables are in the dataset. So why I get usually only less than that?

As example, I made this mock data where a matrix of 10 continuous response variables depends linearly on two variables: "day" (which is continuous) and "species" (which is categorical).

x <- c(1:16)

y1 <- c(jitter((10 *x + 2), amount = 15 ),
        jitter((10 *x + 3), amount = 15 ),
        jitter((10 *x + 5), amount = 15 ))

y2 <- c(jitter((5 *x + 7), amount = 15 ),
        jitter((5 *x + 16), amount = 15 ),
        jitter((5 *x + 32), amount = 15 ))

y3 <- c(jitter((0.5 *x + 8), amount = 15 ),
        jitter((0.5 *x + 62), amount = 15 ),
        jitter((0.5 *x + 145), amount = 15 ))

y4 <- c(jitter((0.001 *x + 0.01), amount = 15 ),
        jitter((0.001 *x + 0.02), amount = 15 ),
        jitter((0.001 *x + 0.03), amount = 15 ))

y5 <- c(jitter((0.001 *x + 0.1), amount = 15 ),
        jitter((0.001 *x + 0.2), amount = 15 ),
        jitter((0.001 *x + 0.3), amount = 15 ))

y6 <- c(jitter((0.001 *x + 10), amount = 15 ),
        jitter((0.001 *x + 22), amount = 15 ),
        jitter((0.001 *x + 35), amount = 15 ))

y7 <- c(jitter((0.001 *x + 100), amount = 15 ),
        jitter((0.001 *x + 200), amount = 15 ),
        jitter((0.001 *x + 3000), amount = 15 ))

y8 <- c(jitter((0.001 *x + 100), amount = 15 ),
        jitter((0.001 *x + 2000), amount = 15 ),
        jitter((0.001 *x + 300), amount = 15 ))

y9 <- c(jitter((0.001 *x + 10), amount = 15 ),
        jitter((0.001 *x + 200), amount = 15 ),
        jitter((0.001 *x + 37), amount = 15 ))

y10 <- c(jitter((0.9 *x + 0.1), amount = 15 ),
        jitter((0.9 *x + 0.2), amount = 15 ),
        jitter((0.9 *x + 0.3), amount = 15 ))

z <- rep(c("A", "B", "C"), each = 16)

mocking <- data.frame(var_1 = y1,  var_2 = y2,  var_3 = y3,  var_4 = y4,  var_5 = y5,
  var_6 = y6,  var_7 = y7,  var_8 = y8,  var_9 = y9,  var_10 = y10,
  day = rep(x, 3),
  species = z
)

If I run an RDA with "day", I get only 1 RDA axis:

rd_mocking <- 
  rda(mocking[,c(1:10)] ~ day,
      scale = TRUE, data = mocking)

rd_mocking

Eigenvalues for constrained axes:
  RDA1 
1.8721 

If I run an RDA with "species", I get 2 RDA axes:

rd_mocking <- 
  rda(mocking[,c(1:10)] ~ species,
      scale = TRUE, data = mocking)

rd_mocking

Eigenvalues for constrained axes:
  RDA1   RDA2 
2.7515 2.0286 

If I run an RDA with both "day" and "species" I get 5 axes:

rd_mocking <- 
  rda(mocking[,c(1:10)] ~ day * species,
      scale = TRUE, data = mocking)

rd_mocking

Eigenvalues for constrained axes:
  RDA1   RDA2   RDA3   RDA4   RDA5 
2.8731 2.1030 1.6888 0.1708 0.0204 

Is it that vegan "decides" the number of RDA axes to give in the output based on the amount of variation explained in the model? or am I understanding wrong how RDA is done?

Thanks for all the feedback on vegan!

jarioksa commented 9 months ago

Your expectation was wrong. In constrained analysis you should expect (at most) as many constrained axes as the number of constraints. The number of constrained axes depends on the rank of constraints (basically: it can be less if constraints are not independent). Variable day is a continuous variable with rank=1 and therefore you get only one constrained axis: it is the result of linear regression, one constraining variable, one constrained axis. Variable species is a three-level factor which has rank=2 (two contrasts, or species B and C each against species A), and therefore you get two axes. Model with interactions has rank=5 and therefore you get five axes. With default contrasts, the terms in ~ species*day are day, speciesB, speciesC, day:speciesB, day:speciesC.

aguilart commented 9 months ago

Aha! Thanks! Does this mean that, in the case of the model ~species*day I can interpret the RDA axes as follows?

RDA1: variation explained by day RDA2: variation explained by the contrast between species A and species B RDA3: variation explained by the contrast between species A and species C RDA4: variation explained by how different is the relationship of species A and day with respect to the relationship between day and species B RDA4: variation explained by how different is the relationship of species A and day with respect to the relationship between day and species C

gavinsimpson commented 9 months ago

No, each axis is a linear combination of the terms in the model. I forget now how exactly you see this, but I think you do it through the "reg" scores or canonical coefficients (@jarioksa will correct me if I am wrong):

> scores(rd_mocking, display = "reg")
                   RDA1       RDA2
day          0.33649315 -0.1763321
speciesB     0.03418037 -1.0984374
speciesC     0.92333491 -0.3511435
day:speciesB 0.01003758 -0.0228299
day:speciesC 0.03955184 -0.1043876
attr(,"const")
[1] 4.656123

or with the "bp" scores, which give some indication of the sign of the effect in relation to the axes themselves.

jarioksa commented 9 months ago

Regression coefficients indeed tell how the axes were constructed. However, Cajo ter Braak's recommendation is to use the option that is given in vegan as display="cn" (default) which gives biplot arrows for linear constraints and ordered factors and centroids of factor levels (incl. ordered factors). The latter are easier to interprete: regression ("reg") scores are influenced by correlations among constraints which confuses interpretation of single effects.