vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0
436 stars 95 forks source link

How to use R function `prc` in package `vegan` to produce the tables in van den Brink, P.J. & ter Braak, C.J.F. (1999) #17

Closed zhenglei-gao closed 9 years ago

zhenglei-gao commented 10 years ago

Note this is a duplicate of my question on Cross Validated

This is also not a bug report but a question.

The paper is:

van den Brink, P.J. & ter Braak, C.J.F. (1999). Principal response curves: Analysis of time-dependent multivariate responses of biological community to stress. Environmental Toxicology and Chemistry, 18, 138-148.

The example data used in the paper:

spdta <- structure(list(`Species 1` = c(100, 100, 100, 100, 100, 110, 
110, 137.5, 157.143, 157.143, 120, 120, 150, 240, 240, 130, 130, 
144.444, 216.667, 216.667), `Species 2` = c(100, 100, 100, 100, 
100, 90, 90, 90, 90, 90, 80, 80, 80, 80, 80, 100, 100, 100, 100, 
100), `Species 3` = c(100, 100, 100, 100, 100, 120, 120, 96, 
84, 84, 140, 140, 112, 70, 70, 160, 160, 144, 96, 96), `Species 4` = c(100, 
100, 100, 100, 100, 90, 90, 72, 63, 63, 80, 80, 64, 40, 40, 70, 
70, 63, 42, 42), `Species 5` = c(200, 200, 200, 200, 200, 240, 
240, 153.6, 117.6, 117.6, 240, 240, 153.6, 60, 60, 200, 200, 
162, 72, 72), `Species 6` = c(100, 100, 100, 100, 100, 130, 130, 
83.2, 63.7, 63.7, 70, 70, 44.8, 17.5, 17.5, 100, 100, 81, 36, 
36)), .Names = c("Species 1", "Species 2", "Species 3", "Species 4", 
"Species 5", "Species 6"), class = "data.frame", row.names = c(NA, 
-20L))

# time
week <- gl(4, 5, labels=0:3)
# treatment
dose <- factor(rep(c("C","C","L","H","H"), 4),levels=c("C","L","H"),ordered=FALSE)

Perform a principal response curve analysis:

require(vegan)
#the species data are natural log(x) transformed
mod <- prc(response = log(spdta), treatment = dose, time = week)
plot(mod)

However, the species scores are difference from the ones listed in the paper, but I can get it back by the function scores

# Species scores:
scores.sps.1999 <- scores (mod, choices=1, scaling=1, const=-sqrt(nrow(spdta)), dis='sp')
scores.sps.1999

My question is:

  1. The canonical coefficients are also different from the paper. I can manage to transform my results into the paper ones but don't know why.

    sum_prc <- summary(mod,scaling=1)#, const=-sqrt(nrow(spdta)))

    sum_prc

    -(sum_prc$coefficients)/(sqrt(nrow(spdta))/2)

  2. Also, the paper gives the standardized canonical coefficients rdt and standard deviations sdt. How do I get this piece of information from my PRC results?

    rdt <- c(0,0,0,0,0.1805,0.3970,0,0.1805,0.7716,0,0.0852,0.5686) sdt <- c(1e-10,1e-10,1e-10,1e-10,0.2179,0.3,1e-10,0.2179,0.3,1e-10,0.2179,0.3) Cdt <- 0.2*rdt/sdt

eduardszoecs commented 10 years ago

Beforehand: I do not have a CANOCO licence, so cannot check the results with CANOCO (it's version 5 now i think...) Seems like you are trying to reproduce the results of CANOCO...

To Question1:

I think that the different results for species and coefficients are because of different scalings.

The canonical coefficients are also different from the paper. I can manage to transform my results into the paper ones but don't know why.

But this doesn't work with the pyrifos data also shown in the paper.

require(vegan)
data(pyrifos)
week <- gl(11, 12, labels=c(-4, -1, 0.1, 1, 2, 4, 8, 12, 15, 19, 24))
dose <- factor(rep(c(0.1, 0, 0, 0.9, 0, 44, 6, 0.1, 44, 0.9, 0, 6), 11))
# PRC
pyr_mod <- prc(pyrifos, dose, week)
sum_prc <- summary(pyr_mod,scaling=1)

-(sum_prc$coefficients)/(sqrt(nrow(pyrifos))/2)

             -4          -1         0.1            1           2           4           8
0.1 -0.01256499 -0.02392853 -0.01775398 -0.007080778 -0.03656835 -0.02375052 -0.01405983
0.9 -0.01411007 -0.03368819 -0.03370698 -0.083033930 -0.08663741 -0.07496010 -0.02355583
6   -0.02892427 -0.02144969 -0.07901450 -0.201299463 -0.18861269 -0.20037965 -0.09767765
44  -0.02433359 -0.03409077 -0.12721855 -0.219491064 -0.22591242 -0.25461471 -0.22480271
             12          15          19          24
0.1 -0.02673332 -0.01954409 -0.03765507 -0.01363846
0.9 -0.06175443 -0.04191742 -0.03057287 -0.02688173
6   -0.08177758 -0.05358336 -0.05731724 -0.03172990
44  -0.17548289 -0.13610829 -0.10041584 -0.05400252

From my experience I would suggest the sum of all eigenvalues as scaling constant for sites.

sum_prc2 <- summary(pyr_mod, scaling = 1,
        const = c(sqrt(nrow(pyrifos)), sqrt(pyr_mod$tot.chi))) 

This comes Fig. 3 in the paper really close:

## coef
sum_prc2$coefficients

            -4        -1       0.1          1         2         4          8        12        15
0.1 0.08796762 0.1675239 0.1242957 0.04957259 0.2560154 0.1662776 0.09843299 0.1871602 0.1368284
0.9 0.09878469 0.2358513 0.2359829 0.58132125 0.6065493 0.5247963 0.16491458 0.4323433 0.2934642
6   0.20249903 0.1501695 0.5531812 1.40929925 1.3204791 1.4028596 0.68384205 0.5725255 0.3751376
44  0.17035965 0.2386698 0.8906581 1.53665881 1.5816147 1.7825597 1.57384571 1.2285572 0.9528953
           19         24
0.1 0.2636234 0.09548298
0.9 0.2140409 0.18819922
6   0.4012785 0.22214133
44  0.7030122 0.37807207

## species
sort(sum_prc2$sp)[c(1:5, 172:178)]

  caenhora      NauLa   cloedipt      Strvi   mystloni      Colob      Cepsp      Colun   erpoocto 
-4.7327096 -3.9771836 -3.8844344 -2.5187993 -2.4603371  0.5932877  0.6643014  0.6796789  0.7393253 
  olchaeta      amosp   binitent 
 0.9560482  1.1140080  1.6004508 

With the artifial data I get, which is again quite close to Fig. 5.

spdta <- structure(list(`Species 1` = c(100, 100, 100, 100, 100, 110, 
                                        110, 137.5, 157.143, 157.143, 120, 120, 
                                        150, 240, 240, 130, 130, 144.444, 
                                        216.667, 216.667), 
                        `Species 2` = c(100, 100, 100, 100, 100, 90, 90, 90, 90,
                                        90, 80, 80, 80, 80, 80, 100, 100, 100, 
                                        100, 100), 
                        `Species 3` = c(100, 100, 100, 100, 100, 120, 120, 96, 
                                        84, 84, 140, 140, 112, 70, 70, 160, 160, 
                                        144, 96, 96), 
                        `Species 4` = c(100, 100, 100, 100, 100, 90, 90, 72, 63, 
                                        63, 80, 80, 64, 40, 40, 70, 70, 63, 42, 
                                        42), 
                        `Species 5` = c(200, 200, 200, 200, 200, 240, 240, 
                                        153.6, 117.6, 117.6, 240, 240, 153.6, 
                                        60, 60, 200, 200, 162, 72, 72), 
                        `Species 6` = c(100, 100, 100, 100, 100, 130, 130, 83.2, 
                                        63.7, 63.7, 70, 70, 44.8, 17.5, 17.5, 
                                        100, 100, 81, 36, 36)), 
                   .Names = c("Species 1", "Species 2", "Species 3", "Species 4", 
                              "Species 5", "Species 6"), 
                   class = "data.frame", 
                   row.names = c(NA, -20L))

week <- gl(4, 5, labels=0:3)
dose <- factor(rep(c("C","C","L","H","H"), 4),levels=c("C","L","H"),ordered=FALSE)
mod <- prc(response = log(spdta), treatment = dose, time = week)

summary(mod, scaling = 1, const = c(sqrt(nrow(spdta)), sqrt(mod$tot.chi)))

Call:
prc(response = log(spdta), treatment = dose, time = week) 
Species scores:
Species 1 Species 2 Species 3 Species 4 Species 5 Species 6 
   -1.348     0.000     1.348     1.348     2.697     2.697 

Coefficients for dose + week:dose interaction
which are contrasts to dose C 
rows are dose, columns are week
          0       1       2        3
L 1.705e-16 -0.1698 -0.1698 -0.08017
H 6.930e-17 -0.2714 -0.5274 -0.38868

Don't know what scaling is used in CANOCO :(

To Question 2:

Also, the paper gives the standardized canonical coefficients rdt and standard deviations sdt. How do I get this piece of information from my PRC results?

Just a few thoughs on this (maybe I am completly wrong...):

As far as I know (correct me if wrong) CANOCO uses a iterative ordination algorithm. vegan uses another algorithm (singular value decomspostion), therefore you don't get regression coefficients.

Maybe you can use ?as.mlm to get these (although the results are not identical to the paper)?

summary(as.mlm(mod))

Call:
lm(formula = x$CCA$wa ~ . - 1, data = as.data.frame(X))

Residuals:
       Min         1Q     Median         3Q        Max 
-1.724e-16 -9.112e-17 -8.924e-17 -8.737e-17 -6.057e-18 

Coefficients:
                 Estimate Std. Error    t value Pr(>|t|)    
Zweek1          1.974e-01  1.394e-16  1.416e+15   <2e-16 ***
Zweek2          3.392e-01  1.394e-16  2.434e+15   <2e-16 ***
Zweek3          2.375e-01  1.394e-16  1.704e+15   <2e-16 ***
`Xweek0:doseL`  2.301e-16  1.707e-16  1.348e+00    0.211    
`Xweek1:doseL` -2.352e-01  1.707e-16 -1.378e+15   <2e-16 ***
`Xweek2:doseL` -2.352e-01  1.707e-16 -1.378e+15   <2e-16 ***
`Xweek3:doseL` -1.110e-01  1.707e-16 -6.505e+14   <2e-16 ***
`Xweek0:doseH`  4.684e-17  1.394e-16  3.360e-01    0.745    
`Xweek1:doseH` -3.759e-01  1.394e-16 -2.697e+15   <2e-16 ***
`Xweek2:doseH` -7.305e-01  1.394e-16 -5.241e+15   <2e-16 ***
`Xweek3:doseH` -5.383e-01  1.394e-16 -3.862e+15   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.394e-16 on 9 degrees of freedom
Multiple R-squared:      1, Adjusted R-squared:      1 
F-statistic: 4.68e+30 on 11 and 9 DF,  p-value: < 2.2e-16

Maybe Jari or someone else can give better advice on this.

Hope this helps, Cheers from Landau,

Edi

jarioksa commented 10 years ago

I don't have the paper now. It is somewhere in my office, but I was just relocated to a new building, and digging up the paper may take time. Therefore I cannot compare the results directly to the vegan output. However, here some points of view.

The current prc function in in vegan is written by Cajo ter Braak, and he assured that it will give identical results to CANOCO with scaling 1. As far as I remember (this happened in November 2011), Cajo said that other scaling could not be reproduce as easily.

CANOCO 3 and CANOCO 4 have different scalings. So the results will change there as well. However, Cajo told me that most of the CANOCO scalings must be undone for PRC so that this may not be a key point.

The largest difference in scalings is that CANOCO scales RDA/PCA so that sum of all eigenvalues = 1, whereas vegan scales their sum to sum of unbiased variances of species (unbiased = divisor is n-1). Vegan can further scale the scores in display to a biplot scaling. This has been explained in the vignette on decisions-vegan.pdf (use vegandocs("dec") to read the vignette). However, we redo the coefficients in vegan prc so that directly obtained coefficients of the result object and those displayed in the output will differ. Be careful.

As to t-values etc: Edi told you how to get those with as.mlm(). However, in your example these hardly make sense and hardly can be reproduced: your example has complete fit, and there is no residual variation, except what is caused by round off errors. I cannot see how you can get any of those statistics with these models, except as a random noise.

zhenglei-gao commented 10 years ago

@EDiLD

First, I am not trying to reproduce the results in CANOCO but the results in the paper. The purpose is to have a GLP validated program to perform the PRC analysis. For that I need to show the authorities ( in the plant protection products area, who only accept CANOCO now) that using prc in vegan on my GLP-validated laptop can produce exactly(almost exactly) results in the literature where they have their results obtained from CANOCO, and apparently not with CANOCO 5 but with CANOCO 3 and 4. Their 1999 and 2002 results are using different constant themselves.

For the first question, below is what I have. Although your constant has produced very similar results to the literature ones, but my constant is closer! And I still don't understand one.


## Canonical Coefficients:

sum_prc <- summary(mod, scaling = 1, const = -4/sqrt(20))
prc.R1 <- c((rbind(rep(0, 4), sum_prc$coefficients)))
names(prc.R1) <- c(sapply(0:3, function(i) paste(c("C", "L", "H"), "-W", i, 
    sep = "")))
sum_prc <- summary(mod, scaling = 1, const = -4/sqrt(6))
prc.R2 <- c((rbind(rep(0, 4), sum_prc$coefficients)))
sum_prc <- summary(mod, scaling = 1, const = c(sqrt(nrow(spdta)), sqrt(mod$tot.chi)))
prc.R3 <- c((rbind(rep(0, 4), sum_prc$coefficients)))
rdt <- c(0, 0, 0, 0, 0.1805, 0.397, 0, 0.1805, 0.7716, 0, 0.0852, 0.5686)
sdt <- c(1e-10, 1e-10, 1e-10, 1e-10, 0.2179, 0.3, 1e-10, 0.2179, 0.3, 1e-10, 
    0.2179, 0.3)
Cdt <- 0.2 * rdt/sdt
names(Cdt) <- c(sapply(0:3, function(i) paste(c("C", "L", "H"), "-W", i, sep = "")))
prc1.1999 <- Cdt
prc1.2002 <- 0.364917 * rdt/sdt
prc1 <- cbind(prc.R1, prc1.1999, prc.R2, prc1.2002)
colnames(prc1) <- c("R, constant= -4/sqrt(20)", "van den Brink et al, 1999", 
    "R,constant=-4/sqrt(6)", "Ter Braak et al.,2002")
prc1.1 <- cbind(prc.R1, prc1.1999, prc.R3, prc.R2, prc1.2002)
colnames(prc1.1) <- c("R, constant= -4/sqrt(20)", "van den Brink et al, 1999", 
    "Eduard const", "R,constant=-4/sqrt(6)", "Ter Braak et al.,2002")
prc1.1
##      R, constant= -4/sqrt(20) van den Brink et al, 1999 Eduard const
## C-W0                0.000e+00                    0.0000    0.000e+00
## L-W0               -1.197e-16                    0.0000    9.313e-17
## H-W0               -8.419e-17                    0.0000    4.342e-17
## C-W1                0.000e+00                    0.0000    0.000e+00
## L-W1                1.656e-01                    0.1657   -1.698e-01
## H-W1                2.647e-01                    0.2647   -2.714e-01
## C-W2                0.000e+00                    0.0000    0.000e+00
## L-W2                1.656e-01                    0.1657   -1.698e-01
## H-W2                5.144e-01                    0.5144   -5.274e-01
## C-W3                0.000e+00                    0.0000    0.000e+00
## L-W3                7.819e-02                    0.0782   -8.017e-02
## H-W3                3.791e-01                    0.3791   -3.887e-01
##      R,constant=-4/sqrt(6) Ter Braak et al.,2002
## C-W0             0.000e+00                0.0000
## L-W0            -3.504e-16                0.0000
## H-W0            -2.349e-16                0.0000
## C-W1             0.000e+00                0.0000
## L-W1             3.023e-01                0.3023
## H-W1             4.832e-01                0.4829
## C-W2             0.000e+00                0.0000
## L-W2             3.023e-01                0.3023
## H-W2             9.391e-01                0.9386
## C-W3             0.000e+00                0.0000
## L-W3             1.427e-01                0.1427
## H-W3             6.921e-01                0.6916
zhenglei-gao commented 10 years ago

@EDiLD and @jarioksa

For the 2nd question, I am not sure we are talking about the same standard deviation. The standard deviation listed in the paper is sdt above, which seems to me more related to the design matrix, but I cannot reproduce it. And also they use rdt*sdt*const to get the canonical coefficients. The const for 1999 paper is 0.2 but for 2002 0.364917.

eduardszoecs commented 10 years ago

@zhenglei-gao :

First, I am not trying to reproduce the results in CANOCO but the results in the paper. The purpose is to have a GLP validated program to perform the PRC analysis. For that I need to show the authorities ( in the plant protection products area, who only accept CANOCO now) that using prc in vegan on my GLP-validated laptop can produce exactly(almost exactly) results in the literature where they have their results obtained from CANOCO, and apparently not with CANOCO 5 but with CANOCO 3 and 4. Their 1999 and 2002 results are using different constant themselves.

Will vegan be GLP validated or only your laptop (with the installed version of vegan)? Is CANOCO GLP? Would be good if we could use R for registration...

The current prc function in in vegan is written by Cajo ter Braak, and he assured that it will give identical results to CANOCO with scaling 1. As far as I remember (this happened in November 2011), Cajo said that other scaling could not be reproduce as easily.

Thats why we should compare to results from CANOCO (and not from the paper - since not all information/output is given there/rounding errors/etc...).

Do you have a CANOCO 3 license? Can you reproduce the results from the paper with CANOCO? [And share the results?] Also the pyrifos-data is nice to play with/compare results.

What paper is the Ter Braak et al.,2002 one? Or is it the CANOCO manual?

From my point of view this is just a scaling issue and not that important for interpretation of results, but I don't know if this can be communicated. Also not that you'll also need to adapt the permutation scheme...

Maybe I can take a closer look on this issue by the end of the week....

jarioksa commented 10 years ago

Obviously you need to develop a test case. First, you should work out what the results should be, and then see how to get those results with vegan. There will be some limitations: we only have one type of scaling of scores in vegan::prc, and I have no idea how to expand the scaling options. Cajo said it would need some work, but that is a work I'm not interested in. Somebody else should do it.

If you are looking at modifiers (constants) for numbers, you should not do that by trial and error: you can find a constant for one case, but then you need new trial and error for the next data set. You really must look at the code to see how things were done. For instance, the coefficients you get with coef(mod) and in summary(coef(mod)) will be different (and the plot() uses coef from summary). The standard coef(mod) uses coefficients that apply for orthonormal LC scores of sites, whereas summary(coef(mod)) uses coefficients that apply to scaling=3 transformation of those scores. It may be that only this scaling=3 will be consistent with similarly scaled CANOCO PRC, but I don't know. If the published results are based on some other scaling, we may not be able to reproduce them in vegan. Then you should understand something about scaling of RDA results in vegan. For this you should read the vegan vignetted decision-vegan.pdf which explains how vegan finds the constant it uses, and how you change that choice vegan made and set up your own constant. Then you only need to figure out how they did that in the paper you were reading (or in CANOCO which may be different). The document was written to be read by those who need its information. Please read it before your blind trials on finding constants.

Finally about test cases: your test case is certainly inadequate because you run out of degrees of freedom.

zhenglei-gao commented 10 years ago

@EDiLD ,

Will vegan be GLP validated or only your laptop (with the installed version of vegan)? Is CANOCO GLP? Would be good if we could use R for registration...

For GLP-validated softwares, I think it depends on the company policy. In my case, I need to provide the QA(Quality Assurance) Unit a validation plan and they approve it, then provide the test results to show that I can produce same results as expected in the validation plan. So vegan is only validated on that specific laptop for the version I used. I have a CANOCO 4.5 on my computer, it is also not GLP-validated. To make it validated I have to go through the same procedure. But because it is more often used in this area so there is standard validation plan for it, in which they use the simulated data.

Do you have a CANOCO 3 license? Can you reproduce the results from the paper with CANOCO? [And share the results?] Also the pyrifos-data is nice to play with/compare results. I have only CANOCO 4.5 license but I can reproduce the results and share them(I have tried it once. There are many clicks to go through). I will try to do it later this week also .

What paper is the Ter Braak et al.,2002 one? Or is it the CANOCO manual? I think it is the CANOCO Manual.

I agree that it might be just a scaling issue but it is better if I can provide a very clear description for the QA people and the authorities.

eduardszoecs commented 10 years ago

I have a CANOCO 4.5 on my computer, it is also not GLP-validated.

Can you run the pyrifos data (supplied with vegan) through CANOCO as a test-case? And provide the results?

jarioksa commented 10 years ago

About validating vegan: vegan is open source so you can read its code. The prc and related code is pretty short and it is not so difficult to read. Here my summary of things that you should do to understand the results:

1) the proper vegan::prc() function is only an interface that packs the variables (response, treatment, time) in a suitable matrices and calls RDA. The RDA is similar as CANOCO output if you look at the orthonormal scores (i.e., those for which sum of squares = 1 for each axis, and axes are uncorrelated). The differences to CANOCO come from scaling the results.

2) You need to understand the scaling. In CANOCO you define the scaling before the analysis and it is fixed for the result. In vegan, the results are scaled when you ask them, and you can change the scaling. The scaling is briefly explained in ?scores.rda help page, and more fully in the vignette decision-vegan.pdf (see ?vegandocs).

3) what is special in vegan::prc and differs from vegan::rda is that [canonical] coefficients are found with respect to scaled results (so you need to study scaling), whereas in vegan::rda they are found with respect to the orthonormal (unscaled) results. The standard coef() function for the vegan::rda results will give you different results than those used in prc plots or summaries. If mod is your prc result, you should use coef(summary(mod)) instead of coef(mod). The coefficients so defined will be dependent on scaling and constant so that you should set those in the call to summary (e.g. coef(summary(mod, scaling=3, const=myconstant))). In all cases, you should work out the const from the principles instead of trial and error.

jarioksa commented 10 years ago

pyrifos data is in CANOCO, and you can run it directly. It was included in vegan with the permission of Cajo ter Braak. I think CANOCO manual explains how to do PRC with pyrifos.

zhenglei-gao commented 10 years ago

@jarioksa,

I will communicate with the QA to see if I can switch to or add pyrifos test case.

The constant modifiers I tried are actually based on the vegan vignetted decision-vegan.pdf. I couldn't figure it out from the principles and then I used trial and error which produces the number 4 . I still need to understand why. I wish I can also read the CANOCO code so that I can understand how they do it.

About the pyrifos data, I will also do it in CANOCO 4.5 and share the results here.

jarioksa commented 10 years ago

Exactly 4 does not make any sense to me (although the rank of log(spdata) is 4). However, about 4 (more exactly 3.9975) is the constant that you need to get the standard PCA scaling (prcomp, princomp row in Table 2 in decision-vegan.pdf): sqrt((n-1) * sum-of-all-eigenvalues). In this case with your mod, const = sqrt((nrow(spdta)-1) * mod$tot.chi). For standard PCA scaling you also should set scaling = 1.

zhenglei-gao commented 10 years ago

@jarioksa

Now you mentioned it, it seems to make sense to me. I can try that with pyrifos data. I don't have the laptop with me. I will update later whether I can get same results with pyrifos.