vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

AIC calculation from cca/rda objects #400

Open jonpeake opened 3 years ago

jonpeake commented 3 years ago

I'm having some trouble understanding what is going on in the extractAIC.cca function when it is used with the step() function. The documentation claims that cca and rda methods don't have AIC, and that ordistep should be used instead of step. However, Legendre and Anderson's 1999 paper on dbRDA and the subsequent paper by Godínez-Domínguez and Freire (2003), titled "Information-theoretic approach for selection of spatial and temporal models of community organization", give a logical framework for calculating AIC and using it as a selection criterion for model building in investigations of ecological communities. Indeed, you cite the Godínez-Domínguez and Freire paper in the documentation for the extractAIC.cca function. However, you state the following in the details of the function:

"The functions find statistics that resemble deviance and AIC in constrained ordination. Actually, constrained ordination methods do not have a log-Likelihood, which means that they cannot have AIC and deviance. Therefore you should not use these functions, and if you use them, you should not trust them. If you use these functions, it remains as your responsibility to check the adequacy of the result."

I am curious where this statement comes from. I can't seem to find any papers discrediting the framework from Godínez-Domínguez and Freire. Any information you might have would be incredibly helpful.

jarioksa commented 3 years ago

Looking at the history, I implemented the extractAIC function in 2004 and haven't touched it much since then, so my memories are not quite fresh. However, the statement in the vegan help still seems sound: information criteria are based on log-Likelihood, but CCA and RDA do not have a log-Likelihood because they are not Maximum Likelihood methods. I don't think any of the papers you cite claims that there is a log-Likelihood. What they suggest, and what vegan does, is to use a statistic (deviance) that looks likelihood-based but is not. In general, the AIC criterion seems to work quite nicely when you have 1-df constraints, but the degrees-of-freedom penalty for multi-level factors may not be quite as neat. I don't want to discredit the paper you cited, but I'd rather like to have support for its method. Compared against other methods (such as ordistep) it often gives similar results, but with multilevel factors the results can diverge. The help text in vegan that you cite just urges you to use permutation-based approaches, which are more robust. It is just this experience of discrepancies that is behind these statements. (I don't care about missing papers if my own studies show something else. And don't say these experiences should be published: they would not be accepted in any decent journal, because they would be regarded as uninteresting. I'm experienced, like Jimi Hendrix put it.)

jonpeake commented 3 years ago

Fair enough. One more question: I was digging into the source code a little bit, and for the rda deviance (which is used as the RSS in calculating AIC) the calculation is the sum of the residual eigenvalues multiplied by the number of observations minus one. In the original 2003 Godínez-Domínguez and Freire paper (and the Legendre and Anderson 1999 paper), the RSS is the sum of all eigenvalues minus the sum of the canonical eigenvalues (which should equal the sum of the residual eigenvalues). Is there a reason for multiplying this by the number of observations minus one?
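For context on what the extra factor can and cannot change, here is a minimal sketch. It assumes the AIC-style statistic has the usual least-squares form n * log(RSS / n) + 2 * k; it is not vegan's source code, and the model names, eigenvalue sums and parameter counts are made up for illustration. Under that assumption, multiplying every RSS by the same constant (n - 1) shifts all values by n * log(n - 1), so the ranking of models on the same data stays the same.

```python
import math


def aic_like(rss, n, k):
    """Least-squares AIC-style statistic: n * log(rss / n) + 2 * k."""
    return n * math.log(rss / n) + 2 * k


n = 20  # hypothetical number of observations

# Hypothetical models: (sum of residual eigenvalues on the variance scale,
# number of estimated parameters). All values are invented.
models = {"m1": (3.20, 2), "m2": (2.90, 3), "m3": (2.85, 5)}

# Statistic computed directly from the variance-scaled eigenvalue sums.
variance_scale = {name: aic_like(ev, n, k) for name, (ev, k) in models.items()}

# Statistic computed after multiplying each sum by (n - 1).
ss_scale = {name: aic_like(ev * (n - 1), n, k) for name, (ev, k) in models.items()}

# Every model is shifted by the same constant.
shift = n * math.log(n - 1)

for name in models:
    print(name, round(variance_scale[name], 3), round(ss_scale[name], 3))
```

The difference between the two columns is the same for every model, so the factor matters for the absolute value of the statistic but not for which model is preferred.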

jarioksa commented 3 years ago

The eigenvalues in vegan are based on scaled data, so they add up to the variance instead of the raw sum of squares. We undo this scaling by multiplying by the number of observations minus one.
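The relationship between the two scales can be checked on a toy example. The sketch below is an illustration in plain Python, not vegan code: the community matrix Y and the single constraint x are invented, and it relies on the fact that the sum of the eigenvalues of a covariance matrix equals its trace, so no eigen decomposition is needed.

```python
import statistics

# Toy data: 5 sites (rows) x 2 species (columns). All values are made up.
Y = [[1.0, 4.0], [2.0, 3.0], [4.0, 5.0], [3.0, 1.0], [5.0, 2.0]]
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # one hypothetical constraining variable

n = len(Y)


def residuals(y, x):
    """Residuals of the least-squares fit y ~ 1 + x."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return [(yi - my) - slope * (xi - mx) for xi, yi in zip(x, y)]


# Residuals for each species after removing the effect of the constraint.
res = [residuals(list(col), x) for col in zip(*Y)]

# Residual sum of squares over all species.
rss = sum(r ** 2 for col in res for r in col)

# Trace of the residual covariance matrix (divisor n - 1), which equals
# the sum of the residual eigenvalues on the variance scale.
resid_variance = sum(statistics.variance(col) for col in res)

print(resid_variance * (n - 1), rss)
```

The two printed numbers agree up to floating-point error: the variance-scaled total times (n - 1) recovers the residual sum of squares.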