Methods to identify the number of nontrivial axes in wcmdscale

eduardszoecs commented 10 years ago

Paul and Anderson (2013) discuss 4 methods to assess how many numbers of axis might be usefull/interpretable/[putwhatyouwanthere]

a) broken-stick [implemented in vegan] b) bootstrap CI for eigenvalues c) bootstrap from null model d) variation of c)

I wondered if b)-c) are already implemented in vegan? And if not, if it's worth implementing them?

I have somewhere on my machine a more veganized version of their code...

Paul WL, Anderson MJ (2013) Causal modeling with multivariate species data. Journal of Experimental Marine Biology and Ecology 448:72–84. doi: 10.1016/j.jembe.2013.05.028

jarioksa commented 10 years ago

The methods that Paul & Anderson discuss look pretty simple and could be implemented. I'm not fond of these methods because I don't think they work. With "not working" I mean that I don't think that they usually give the correct number of "non-trivial" axes even when such a number exists. This also applies to bstick and screeplot that we now have in vegan. @gavinsimpson seems to think differently, though, and he may have some ideas. However, as we have bstick we could also have other simple alternatives.

An ecological classic on the number of axes is

Jackson, DA (1993) Stopping rules in principal components analysis: a comparison of heuristical and statistical approaches. Ecology 74, 2204-2214.

Paul & Anderson cite this paper and say they are implementing one of his heuristic methods.

Then about the aspect of "not working" and why I think so. You can use simulate.rda and simulate.capscale functions to produce random matrices of known number of "non-trivial axes" (rank) plus random error. Here I generate 2-dimensional data with capscale:

> library(vegan)
> data(mite)
> m <- capscale(vegdist(wisconsin(sqrt(mite))) ~ 1)
> m2 <- capscale(vegdist(wisconsin(sqrt(mite))) ~ scores(m, dis="sites", choice=1:2))  
> signbstick <-
function (model, rank="full") 
{
    mod <- capscale(simulate(model, rank=rank) ~ 1)
    max(min(which(bstick(mod) > eigenvals(mod))) - 1, 0)
}
> table(replicate(100, signbstick(model=m2)))

 2  3  4  5  6 
12 24 44 17  3

This generates data with two axes plus independent random normal error and estimates the residual variance from the residuals of the model (there are other choices). In capscale simulations we ignore negative eigenvalues, though. This gives the "correct" number of axes only in 12/100 cases, and the modal case is four axes. However, if we reduce the residual random variation, we get correct results:

> m5 <- capscale(vegdist(wisconsin(sqrt(mite))) ~ scores(m, dis="sites", choice=1:5))
> table(replicate(100, signbstick(model=m5, rank=2)))

 2  3 
94  6

It is not only the number of systematic axes, but the signal/noise ratio: the non-trivial axis must be strong relative to residual variation. Unfortunately, if that is not the case, bstick works in a wrong way and suggests too high a number of axes. Similar tests could be useful for alternative methods, too.

eduardszoecs commented 10 years ago

Thanks Jari for your thoughts (and the ref to jackson 1993)!

Peres-Neto, P. R., D. A. Jackson, and K. M. Somers. “How Many Principal Components? Stopping Rules for Determining the Number of Non-Trivial Axes Revisited.” Computational Statistics & Data Analysis 49, no. 4 (2005): 974–97.

also discusses a lot of methods. Most of them should be easy to implement.

Your examples are about constrained ordinations (capscale()) - do your critics also apply to unconstrained ordinations?

jarioksa commented 10 years ago

Here is another paper that I found in last spring (northern hemisphere) and found then useful, mainly in explaining why things like broken stick should underestimate first eigenvalues, and for underlying geometry in particular:

Johnstone IM (2007) High dimensional statistical inference and random matrices. Proceedings of the International Congress of Mathematicians, Madrid, Spain, pages 307-333.

This is a conference paper, but I found this on the Internet so that it should be available.

My example above actually studied unconstrained ordination: it only constrained the ordination by first unconstrained axes to get the desired dimensionality to the systematic part, and to get matching random variation.

gavinsimpson commented 10 years ago

I don't think differently; I consider these at best a guide to the potential/likely dimensionality, not some infallible, all-knowing method that can discern the will of <_insert favourite deity_>.

Bootstrapping eigenvalues seems at first blush a great solution until you realise that all this gives you is say a confidence interval on the eigenvalues and some measure of bias in the values you got for your data. You still have the problem of saying whether the eigenvalues +/- the bootstrap CI are meaningful or not.

I need to revisit Pedro's paper to recall what they did. If we think these are useful as guides and perhaps may perform better than bstick() we should consider implementing them but with big caveats. If anything, having a well-implemented version of these methods in vegan would make it easier to use in teaching to illustrate that these should not be trusted or take at face value.

gavinsimpson commented 10 years ago

I would add that the more formal approaches to doing unconstrained ordination, such as using latent variable models, would allow a real statistical (as opposed to an heuristical) solution to the problem as then we'd have likelihoods to generate an LRT, or use information criteria, to choose between models with different numbers of latent variables (axes).

The longer-term usage of vegan will either decline as these new-fangled methods gain traction (which undoubtedly they will, eventually), or not, if we choose to add vegan-esque implementations. Thoughts?

jarioksa commented 7 years ago

This issue was raised in StackOverflow.

vegandevs / vegan

Methods to identify the number of nontrivial axes in wcmdscale #78