vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

ordiR2step tossing the last factor #421

Open atfields opened 3 years ago

atfields commented 3 years ago

Hey Vegan devs,

I have recently been using ordiR2step to analyze my data, and I have found what I think is a small quirk in its behaviour (vegan 2.5-6). I have 4 independent variables, and when I run ordiR2step the first 3 are found to be significant. When ordiR2step comes to the last variable, it finds that the model with this variable has a higher R2.adjusted than the "All variables" term, so the variable is tossed. I find this strange because only 1 variable is not yet in the model.

I think it could be a floating point error, or something to do with the ordering of the variable names, but the numbers shown on screen suggest that the R2.adjusted values are equal for the candidate model and the "All variables" term. ordistep does find this variable to be significant, as does a conditional RDA.
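For reference, a minimal sketch of the kind of call I mean, built on the dune example data shipped with vegan rather than my own data (the variable set is just a stand-in):

```r
## Not my data set: a reproducible stand-in on the dune example data,
## just to show where the R2.adjusted comparison against the
## "All variables" row happens.
library(vegan)
data(dune, dune.env)

m0  <- rda(dune ~ 1, dune.env)             # null model
m1  <- rda(dune ~ ., dune.env)             # full scope ("All variables")
sel <- ordiR2step(m0, scope = formula(m1))
sel$anova                                  # the stepwise selection table
```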

I could supply my data set upon direct message if you need it for testing.

Thank you for considering this issue and for your CRAN package!

jarioksa commented 3 years ago

Floating point errors are typically of magnitude 1e-16 (and floating point errors are unavoidable in digital computers: a feature, not a bug). If the numbers look the same for the digits displayed, you should subtract the values to see their actual difference. It is quite possible that you are just that tiny epsilon above the critical level, and that stops your analysis. If this disturbs you, use argument R2scope = FALSE in ordiR2step and the R2 criterion will be ignored. After running the analysis, do calculate the difference, though. NB, even if you have only a single factor left, a factor can have several levels and use several degrees of freedom, so dropping one factor is a bigger thing than dropping a single vector variable.
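In code, the two suggestions might look roughly like this (a sketch on the dune example objects from the earlier snippet, not the reporter's data; the three-term model is only illustrative):

```r
## 1) Subtract the adjusted R2 values rather than comparing printed digits
m3 <- rda(dune ~ A1 + Moisture + Management, dune.env)  # illustrative 3-term model
RsquareAdj(m1)$adj.r.squared - RsquareAdj(m3)$adj.r.squared

## 2) Drop the adjusted-R2 stopping criterion altogether
sel2 <- ordiR2step(m0, scope = formula(m1), R2scope = FALSE)
sel2$anova
```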

atfields commented 3 years ago

Great. Thank you very much!

atfields commented 3 years ago

FYI, the difference I am getting is 1.110223e-16, which suggests a floating point error to me. While floating point errors are inherent to digital computers, I have seen code that I think takes this into account (e.g. ape::pcoa).
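For what it is worth, that value is exactly half of .Machine$double.eps, and a common way to treat such differences as zero is a tolerance check (plain R, not vegan-specific):

```r
.Machine$double.eps / 2        # 1.110223e-16, the reported difference

a <- 0.1 + 0.2
b <- 0.3
a - b                          # 5.551115e-17: not exactly zero
a == b                         # FALSE under exact comparison
isTRUE(all.equal(a, b))        # TRUE within all.equal()'s default tolerance
```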

jarioksa commented 3 years ago

We do sometimes add an epsilon in comparisons (for instance in vegan::wcmdscale, which is our PCoA), but it would not remove the issue in this case: it would only move the threshold to another position.
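To illustrate the kind of epsilon meant here (a sketch only, continuing with the dune objects from above; the actual rule inside vegan::wcmdscale may differ): eigenvalues that are zero in theory come out only near-zero numerically, and get compared against a tolerance scaled by the largest eigenvalue.

```r
## Sketch of a tolerance for "theoretically zero" PCoA eigenvalues;
## illustrative only, not the exact rule used internally by wcmdscale().
d   <- vegdist(dune)                 # Bray-Curtis dissimilarities
pc  <- wcmdscale(d, eig = TRUE)
eps <- sqrt(.Machine$double.eps) * max(abs(pc$eig))
pc$eig[abs(pc$eig) < eps]            # eigenvalues a tolerance would call zero
```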

jarioksa commented 3 years ago

Let me expand a bit more on using an epsilon in comparisons: we use it in cases where we expect values that are theoretically equal but can be numerically different due to floating point issues. For instance, in principal coordinates analysis we expect some zero eigenvalues (at least one, and in certain cases several), but numerically these are not exact zeros. In some permutation tests we may actually have only discrete points on the real line (because they are permutations of observations!), but numerically they can differ. Dissimilarities are often based on integer or otherwise discrete values, giving discrete sets of dissimilarities which numerically differ within floating point accuracy.

However, the only case where I would expect an adjusted R2 that is equal within epsilon to that of a model having one more variable is when the added variable is orthogonal (i.e., completely redundant) and has zero effective degrees of freedom. Would that be your case? I think that managing these cases is the user's responsibility, and if we reject that variable it is a happy accident (it should be done, but we cannot guarantee to do so). If that variable adds almost exactly the amount of information that is removed by adjusting R2 for degrees of freedom, adding an epsilon just moves the threshold but does not solve the issue. Anyway, this kind of function that puts strict limits on a continuous world is an abomination (as are all automatic tools of model building, including ordiR2step).
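To make the degrees-of-freedom point concrete, here is a made-up numeric illustration of the Ezekiel adjustment that RsquareAdj() uses for RDA; the numbers are invented so that the gain in raw R2 from a fourth constraint is exactly cancelled by the extra degree of freedom:

```r
## R2adj = 1 - (1 - R2) * (n - 1) / (n - m - 1), with n sites, m constraints
adjR2 <- function(R2, n, m) 1 - (1 - R2) * (n - 1) / (n - m - 1)

n <- 20
adjR2(0.50000, n, 3)   # 3 constraints, raw R2 = 0.5      -> 0.40625
adjR2(0.53125, n, 4)   # 4th constraint lifts raw R2, but
                       # the extra df cancels the gain     -> 0.40625
```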

atfields commented 3 years ago

Dear Dr. Oksanen,

I fear I may be out of my depth getting into the theory of eigendecomposition. If the added variable is orthogonal, I thought it would add 1 effective degree of freedom. If it were completely redundant (correlation of 1 with one of the other variables), it would add 0 effective degrees of freedom and 0 adjusted R2. If two variables are completely redundant, though, I am not sure how they could be orthogonal.

In my case, I compared the adjusted R2 of ordiR2step(R2scope = FALSE) to that of the rda including all of the variables, since I thought this would be the same comparison made in ordiR2step for the last variable. That is where I got the value I reported. The difference in adjusted R2 between 3 and 4 variables is 0.004, but this does not really answer whether there is a difference between the model including the 4th variable and the "All variables" term in ordiR2step. I am sorry for any confusion, but I appreciate your responses to this question.

I guess to get the full answer I may have to run the script piecemeal and see what is being calculated at this last step, so I can make a direct comparison between the model including the last variable and the "All variables" model. I just have not had time to dive into it that deeply yet; I was hoping my shortcut would answer the question, but maybe it just adds a layer of confusion.
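If it helps, the piecemeal check might look something like this (again sketched on the dune example objects rather than my data): refit the candidate model containing the last selected variable and the full-scope model, and subtract their adjusted R2 values directly. The difference should be zero, or at most on the order of floating point accuracy if the two fits differ only in term order.

```r
## Hypothetical stepwise term order vs. the scope's own order; both models
## contain the same terms, so any non-zero difference is floating point noise.
cand <- rda(dune ~ Management + A1 + Moisture + Use + Manure, dune.env)
full <- rda(dune ~ ., dune.env)
RsquareAdj(cand)$adj.r.squared - RsquareAdj(full)$adj.r.squared
```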


jarioksa commented 3 years ago

Sorry for being unclear: I intended to write about a variable that is orthogonal to the Y-variates and has nothing to do with them (not orthogonal among the X-variates), but then it should still have positive degrees of freedom. Anyway, I find it very hard to imagine a case like yours, where the improvement of the model is almost exactly cancelled out by the increase in model degrees of freedom in adjusted R2 – except as a curious and rare incident. However, it seems that dropping the variable was the right thing to do. I also interpret your message to mean that it would be dropped by permutation tests even if the adjusted R2 scope failed to do so. Automatic procedures like ordiR2step make strong dichotomies that are sometimes based on tiny, unimportant and incidental differences – and sequential stepping is even more unstable and unreliable when there are several almost equal alternative models.
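For completeness, the permutation-based alternative mentioned above might be sketched like this (on the dune example objects from earlier; Pin and the number of permutations are just illustrative choices): ordistep() selects terms on permutation P-values rather than on the adjusted-R2 scope.

```r
selP <- ordistep(m0, scope = formula(m1), direction = "forward",
                 Pin = 0.05, permutations = how(nperm = 999))
selP$anova   # selection table based on permutation tests
```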