vegandevs / vegan

R package for community ecologists: popular ordination methods, ecological null models & diversity analysis
https://vegandevs.github.io/vegan/
GNU General Public License v2.0

Oecosimu/bipartite C.score #116

Closed vicar66 closed 9 years ago

vicar66 commented 9 years ago

Hi! I am quite new to bioinformatics and I have difficulty understanding the output of oecosimu. What do the values of "statistic", the different percentages, and "Pr(sim)" mean? I was told that Pr(sim) = 0.014 means that 1.4% is explained by randomization, but I am not entirely sure.

About the C-score: I read that the size of the data set influences its value, with larger data sets giving lower C-scores. One paper found that many randomization models rejected the null hypothesis with the uncorrected C-score, but that the null hypothesis was no longer rejected once the score was corrected. I am studying microbial communities, so I was wondering whether such corrections exist, as I have a large data set (1094 OTUs for 20 samples).

Thank you!

jarioksa commented 9 years ago

The statistic is the statistic you calculated for your data. In this case it should be the C-score.
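For readers unfamiliar with the statistic, here is a minimal sketch of the C-score as it is usually defined (Stone and Roberts): for each pair of species, the number of checkerboard units is (r_i - S)(r_j - S), where r_i and r_j are the numbers of sites each species occupies and S is the number of sites they share; the C-score is the mean over all pairs. This is not the code of vegan or bipartite. The function name `c_score`, the species-by-site orientation, and the toy matrix are assumptions made for illustration, and packages may rescale the value (for example to the range 0 to 1), so the numbers need not match their output.

```python
from itertools import combinations


def c_score(matrix):
    """Mean number of checkerboard units over all species pairs.

    `matrix` is a list of rows, one row per species, holding 0/1
    presence values for each site (orientation is an assumption here).
    """
    units = []
    for a, b in combinations(matrix, 2):
        shared = sum(1 for x, y in zip(a, b) if x and y)  # sites with both species
        r_a, r_b = sum(a), sum(b)                         # sites occupied by each species
        units.append((r_a - shared) * (r_b - shared))
    return sum(units) / len(units)


# Hypothetical toy data: 3 species (rows) x 4 sites (columns)
toy = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
]
print(c_score(toy))  # 2.0
```

The first two species never co-occur, so they contribute 2 x 2 = 4 checkerboard units; each of the other two pairs contributes 1.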

You simulate data using null models. The percentiles (2.5%, 50%, 97.5%) describe the distribution of your statistic under the null model.

Pr(sim) is the P-value. It gives the proportion of simulated values that are as extreme as, or more extreme than, the statistic (C-score) you calculated from your data (what counts as "more extreme" depends on the direction of your test; see the argument alternative). If there are few such extreme simulated cases, it is customary to say that your observed statistic differs significantly from the null model. In this case you got P = 0.014, which is lower than the limit P = 0.05 that many people use as a threshold for a "significant result". You may see this as an order statistic: assume you had 999 null models and evaluated the statistic for each of them. Then you order these null model statistics plus your observed statistic from the most extreme to the least extreme. Your observed statistic will be at rank 14 on this list: there are 13 more extreme simulated values and 986 less extreme ones. Because your observed statistic is much more extreme than the bulk of the simulated values, you say that it is unlikely to come from the null models.
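The rank argument above can be sketched in a few lines. This is an illustration of the arithmetic only, not vegan's implementation; the function name `sim_p_value` and the simulated values are made up, and the test is assumed to be one-sided with large values counting as extreme.

```python
def sim_p_value(observed, simulated):
    """Rank-based P-value: the observed value is counted among the candidates.

    Assumes a one-sided test in which large values are "extreme".
    """
    as_extreme = sum(1 for s in simulated if s >= observed)
    return (as_extreme + 1) / (len(simulated) + 1)


# Made-up numbers matching the example: 999 simulations,
# 13 of them at least as large as the observed statistic.
observed = 10.0
simulated = [11.0] * 13 + [5.0] * 986

p = sim_p_value(observed, simulated)
print(p)  # 0.014
```

Adding 1 to both the count and the number of simulations places the observed statistic in the ordered list, which is why it ends up at rank 14 out of 1000.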

As to "correcting" C-scores: check the literature. However, the simulation is based on the same data as the calculation of the C-score. I would not be surprised if larger data sets give more significant results. I put "correcting" in quotes because I do not know what was changed in the C-scores. Probably the calculation was changed in some way (unknown to me), but I have no idea what this change corrects or how.

jarioksa commented 9 years ago

I was intrigued by the idea of correcting the C-score. I searched the literature on the web, but could not find any obvious source for this correction. Could you provide a reference? (The only correction I found was about correcting sequential null models when using the C-score.)

vicar66 commented 9 years ago

Hi Jari! Really sorry for my late reply. Thanks for the detailed answer, it is much easier to understand now! But are the other statistics (z, mean, etc.) relevant for further assessing non-randomness? Basically, I have microbial communities and I first want to test whether their distribution across different depths and locations is random or not.

For the correction, I lost track of the paper I was reading but when I get it again, I will post it here! Thanks again!

vicar66 commented 9 years ago

          statistic      z    mean    2.5%     50%  97.5% Pr(sim.)
statistic    0.1986 1.8801 0.19803 0.19761 0.19801 0.1986     0.03 *
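As a reading aid for the z and mean columns: z is commonly a standardized effect size, the observed statistic minus the mean of the simulated values, divided by their standard deviation. The sketch below assumes that definition and the sample standard deviation; the function name `standardized_effect` is hypothetical and this is not vegan's code. It also backs out the spread of the simulations implied by the row above.

```python
from statistics import mean, stdev


def standardized_effect(observed, simulated):
    """(observed - mean of simulations) / sample sd of simulations."""
    return (observed - mean(simulated)) / stdev(simulated)


# Values copied from the output row above.
table_statistic, table_z, table_mean = 0.1986, 1.8801, 0.19803

# Standard deviation of the simulated values implied by z (assumed definition).
implied_sd = (table_statistic - table_mean) / table_z
print(round(implied_sd, 6))  # 0.000303

# Hypothetical toy check of the function itself.
print(standardized_effect(4.0, [1.0, 2.0, 3.0]))  # 2.0
```

Under that assumption, the observed value lies a little under two standard deviations above the simulated mean, and it coincides with the 97.5% quantile shown in the row.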

I have rerun oecosimu on another dataset (99 simulations). As I understand it, only 2 simulated values are more extreme here, so my observed statistic is unlikely to come from the null models, which points to possible distribution patterns.

But I am not sure I understand the word "extreme"!

gavinsimpson commented 9 years ago

But I am not sure I understand the word "extreme"!

What that means is unusually large or small. Consider a two-tailed t test; the rejection region for the test, at 95% confidence, lies 2.5% in the upper tail and 2.5% in the lower tail. In the upper tail, large positive t values are evidence against the null, but in the lower tail, large negative values are evidence against the null. If we said "as large or larger" in place of "extreme", some people would interpret that as large positive values only, which may not be correct depending on the type of test they are doing. Hence we use "extreme" to indicate a value that is unusually large in absolute value.

Also, what counts as extreme will depend on the confidence level of the test; 95%, 99%, 99.9% etc.
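To make the role of direction concrete, the sketch below evaluates one observed value against the same simulated values under three alternatives. It is an illustration under stated assumptions, not vegan's code: the function name `directional_p` and the data are hypothetical, and the two-sided rule used here (doubling the smaller tail count) is one common convention.

```python
def directional_p(observed, simulated, alternative="two.sided"):
    """Rank-based P-value for a chosen direction of the test."""
    lower = sum(1 for s in simulated if s <= observed)  # as small or smaller
    upper = sum(1 for s in simulated if s >= observed)  # as large or larger
    if alternative == "less":
        count = lower
    elif alternative == "greater":
        count = upper
    elif alternative == "two.sided":
        count = 2 * min(lower, upper)  # assumed convention: double the smaller tail
    else:
        raise ValueError("unknown alternative: " + alternative)
    return min(1.0, (count + 1) / (len(simulated) + 1))


# Hypothetical data: 99 simulated values 1, 2, ..., 99 and an observed
# value that only one simulation exceeds.
simulated = [float(v) for v in range(1, 100)]
observed = 98.5

for alt in ("greater", "less", "two.sided"):
    print(alt, directional_p(observed, simulated, alt))
```

The same observed value is extreme when large values count against the null, entirely ordinary when small values do, and somewhere in between when both tails are considered.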

jarioksa commented 9 years ago

Closed because of long lasting silence (probably solved).