joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
579 stars 188 forks source link

UniFrac statistics #380

Closed JB-8 closed 10 years ago

JB-8 commented 10 years ago

Hi,

I would like to use UniFrac to find out if the microbial communities in my experiment are affected by treatment. I have managed to calculate a UniFrac matrix (weighted and unweighted) and performed and PCoA as well as hierarchical clustering on the data. This shows me that the data does in fact split quite well by treatment.

However, I would like to strengthen my analysis using statistics on the UniFrac matrices, such as bootstrapping and the P-test (described here: http://bmf2.colorado.edu/unifrac/help.psp#phylo_test).

Is it possible to do this in phyloseq or are you aware of any other package which could do so?

I would be very thankful for any help and can provide you with more details about my question if needed.

Best, Julia

spholmes commented 10 years ago

Hi Julia There are several bootstrap packages in R but I think that you could do this very easily just by reassigning groups at random to the samples and recomputing a within group/between group distance statistic using the unifrac distances you have already computed. If you have 20 samples you get a 20 by 20 unifrac distance matrix between them, with the first 10 in one group( call gr1 the grouping variable equal to 1 or 2) and the next 10 in the other say

mat2=as.matrix(dist2) within12=sum(mat2[gr1==1,gr1==1])+sum(mat2[gr1==2,gr1==2]) computes a within group statistic then permute the groups for as many simulations (=S) as you need perm=rep(0,S) for (s in (1:S)){ gr1=gr1[sample(20,20,T)]

and recompute within12

perm[s]=sum(mat2[gr1==1,gr1==1])+sum(mat2[gr1==2,gr1==2]) } length(which(within12<perm))/S

Hope this makes sense, Susan

JB-8 commented 10 years ago

Hi Susan,

thanks for you reply. I'm not very experienced using these kinds of statistics in R and don't fully understand your suggestion - could you help me to clarify what you mean?

My UniFrac matrix looks like this (part of the matrix):

screen shot 2014-08-28 at 12 37 22

D/ND and DIL/NDIL are the 4 cross-treatments and the numbers are the blocks (each treatment was replicated 6x).

This is what the cluster dendrogram from unweighted UniFrac values, clustered with "Ward's minimum variance" looks like: screen shot 2014-08-28 at 12 46 53

To implement your suggestion, I have added the grouping variable as a column as well as a row named "gr1":

screen shot 2014-08-28 at 13 15 31

and then tried to implement your code, but I'm not sure if I'm doing this right:

screen shot 2014-08-28 at 13 40 42

This is the output:

screen shot 2014-08-28 at 13 40 29

I have to say that don't really understand this result and also don't know how I would get from this to bootstrap values for the hierarchical clustering tree…

I hope my reply is not too confusing and would be very thankful for further help! Julia

joey711 commented 10 years ago

Julia,

You say "4 cross-treatments", is this two treatments, and their respective controls? Do you have paired samples? Is there a relevant time sequence (e.g. a "crossover" design)? It sounds like you have more structure to this experimental design, and might be able to use that to ask more precise questions.

joey

JB-8 commented 10 years ago

Hi, After some more trying I think I got the way Susan suggested to test it working.

I calculated the within-statistic for each treatment using: mat2=as.matrix(masterUF) rownames(mat2)<-c(rep(c(1,2),each=12)) colnames(mat2)<-c(rep(c(1,2),each=12)) within12=sum(mat2[rownames(mat2)==1,colnames(mat2)==1])+sum(mat2[rownames(mat2)==2,colnames(mat2)==2])

Then I performed 1000 random permutations: S=1000 perm=rep(0,S) for (s in 1:1000){
gr1=c(rep(c(1,2),each=12)) tempgr<-gr1[sample(24,24,T)] rownames(mat2)<-tempgr colnames(mat2)<-tempgr perm[s]=sum(mat2[rownames(mat2)==1,colnames(mat2)==1])+sum(mat2[rownames(mat2)==2,colnames(mat2)==2]) }

and finally checked how often the within group statistic was larger than the random permutation: 1-length(which(within12_Daph<perm))/S

I think this should do it for now. If, however, you have any further ideas and suggestions of how to calculate UniFrac statistics in phyloseq/R let me know!

Thanks for your help, Julia

joey711 commented 10 years ago

The reason I asked about your experimental design is because adonis in the vegan package provides some useful testing framework for this sort of thing, but you have to provide it the experimental design using the formula interface (e.g. ~Subject + Period + Treatment). I believe we've some examples of this floating around. Try searching for "adonis" and "phyloseq"

Glad it sounds like you've solved your issue, though. I'll close this for now.