joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:
http://joey711.github.io/phyloseq/
567 stars 187 forks source link

technical question using ordinate phyloseq function with envfit of vegan Package #1761

Open cabraham03 opened 1 week ago

cabraham03 commented 1 week ago

Hi. I have a technical question. I want to generate a NMDS plot along with the OTUs based on the envfit statistics. If I have less than 3,000 OTUs I don’t have any problem, but if I have a >19,000 OTUs my machine just simply stop or get slow with envfit.

So, I want to generate a NMDS ordination using ordinate phyloseq function, and then obtain the distribution of the Samples ($points) and the distribution of the Species ($species), and then make a envfit analyses and used the significant (< 0.05) to filter and plot them together, but using the distribution generated with ordinate function.

I will use GlobalPatterns as examples, its a good example because it present >19,000 OTUs

GP <- GlobalPatterns
set.seed(1000)
GP.ord <- ordinate(GP, "NMDS", "bray", k=3)

extract Samples (points)

Sites <- data.frame(GP.ord$points)
Sites$SampleID <- rownames(Sites)
SD <- data.frame(sample_data(GP))
SD <- SD[, c(1,6)]
colnames(SD)[1] <- "SampleID"
Sites <- inner_join(Sites, SD, by = "SampleID")

now Species

Species <- data.frame(GP.ord$species)
Species$OTUsID <- rownames(Species)
taxt <- data.frame(as(tax_table(GP), "matrix"))
taxt$OTUsID <- rownames(taxt)
Species <- inner_join(Species, taxt, by = "OTUsID")
rownames(Species) <- Species$OTUsID                         
Species$OTUsID <- NULL

To avoid problems, I will generate a filter and then use envfit to obtain the significant

wh0 = genefilter_sample(GP, filterfun_sample(function(x) x > 5), A=0.3*nsamples(GP))
GP1 = prune_taxa(wh0, GP)
set.seed(1000)
GP1.ord <- ordinate(GP1, "NMDS", "bray", k=3)
Otus <- data.frame(as(otu_table(GP1), "matrix"))
Otus <- t(Otus)

now envfit

set.seed(1000)
spp.fit <- envfit(GP1.ord, Otus, permutations = 999, choices=c(1:3))

obtain the significant

spp.fit_df <- data.frame(cbind(r2 = spp.fit$vectors$r, spp.pvals = spp.fit$vectors$pvals ))
spp.fit_df <- spp.fit_df[spp.fit_df$spp.pvals <= 0.05, ]

then filter the significant OTUs from the Species generated with ordinate function


head(Species, 3)
             MDS1        MDS2       MDS3 Kingdom        Phylum        Class        Order        Family      Genus                  Species
549322 -1.2016958 -0.01723596 -0.3002742 Archaea Crenarchaeota Thermoprotei         <NA>          <NA>       <NA>                     <NA>
522457 -1.2311695  0.08762658 -0.7164607 Archaea Crenarchaeota Thermoprotei         <NA>          <NA>       <NA>                     <NA>
951    -0.4113979  1.48264357  0.2590104 Archaea Crenarchaeota Thermoprotei Sulfolobales Sulfolobaceae Sulfolobus Sulfolobusacidocaldarius

head(spp.fit_df, 3)
              r2 spp.pvals
250392 0.4012361     0.001
365828 0.3045288     0.017
12812  0.4500423     0.001

Species.sig <- Species[rownames(Species) %in% rownames(spp.fit_df), ]

and then generate a plot !!


ggplot(Sites, aes(x = MDS1, y = MDS2)) +                                                                           # Sites
    geom_vline(xintercept = 0, linetype="dashed", size = 0.5, color= "#999999") +                                  # V line
    geom_hline(yintercept = 0, linetype="dashed", size = 0.5, color= "#999999") +                                  # H line
    geom_point(aes(fill = SampleType), size = 7, shape = 21 ,color = "00000000", alpha=1) +                        # Sites points
    scale_fill_manual(values = MyPalette1 ) +                                                             # Colors Points
    geom_point(data = Species.sig, aes(x=MDS1, y=MDS2, color = Family), size = 2.5, alpha=0.6, shape=18) +     # Species
    scale_color_manual(values = MyPalette2) +
    geom_segment(data = Species.sig, aes(x = 0, xend=MDS1, y=0, yend=MDS2), 
                              arrow = arrow(length = unit(0.25, "cm")), colour = "grey",
                              lwd=0.25, linetype="dashed", alpha=0.7) +                           
    theme_bw()

My question is, if I have so many OTUs is valid to filter and then obtain the statistics with envfit, and generate the distribution with ordinate phyloseq function ? I mean, the direction, and length of the arrow in Species is representative of the relatedness with the Samples distribution using this method ?

I know that I can generate the the plot with for Samples and Species obtained from the filter data (GP1) and generate the ordinate and envfit, but I rather to use all OTUs to generate the Sample Distribution !!!

any suggestion ? Thanks