jfukuyama / adaptiveGPCA

9 stars 0 forks source link

Make biplot with ellipses and external variable vectors #4

Open janstrauss1 opened 3 years ago

janstrauss1 commented 3 years ago

Dear @jfukuyama,

I would like to use adaptiveGPCA to analyze a marine microbiome data set and explore the environmental drivers of community composition. A similar example but using PCoA can be found at https://doi.org/10.1126/science.1261359 .

To do this, I'd like to include information of contiguous measurements of environmental parameters in biplots and generate biplots like in the example below which I found at https://doi.org/10.1371/journal.pcbi.1006907 using prcomp.

# install.packages(c("tidyverse", "factoextra"))
library("tidyverse")
library("factoextra")
#> Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

### Example taken from Nguyen LH, Holmes S (2019) Ten quick tips for effective dimensionality reduction. 
### PLoS Comput Biol 15(6): e1006907. https://doi.org/10.1371/journal.pcbi.1006907
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data"
wine <- read_csv(url, col_names = FALSE)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   X1 = col_double(),
#>   X2 = col_double(),
#>   X3 = col_double(),
#>   X4 = col_double(),
#>   X5 = col_double(),
#>   X6 = col_double(),
#>   X7 = col_double(),
#>   X8 = col_double(),
#>   X9 = col_double(),
#>   X10 = col_double(),
#>   X11 = col_double(),
#>   X12 = col_double(),
#>   X13 = col_double(),
#>   X14 = col_double()
#> )
colnames(wine) <- c("class", "Alcohol", "MalicAcid", "Ash", "AlcAsh", "Mg","Phenols", "Flav", "NonFlavPhenols", "Proa", "Color","Hue", "OD", "Proline")
head(wine)
#> # A tibble: 6 x 14
#>   class Alcohol MalicAcid   Ash AlcAsh    Mg Phenols  Flav NonFlavPhenols  Proa
#>   <dbl>   <dbl>     <dbl> <dbl>  <dbl> <dbl>   <dbl> <dbl>          <dbl> <dbl>
#> 1     1    14.2      1.71  2.43   15.6   127    2.8   3.06           0.28  2.29
#> 2     1    13.2      1.78  2.14   11.2   100    2.65  2.76           0.26  1.28
#> 3     1    13.2      2.36  2.67   18.6   101    2.8   3.24           0.3   2.81
#> 4     1    14.4      1.95  2.5    16.8   113    3.85  3.49           0.24  2.18
#> 5     1    13.2      2.59  2.87   21     118    2.8   2.69           0.39  1.82
#> 6     1    14.2      1.76  2.45   15.2   112    3.27  3.39           0.34  1.97
#> # … with 4 more variables: Color <dbl>, Hue <dbl>, OD <dbl>, Proline <dbl>

# note that Nebbiolo is a grape variety for Barolo
wine.class <- factor(wine$class, levels = c(1, 2, 3), labels = c("Nebbiolo", "Grignolino", "Barbera"))
wine <- wine[,-1]
table(wine.class)
#> wine.class
#>   Nebbiolo Grignolino    Barbera 
#>         59         71         48

winePCA <- prcomp(wine, scale = TRUE)

fviz_pca_biplot(
  winePCA, 
  geom = "point",
  col.ind = wine.class,
  col.var = "#c07d44",
  pointsize = 2.5,
  labelsize = 7,
  addEllipses = TRUE,
  ellipse.level = 0.7) + 
  coord_fixed() + 
  ggtitle("") + 
  ylim(-4, 4) +
  scale_fill_brewer(name = "Grape variety", palette = "Set1") +
  scale_color_brewer(name = "Grape variety", palette = "Set1") +
  scale_shape_manual(name = "Grape variety", values = c(16, 16, 16)) +
  theme(text = element_text(size = 25))

Created on 2021-05-07 by the reprex package (v2.0.0)

How could I accomplish this using adaptiveGPCA though?

Many thanks in advance for any suggestion!

jfukuyama commented 3 years ago

Hello @janstrauss1,

This is something you have to do a little bit by hand, but it's not too difficult. Using mostly the same code as in the vignette, we can do something like this to make ellipses:

library(adaptiveGPCA)
library(phyloseq)
data(AntibioticPhyloseq)
theme_set(theme_bw())
pp = processPhyloseq(AntibioticPhyloseq)
out.agpca = adaptivegpca(pp$X, pp$Q, k = 2)
ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) +
    geom_point(aes(x = Axis1, y = Axis2, color = ind)) +
    stat_ellipse(aes(x = Axis1, y = Axis2, color = ind)) +
    xlab("Axis 1") + ylab("Axis 2")

This makes a scatter plot of the rows (= samples), and puts an ellipse around the points for each of the three individuals (different values of ind). The main thing that we do in the ggplot is make the object data.frame(out.agpca$U, sample_data(AntibioticPhyloseq)): the out.agpca$U part has the locations of the samples, and the sample_data(AntibioticPhylooseq) has the metadata for the samples. The locations of the samples are what you need to plot the points, and the metadata is needed for constructing the ellipses.

The output should look like this: ellipse-example

For microbiome data I usually do the biplot points and axes separately because there are too many variables, but you can see how to do the biplot axes in the vignette and if you want to combine them you can just do it the normal way you would in ggplot.

janstrauss1 commented 3 years ago

Many thanks for your feedback @jfukuyama! That already helps a lot.

Yet, I think I have a different biplot in mind than you and I would be additionally also very interested in learning how to draw vectors (i.e., the arrows in the wine example above) showing a projection of external/supplementary variables (e.g., sample attributes) onto the adaptiveGPCA ordination. I couldn't really find this addressed in the vignette.

My ultimate aim is actually to visualize the contribution of measured environmental parameters (continuous variables) to the microbiome composition of samples.

I guess in the AntibioticPhyloseq example data set the supplementary variables that I would like to show as vectors in the ordination plots would be the antibiotic treatment and time (and potentially maybe additional parameters like body weight, age etc. of the individuals if available). Probably the AntibioticPhyloseq data might not be the best example though since it doesn't seem to include continuous supplementary variables except time.

Hope this makes sense.

Many thanks in advance for any help and suggestion!

janstrauss1 commented 3 years ago

Dear @jfukuyama,

I realized that the the function envfit from the R package vegan can fit environmental vectors of factors onto an ordination.

I haven't managed to apply it with the output of adaptivegpca though, yet.

jfukuyama commented 3 years ago

Hello @janstrauss1,

I think I see what you mean. I'm not sure exactly what envfit does, but if it's the same thing as the supplemental vectors described e.g. on page 47 here, then you could do it like this:

First set up, same as before:

library(adaptiveGPCA)
library(phyloseq)
library(ggrepel)
data(AntibioticPhyloseq)
theme_set(theme_bw())
pp = processPhyloseq(AntibioticPhyloseq)
out.agpca = adaptivegpca(pp$X, pp$Q, k = 2)
bp = ggplot(data.frame(out.agpca$U, sample_data(AntibioticPhyloseq))) +
    geom_point(aes(x = Axis1, y = Axis2, color = ind)) +
    stat_ellipse(aes(x = Axis1, y = Axis2, color = ind)) +
    xlab("Axis 1") + ylab("Axis 2")

Then we can compute the supplemental points for the variables corresponding to the individuals. The supplemental vectors are just given by the regression coefficients, so we can calculate those and then plot them. I'm doing ~ 0 + ind in the model because I want one coefficient for each individual, this is something that is specific to factor variables and wouldn't be used if ind were continuous.

ind_lm = lm(out.agpca$U ~ 0 + ind, data = data.frame(sample_data(AntibioticPhyloseq)))
coefs_ind = data.frame(coef(ind_lm))

bp + geom_segment(aes(x = 0, y = 0, xend = Axis1, yend = Axis2),
                  data = coefs_ind, arrow = arrow(length = unit(.01, "npc"))) +
    geom_text_repel(aes(x = Axis1, y = Axis2, label = rownames(coefs)), data = data.frame(coefs))

which gives ind-example

We can do the same thing for the time variable just to show what it would look like with a continuous variable:

time_lm = lm(out.agpca$U ~ time, data = data.frame(sample_data(AntibioticPhyloseq)))
coefs_time = data.frame(coef(time_lm))[-1,] ## we don't want to plot the intercept

bp + geom_segment(aes(x = 0, y = 0, xend = Axis1, yend = Axis2),
                  data = coefs_time,
                  arrow = arrow(length = unit(.01, "npc")))

which gives time-example

As you noted, time isn't a very good variable to show as a supplemental vector because it is almost completely unrelated to either axis. This shows up in the arrow being very short, in the picture you can only actually see the arrowhead, not the segment itself.