Open tavareshugo opened 5 years ago
A new iteration of the lesson was trialled on 2019-07-04 here. The lesson was revised to include a detailed explanation of the `prcomp` object.
However, this perhaps went a bit too far the other way... Too much wrangling syntax is needed to do everything "manually". Need to find a better compromise between syntax (preparing the matrix for PCA and extracting results from it) and concept (what a PCA is and how to interpret its output).
Some ideas from post-lesson debriefing and sticky feedback:
- After explaining what PCA is, also explain that a biplot is usually used to represent the 3 main pieces of information together: the eigenvalues (in the axis labels), the PC scores (as points) and the variable loadings (as arrows/vectors). Then explain that in our case we will not do a full biplot, because we would have 6000 loadings being plotted!
- Retain steps 1-3 of the [first exercise](https://tavareshugo.github.io/data-carpentry-rnaseq/00_exercises.html#31_examine_prcomp()_output), so students realise this is a non-standard object. Now that we know what's in there, motivate that we will use tools that help extract/visualise the results from the PCA.
- Simplify the syntax. A few possibilities:
  - `broom` package to tidy/augment the `prcomp` output. This fits with the rest of the course, which focuses on `tidyverse` tools. It's an intermediate between "all manual" and "all automatic".
  - `factoextra` package. This is more "all automatic", but has many tools to help visualise results from multivariate analysis tools. I also don't think this is as flexible, e.g. for colouring the points as one might wish...
  - `autoplot` function
- Start with the screeplot first, then move on to the PC scores graph.
- Extracting the "top" variable loadings along each PC is too much wrangling. In practice one would hardly ever look at the loadings of such massive matrices (with thousands of variables). So it could be left as optional material that is not covered in the lesson.
  - `corrplot::corrplot(sample_pca$rotation, is.corr = FALSE)`. Would still need to extract the top few and focus on only a few of the PCs.
- `prcomp` on the `iris` dataset first, which is simple enough to grasp and ties in nicely with the animation shown at the start. Then possibly prepare the expression data with the students (transpose matrix) and then let them run `prcomp` on their own on the full data...

Possible simplification using `broom`:
# load packages
library(tidyverse)
# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
# Create a transposed matrix from our table of counts
pca_matrix <- trans_cts %>%
  column_to_rownames("gene") %>%
  t()
# Perform the PCA
sample_pca <- prcomp(pca_matrix, scale. = TRUE)
# Extract tidy output
library(broom)
pc_scores <- augment(sample_pca) %>%
  rename(sample = .rownames)
pc_eigenvalues <- tidy(sample_pca, matrix = "pcs")
pc_loadings <- tidy(sample_pca, matrix = "variables")
# screeplot
pc_eigenvalues %>%
  ggplot(aes(PC, percent)) +
  geom_col()
# PC plot (with bonus ellipse?)
pc_scores %>%
  full_join(sample_info, by = "sample") %>%
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(colour = factor(minute))) +
  geom_polygon(stat = "ellipse",
               aes(colour = factor(minute), fill = factor(minute)),
               alpha = 0.1)
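To tie this back to the three pieces of information a biplot combines, and to the idea of starting from `iris`, here is a minimal base-R sketch of where each piece lives inside a `prcomp` object. The object names (`iris_pca`, `iris_scores`, `iris_pct_var`, `iris_loadings`) are made up for illustration; the `broom` calls above return roughly the same information in tidy form.

```r
# PCA on the four numeric columns of the built-in iris dataset
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# 1. PC scores: one row per sample, one column per PC
iris_scores <- iris_pca$x

# 2. Eigenvalues: variance of each PC (squared standard deviations),
#    also expressed as a percentage of the total variance
iris_eigen <- iris_pca$sdev^2
iris_pct_var <- 100 * iris_eigen / sum(iris_eigen)

# 3. Loadings: one row per original variable, one column per PC
iris_loadings <- iris_pca$rotation

dim(iris_scores)
round(iris_pct_var, 1)
iris_loadings
```

With only four variables the loadings matrix can be read directly, which is what makes `iris` a gentler starting point than the full expression matrix.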
Then, the loadings part could be simplified to just saying that we can pull out the genes with the highest loadings if we want to know which ones they are:
# alternatively could just say that for such a big matrix it's hard to interpret loadings
# but we could find out which genes have highest loading by:
pc_loadings %>%
  filter(PC == "2") %>%
  top_n(10, abs(value))
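One thing to keep in mind: `top_n()` is superseded in dplyr 1.0.0 and later, with `slice_max()` as the suggested replacement. A hedged sketch of the same idea, using `iris` as a small stand-in for the expression data (object names are hypothetical) and pulling the top loadings for two PCs at once:

```r
library(dplyr)
library(broom)

# Small stand-in for the expression matrix: the built-in iris measurements
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Long table of loadings with columns `column`, `PC` and `value`
iris_loadings <- tidy(iris_pca, matrix = "variables")

# Top 2 variables by absolute loading on each of the first two PCs
top_loadings <- iris_loadings %>%
  filter(PC %in% c(1, 2)) %>%
  group_by(PC) %>%
  slice_max(abs(value), n = 2) %>%
  ungroup()

top_loadings
```

Grouping by `PC` avoids repeating the filter for each component, which may help keep the wrangling short if this stays in the lesson.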
Finally, could show the shortcut using `factoextra`:
library(factoextra)
fviz_pca_ind(sample_pca)
# illustrate why the biplot is useless in this case
fviz_pca_biplot(sample_pca)
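For contrast, a biplot is perfectly readable when there are only a handful of variables. A small sketch on `iris`, assuming the `autoplot()` mentioned earlier is the `prcomp` method provided by the `ggfortify` package (the issue does not say which package was used; `iris_pca` and `iris_biplot` are illustrative names):

```r
library(ggplot2)
library(ggfortify)  # assumption: source of the autoplot() method for prcomp objects

iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Scores as points coloured by species, loadings as labelled arrows
iris_biplot <- autoplot(iris_pca, data = iris, colour = "Species",
                        loadings = TRUE, loadings.label = TRUE)
iris_biplot
```

Showing this next to the 6000-gene version could make the point about unreadable biplots without much extra explanation.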
- `autoplot()`, which caused some confusion
- `prcomp` object is and how to extract things from it "manually"
- `top_n()` to get top PC loadings