revise PCA lesson

tavareshugo commented 5 years ago

[x] Remove usage of autoplot(), which caused some confusion
[x] Instead, explain what the prcomp object is and how to extract things from it "manually"
[ ] use top_n() to get top PC loadings

tavareshugo commented 5 years ago

A new iteration of the lesson was trialled on 2019-07-04. The lesson was revised to include a detailed explanation of the prcomp object.

However, this perhaps went a bit too far the other way... Too much wrangling syntax is needed to do everything "manually". Need to find a better compromise between syntax (preparing the matrix for PCA and extracting results from it) and concept (what a PCA is and how to interpret its output).

Some ideas from post-lesson debriefing and sticky feedback:

- After explaining what PCA is, also explain that a biplot is usually used to represent the 3 main pieces of information together: the eigenvalues (in the axis labels), the PC scores (as points) and the variable loadings (as arrows/vectors). Then explain that in our case we will not do a full biplot, because we would have 6000 loadings being plotted!
- Retain steps 1-3 of the [first exercise](https://tavareshugo.github.io/data-carpentry-rnaseq/00_exercises.html#31_examine_prcomp()_output), so students realise this is a non-standard object. Once we know what's in there, motivate the use of tools that help extract/visualise the results from the PCA.
- Simplify the syntax. A few possibilities:
  - use the broom package to tidy/augment the prcomp output. This fits with the rest of the course, which focuses on tidyverse tools. It's an intermediate between "all manual" and "all automatic".
  - use the factoextra package. This is more "all automatic", and has many tools to help visualise results from multivariate analyses. I don't think it is as flexible, though, for example for colouring the points as one might wish...
  - go back to using the autoplot function
- Start with the screeplot, then move on to the PC scores graph.
- Extracting the "top" variable loadings along each PC is too much wrangling. In practice one would hardly ever look at the loadings of such massive matrices (with thousands of variables). So it could be left as optional material that is not covered in the lesson.
  - Instead of a "biplot"-style plot, could consider showing it as a "correlation-style" graph: corrplot::corrplot(sample_pca$rotation, is.corr = FALSE). Would still need to extract the top few loadings and focus on only a few of the PCs.
  - Alternatively, just point out that it's easy to ask the question: what were the top genes with highest loading on PC1? That might be enough really...
- Scale the data in prcomp.
- A few sticky notes said it would be nice to see examples applied to other data. Could actually start with the iris dataset, which is simple enough to grasp and ties in nicely with the animation shown at the start. Then possibly prepare the expression data with the students (transpose the matrix) and let them run prcomp on their own on the full data...
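
To make the iris idea concrete, here is a minimal sketch (base R only) of how the three pieces of information in a prcomp object could be pulled out by hand. The object and variable names (`iris_pca`, `pct_variance`, etc.) are just illustrative, not part of the lesson materials.

```r
# PCA on the four numeric iris measurements, scaled to unit variance
iris_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# prcomp() returns a list-like object of class "prcomp"
class(iris_pca)
names(iris_pca)

# Eigenvalues: the variance along each PC is the squared standard deviation
eigenvalues <- iris_pca$sdev^2
pct_variance <- 100 * eigenvalues / sum(eigenvalues)

# PC scores: one row per sample (flower), one column per PC
pc_scores <- iris_pca$x

# Variable loadings: one row per original variable, one column per PC
pc_loadings <- iris_pca$rotation

# Simple PC1 vs PC2 plot, coloured by species
plot(pc_scores[, "PC1"], pc_scores[, "PC2"],
     col = iris$Species,
     xlab = sprintf("PC1 (%.1f%%)", pct_variance[1]),
     ylab = sprintf("PC2 (%.1f%%)", pct_variance[2]))
```

With only four variables the loadings are easy to inspect directly, which is the contrast with the ~6000-gene expression matrix.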

tavareshugo commented 5 years ago

Possible simplification using broom:

# load packages
library(tidyverse)

# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")

# Create a transposed matrix from our table of counts
pca_matrix <- trans_cts %>% 
  column_to_rownames("gene") %>% 
  t()

# Perform the PCA
sample_pca <- prcomp(pca_matrix, scale. = TRUE)

# Extract tidy output
library(broom)
pc_scores <- augment(sample_pca) %>% 
  rename(sample = .rownames)
pc_eigenvalues <- tidy(sample_pca, matrix = "pcs")
pc_loadings <- tidy(sample_pca, matrix = "variables")

# screeplot
pc_eigenvalues %>% 
  ggplot(aes(PC, percent)) +
  geom_col()

# PC plot (with bonus ellipse?)
pc_scores %>% 
  full_join(sample_info, by = c("sample")) %>% 
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(colour = factor(minute))) +
  geom_polygon(stat = "ellipse", 
               aes(colour = factor(minute), fill = factor(minute)), 
               alpha = 0.1)

Then, the loadings part could be simplified to just say that we can pull out the genes with the highest loadings if we want to know which ones they are:

# alternatively could just say that for such a big matrix it's hard to interpret loadings
# but we could find out which genes have highest loading by:
pc_loadings %>% 
  filter(PC == "2") %>% 
  top_n(10, abs(value))
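
For comparison, a base R sketch of the same question, run on a small simulated matrix so that it is self-contained. The gene/sample names and the `toy_pca` object are made up for illustration; in the lesson this would be `sample_pca`. It also shows why the table is transposed first: prcomp() treats rows as observations and columns as variables.

```r
set.seed(1)

# Simulated stand-in for the expression table: 50 genes (rows) x 6 samples (columns)
n_genes <- 50
n_samples <- 6
counts <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
                 dimnames = list(paste0("gene", seq_len(n_genes)),
                                 paste0("sample", seq_len(n_samples))))

# Transpose so that samples are rows (observations) and genes are columns (variables)
pca_matrix <- t(counts)
toy_pca <- prcomp(pca_matrix, scale. = TRUE)

# Loadings on PC2, ranked by absolute value; keep the 10 largest
pc2_loadings <- toy_pca$rotation[, "PC2"]
top_genes <- head(sort(abs(pc2_loadings), decreasing = TRUE), 10)
names(top_genes)
```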

Finally, could show shortcut using factoextra:

library(factoextra)
fviz_pca_ind(sample_pca)

# illustrate why the biplot is useless in this case
fviz_pca_biplot(sample_pca)

tavareshugo / data-carpentry-rnaseq

revise PCA lesson #1