tavareshugo / data-carpentry-rnaseq

https://tavareshugo.github.io/data-carpentry-rnaseq/
21 stars 4 forks source link

revise PCA lesson #1

Open tavareshugo opened 5 years ago

tavareshugo commented 5 years ago
tavareshugo commented 5 years ago

New iteration of the lesson was trialled 2019-07-04 here. The lesson was revised to have detailed explanation of prcomp object.

However, this was perhaps a bit too much the other way... There's too much wrangling syntax necessary to do everything "manually". Need to find a better compromise between syntax (preparing matrix for PCA and extract stuff from it) and concept (what is a PCA and how to interpret output).

Some ideas from post-lesson debriefing and sticky feedback:


tavareshugo commented 5 years ago

Possible simplification using broom:

# load packages
library(tidyverse)

# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")

# Create a transposed matrix from our table of counts
pca_matrix <- trans_cts %>% 
  column_to_rownames("gene") %>% 
  t()

# Perform the PCA
sample_pca <- prcomp(pca_matrix, scale. = TRUE)

# Extract tidy output
library(broom)
pc_scores <- augment(sample_pca) %>% 
  rename(sample = .rownames)
pc_eigenvalues <- tidy(sample_pca, matrix = "pcs")
pc_loadings <- tidy(sample_pca, matrix = "variables")

# screeplot
pc_eigenvalues %>% 
  ggplot(aes(PC, percent)) +
  geom_col()

# PC plot (with bonus ellipse?)
pc_scores %>% 
  full_join(sample_info, by = c("sample")) %>% 
  ggplot(aes(.fittedPC1, .fittedPC2)) +
  geom_point(aes(colour = factor(minute))) +
  geom_polygon(stat = "ellipse", 
               aes(colour = factor(minute), fill = factor(minute)), 
               alpha = 0.1)

Then, the loadings part could be simplified to just say that we can pull the top genes with highest loading if we want to know who they are:

# alternatively could just say that for such a big matrix it's hard to interpret loadings
# but we could find out which genes have highest loading by:
pc_loadings %>% 
  filter(PC == "2") %>% 
  top_n(10, abs(value))

Finally, could show shortcut using factoextra:

library(factoextra)
fviz_pca_ind(sample_pca)

# illustrate why the biplot is useless in this case
fviz_pca_biplot(sample_pca)