would love more descriptive column `covariate`

adw96 commented 5 months ago

I'm working on a lab for radEmu (relevant lines below), and noticed something suboptimal about printining ch_fit vs ch_fit$coef %>% head

The column covariate in ch_fit$coef %>% head lists GroupCRC, while in ch_fit it lists 1.00000. The latter isn't helpful for helping me see which level of the factor is the reference group vs non-reference (which is important for a default printing function!).

Could we please have a modification to the printing function for ch_fit to allow covariate to be a string?

Thank you!

### A differential abundance lab
###
###
### Prepared for the University of Wisconsin Madison's Methods for Biological Data Workshop
### on April 3 2024 by Amy Willis, David Clausen and Sarah Teichman
###
###

### In this lab we'll investigate which microbial strains are differentially abundant
### in the gut microbiomes of people with colorectal cancer (CRC) compared to those without CRC
###
### Note that the methods used in this paper are not specific to microbiome -- they are suited to
### differential abundance of any biological units, and are tailored to settings where
### high-throughput sequencing is used. So, these tools could be applied to transcriptome,
### RNAseq, single-cell seq, etc...
###
### These data are published by Wirbel et al. (2019).
### (https://www.nature.com/articles/s41591-019-0406-6)
### This is a meta-analysis of case-control studies, meaning that Wirbel
### et al. collected raw sequencing data from studies other researchers
### conducted and re-analyzed it (in this case, they also collected some
### new data of their own).

### Wirbel et al. published two pieces of data we'll focus on today:

# 1. metadata giving demographics and other information about participants

# 2. an mOTU table (a sample-by-category table, where the categories are "metagenomic OTUs",
#    a proxy for bacterial strains)

### To speed up computational time for the purposes of illustration,
### we'll look at a subset of all 849 mOTUs Wirbel et al. published

### We're most interested to compare microbial abundance in cases diagnosed
### with colorectal cancer to abundances in controls (without this diagnosis)

### To help you practice formatting your data into the needed form,
### this lab will work from the data published by Wirbel et al, and talk
### you through formatting it for use in differential abundance methods

### First let's load libraries we'll need. If you don't have these already installed, you might have to install them
library(tidyverse)
library(Matrix)
library(remotes)
# install radEmu (only if you don't already have it installed)
if (!("radEmu" %in% row.names(installed.packages()))) {
  install_github("https://github.com/statdivlab/radEmu")
}
library(radEmu)

metadata <-
  read_csv("https://raw.githubusercontent.com/statdivlab/stamps2023/main/labs/radEmu-lab/data/wirbel_et_al_metadata.csv")

### Let's take a look at the data. Feel free to explore what variables are here using eg `colnames()`!
head(metadata)

### Let's see how many observations we have among cases ("CRC") and
### controls ("CTR")
metadata %>%
  group_by(Group) %>%
  summarize(n = length(Group))

### metadata$Group tells us which participants are cases and which are controls

### We have data from studies in 5 different countries
### How much from each study, you ask? Let's find out!
metadata %>%
  group_by(Country) %>%
  summarize(n = length(Group))

### Let's see how many cases and controls were enrolled in each study as well
metadata %>%
  with(table(Country, Group))

### Now let's load the mOTU table
mOTU_table <-
  read_csv("https://raw.githubusercontent.com/statdivlab/stamps2023/main/labs/radEmu-lab/data/wirbel_et_al_mOTUs.csv")

# let's take a peek at the mOTU table
head(mOTU_table)
# how are the columns in this object named?
# (Looking at the metadata again may be informative)

### column "X1" of our mOTU table contains names of our mOTUs
### let's store them in a separate object
mOTU_names <- mOTU_table$mOTU

### now we'll clean up the mOTU table a bit
mOTU_table <- mOTU_table %>%
  dplyr::select(-1) %>% #remove first column (mOTU names)
  as.matrix() %>% # make this whole deal a matrix
  t() %>% # transpose so that rows are samples and columns are mOTU names
  (Matrix::Matrix) #now make our transposed matrix into a ~fancy~ Matrix
# using the Matrix package

### we need to keep track of which columns of our transposed mOTU_table
### correspond to which mOTUs, so let's name them!
colnames(mOTU_table) <- mOTU_names

### we're not *quite* done with data cleaning and organization
### first, if we compare the dimensions of mOTU_table and metadata...
dim(mOTU_table)
dim(metadata)

### yikes -- we have more measurements than metadata!
### in general, this would require further investigation...
### but you can take my word that we don't need the extra observations
### let's get rid of them!

rows_in_metadata <- sapply(rownames(mOTU_table),
                           function(x) x %in% metadata$Sample_ID)

mOTU_table <- mOTU_table[rows_in_metadata,]
### let's check our dimensions again now
dim(mOTU_table)
dim(metadata)

### success!
### let's check that rows of metadata and mOTU_table are in the same order
order_for_metadata <- sapply(rownames(mOTU_table),
                             function(x) which(x == metadata$Sample_ID))

# if rows are in same order, this plot should look like y = x
plot(order_for_metadata)

# ok nope - not in same order, so let's reorder:
metadata <- metadata[order_for_metadata,]

# now let's check our work and look at the ordering again
order_for_metadata <- sapply(rownames(mOTU_table),
                             function(x) which(x == metadata$Sample_ID))

plot(order_for_metadata)

# yay

### if we look at the first few rownames of mOTU_table and sample_IDs
### in metadata, we see the order also matches

metadata$Sample_ID %>% head
rownames(mOTU_table) %>% head

### one last data organizing / cleaning / beautifying step...
### Wirbel et al. published their mOTU table...
### after dividing each row (i.e., count data for each sample) by its total
### and then rounding anything below 1e-6 to zero
### ... please don't do this! Publish your count data as is -- other
### researchers can easily transform counts, but un-transforming is
### harder!

### in any case, Wirbel et al. also published library size (i.e., count totals)
### by sample, so we can recover most of the count data by multiplying each
### row by the corresponding library size (we still don't get back counts that
### were rounded to zero)

for(i in 1:nrow(mOTU_table)){
  mOTU_table[i,] <- mOTU_table[i,]*metadata$Library_Size[i]
}

### Now we'll pull out some bacterial species we are interested in.
### Note that we could definitely fit a model to all mOTUs, but it takes longer,
### and isn't well suited to seeing these methods for the first time :)
fuso <- sapply(mOTU_names,function(x) grepl("Fusobac",x,fixed = TRUE))
prevo <- sapply(mOTU_names,function(x) grepl("Prevotella ",x,fixed = TRUE))
porph <-  sapply(mOTU_names,function(x) grepl("Porphyromonas ",x,fixed = TRUE))
clostr <- sapply(mOTU_names, function(x) grepl("Clostridium",x, fixed = TRUE))
faecali <- sapply(mOTU_names, function(x) grepl("Faecalibact",x, fixed = TRUE))
eubact <- sapply(mOTU_names, function(x) grepl("Eubact",x, fixed = TRUE))

# take a look at "eubact" -- what does this look like? What information
# does it contain?

head(eubact)

### "Group" in metadata indicates whether each participant is case (diagnosed
### with colorectal cancer) or a control (no such diagnosis)
### Let's make this a factor
unique(metadata$Group)

### make "CTR" the reference category (i.e., the first category)
metadata$Group <- factor(metadata$Group,
                         levels = c("CTR","CRC"))

### we'll stick to the genera Eubacterium, Porphyromonas, Faecalibacteria,
### and Fusobacterium for now
# store names of the mOTUs in these genera
restricted_mOTU_names <- mOTU_names[eubact|porph|fuso|faecali]

# Figure out which columns of mOTU_table contain observations in
# Eubacterium, Porphyromonas, Faecalibacteria, or Fusobacterium
which_mOTU_names <- which(eubact|porph|fuso|faecali)

# Rather than looking at all of the observations (again, just to constrain scope while we're learning)
# we'll begin by just looking at data from a study of Chinese cases and controls
ch_study_obs <- which(metadata$Country %in% c("CHI"))

# let's fit a model!
# ... emuFit will talk at you a bit, but don't worry about it
ch_fit <-
  emuFit(formula = ~ Group, # this is a formula telling radEmu what predictors to
         # use in fitting a model. Recall Group is an indicator for which
         # participants have CRC diagnoses and which do not.
         data = metadata[ch_study_obs, ], # we're only looking at rows
         # containing observations from the Chinese study
         # data contains our predictor data
         Y = as.matrix(mOTU_table[ch_study_obs, which_mOTU_names]), # which_mOTU_names <=> we're only looking at a subset of strains
         # We're looking at to Eubacterium, Faecalibacterium,
         # Porphyromonas, and Fusobacterium
         # (which_mOTU_names was constructed above)
         run_score_tests = FALSE)

### NOTE!!! If you got an error like "object 'Csparse_transpose' not found", then please
### run the following 3 lines of code. You should then be able to rerun the above.
# remove.packages("MatrixExtra")
# install.packages("MatrixExtra") ## type "Yes" and ENTER if prompted to install from source
# library(MatrixExtra)

ch_fit

### Let's look at estimates and confidence intervals for the effect of Group!
ch_fit$coef %>% head

adw96 commented 5 months ago

Also, could category_num be printed like an integer (rather than a double) and similarly could category be a string? (if the categories have names) 🙏

svteichman commented 5 months ago

Yes I'll work on this!

adw96 commented 5 months ago

The workshop is 12-2 today, so let's not have any changes to main in the interim, please 🙏 but soon after would be great! (I'm giving a couple of workshops this quarter and will reuse materials)

svteichman commented 5 months ago

When I run this on my computer with a fresh install of statdivlab/radEmu I get GroupCRC listed as the covariate, category_num given as an integer, and category as a string for both ch_fit and ch_fit$coef %>% head. Going to the commit history for print.emuFit, it looks like the things you are seeing were issues with the first commit but fixed in the second. However, I assume that you are using a version of radEmu that has been updated since then, so maybe it's an issue with computer settings somehow. I'll look into it further.

adw96 commented 5 months ago

My apologies for the bother, @svteichman . This all looks great now!

statdivlab / radEmu

would love more descriptive column `covariate` #43