Progress Report for Team_Undecided

9de029f9d00e2957e001042724053c9fbbfbe400

@ppavlidis @singha53 @farnushfarhadi @santina

Hi @STAT540-UBC/team-undecided,

Thanks for the detailed progress report. Here are a few comments:

The exact normalization methods the data deposited on GEO has undergone are not clear, so we are clarifying with the authors.

That's awesome. Be sure to mention how it was done in the poster.

... results in an imbalanced dataset. We will likely perform bootstrapping to resolve this issue.

I don't understand how bootstrapping helps with an imbalanced dataset. How are those two concepts related?

However, our preliminary investigation using limma suggests that very few genes are differentially expressed; at a significance threshold of p-value = 0.05, only 6 genes were found. This may not give us enough power to detect differences. A more promising approach is to look at genes that are associated with relevant biological pathways. These pathways may be identified by quickly checking the Gene Ontology for the differentially expressed genes. ... We will attempt to maximize the number of genes we can look at given our resources, maybe ~100 genes or so.

Since your GitHub repo is organized by people's names rather than by the steps in the analysis, it's difficult to find the code for the RNAseq analysis. It'd be interesting to see how that part was done. Here's a practice on RNAseq that I made that might be helpful, in addition to seminar 7.
It also sounds like you might go with the genes in relevant biological pathways based on the literature rather than on your analysis? Could you clarify what you meant here? Do you mean you'll go with the 6 genes and all their related genes, so there will be more genes (~100) to look at, even though they might not be differentially expressed? I'm a bit confused here.

It looks like the current step is understanding WGCNA better. Here's a tutorial that might help (though maybe you already have this).

Another thing that wasn't clear to me is how you plan to integrate the results you get from the methylation and RNAseq data.

Overall I think this is a pretty detailed report. It's clear that you put a lot of thought into it. I haven't been monitoring your project as closely as @singha53, so he will probably have more/better comments about your progress report. 😄

Hi Santina!

Thanks for the prompt feedback.

I might be mistaken, correct me if so! By bootstrapping we really mean resampling. Maybe we can synthesize additional samples for the smaller group by sampling the existing ones? I admit these results are super new (as of yesterday), so we haven't had the chance to give enough thought to how to handle the imbalanced data. Do you have more resources / suggestions for us to consider?
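To make the resampling idea concrete, here is a minimal Python sketch of random oversampling: samples from the smaller group are drawn with replacement until every group is as large as the largest one. The sample IDs and group labels are hypothetical, and this is only an illustration, not code from our repository (our scripts are in R). Note that the duplicated samples carry no new information, so this balances group sizes without adding statistical power.

```python
import random
from collections import Counter

def oversample_minority(samples, labels, seed=0):
    """Resample smaller groups with replacement until all groups match the largest."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing samples with replacement (none for the largest group).
        extra = rng.choices(group, k=target - len(group))
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical sample IDs and labels, for illustration only.
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
labels = ["control"] * 4 + ["Th2_high"] * 2

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```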

So far, we have each been investigating our own individual components, hence the directory organization. We are looking to have something more integrated and intuitive by the end. Sorry about the messiness! Hopefully, we'll have an automatic pipeline that can perform the entire analysis in a single call.

I didn’t get a chance last night to prepare the markdown file for the preliminary differential network analysis & differential expression analysis using limma. I have pushed the .md file now, please take a look. Preliminary dina analysis

My mistake - the 6 genes were identified using a stringent FDR threshold of 0.05, which is unlikely to yield a large number of genes anyway. We could potentially raise the FDR threshold to obtain more genes for analysis. But our plan, as it stands, is to identify relevant pathways and use the genes that are associated with these pathways. The pathways would be identified by running a GO-based gene function analysis on the most likely differentially expressed genes. So, to sum it up:

identify the differentially expressed genes
identify the pathways associated with the top-ranked (low p-value / high effect) genes
validate that these pathways are important in asthma by doing literature review
use the genes associated with these pathways for differential network analysis

For the integration of RNA-seq and methylation data, we will:

map methylation sites to genes
for each gene, we will have 2 vectors (expression level across the samples & methylation level across the samples)
networks will be built using these 2 different data types
1. each “vector” consisting of either expression values or methylation values will be a node in the network
2. connections between nodes will simply be Pearson's correlation coefficient
3. so in the network, some nodes will correspond to expression data and some to methylation data
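As a rough illustration of this plan, the Python sketch below treats each (gene, data type) vector as a node and stores the Pearson correlation between every pair of nodes as the edge weight. The gene names and values are made up, and it is a sketch under those assumptions rather than our actual implementation.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (undefined for constant vectors)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical nodes: one vector per (gene, data type), measured across the same 4 samples.
nodes = {
    ("GENE_A", "expression"): [1.0, 2.0, 3.0, 4.0],
    ("GENE_A", "methylation"): [0.8, 0.6, 0.4, 0.2],
    ("GENE_B", "expression"): [2.0, 4.1, 5.9, 8.0],
}

# Edge weight = Pearson correlation between the two node vectors.
edges = {(u, v): pearson(nodes[u], nodes[v]) for u, v in combinations(nodes, 2)}

for (u, v), weight in edges.items():
    print(u, v, round(weight, 3))
```

In this toy data, the expression and methylation nodes of GENE_A are perfectly anti-correlated, which is the kind of cross-data-type edge the mixed network is meant to capture.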

Thanks again for your feedback!! :)

@STAT540-UBC/team-undecided some comments and suggestions

Aim 1: Data preprocessing

implement Hidden Covariates with Prior algorithm (Mostafavi, et al, written in matlab) to adjust for known (GC bias, age, sex) and unknown (comorbidities, batch effects) covariates.

Is it possible for you to describe a little bit more of what this “Hidden covariates with prior algorithm” does? What is the underlying statistical methodology, what are its limitations, what type of data can it be used for? Does the algorithm require that you input your response variable so it doesn’t remove effects related to the variable of interest?
Also you state that you are adjusting for age, sex, GC bias, but can you show that there is an effect to begin with? e.g. you show the PCA plot and color by sex to show no clustering after batch correction, but what does this plot look like before batch correction?
Use: knitr::opts_chunk$set(warning = FALSE, message = FALSE) to hide all the warning messages in your HTML file
Not sure how easy it is to add ellipses to your PCA plots, but they might make the clusters of points easier to see (look into the ggbiplot R library shown in the lecture)
Also keep your plots consistent: your PCA plots for gene expression show the proportion of variance explained, but those for methylation do not.

Aim 2: Patient clustering

I don’t quite agree with the clustering analysis:

Based on your R code, you are using count data; please use normalized data, since normalization can have a large impact on how objects cluster. E.g. a common normalization procedure performed by limma voom is to scale every count of a sample by the total library size: suppose that in sample 1 CLCA1 had 3000 counts and the total library size was 1M counts. Now suppose that in sample 2, CLCA1 had 6000 counts and the total library size was 2M. Essentially sample 2 was sequenced more deeply, so all of its genes have twice as many reads as in sample 1. Therefore, after normalization the gene is equally expressed in both samples --> normalization is very important.
NOTE: normalize your entire dataset first, then take out the expression data for CLCA1, SERPINB2 and periostin (don't normalize them independently)
You have only a few samples with very high counts, which is why your Th2 high group has very few subjects. log2 your data (limma voom also does this: look into the voom function from limma)
Also, the scale of the genes has a large impact on how objects cluster. After you normalize your data --> standardize your genes (center and scale)
Then perform k-means clustering
Lastly, since you only have 3 genes, try making a 3d plot of your clusters where each axis is a gene.
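The CLCA1 example above can be worked through in a few lines. The Python sketch below scales counts to counts per million, applies a log2 transform, and standardizes one gene across samples. The `+ 1` pseudocount and the four-sample CPM vector are assumptions made for illustration; this is not the exact transformation that voom applies, and the real analysis would be done in R.

```python
import math

def counts_per_million(count, library_size):
    """Scale a raw count by the sample's total library size."""
    return count * 1_000_000 / library_size

def standardize(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

# The CLCA1 example: twice the counts, but also twice the library size.
cpm_sample1 = counts_per_million(3000, 1_000_000)
cpm_sample2 = counts_per_million(6000, 2_000_000)
print(cpm_sample1, cpm_sample2)  # both 3000.0

# Hypothetical CPM values for one gene across four samples.
gene_cpm = [cpm_sample1, cpm_sample2, 12000.0, 750.0]

# log2 with a pseudocount of 1 (an assumption, to avoid log2(0)).
gene_log2 = [math.log2(value + 1) for value in gene_cpm]

# Standardize so that every gene contributes on the same scale before clustering.
gene_scaled = standardize(gene_log2)
print([round(value, 3) for value in gene_scaled])
```

The standardized values for CLCA1, SERPINB2 and periostin would then be the input to k-means.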

Aim 3: Hypothesis testing

Gene filtering

at a significant value of fdr = 0.05, only 6 genes were found: you can relax your FDR, say 30% for differential network analysis since this is only for exploratory purposes.
Or try the q-value :)
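To show what relaxing the threshold does, here is a small Python sketch of the Benjamini-Hochberg adjustment applied to hypothetical p-values, counting how many genes pass at FDR 0.05 versus 0.30. This is plain Benjamini-Hochberg, not Storey's q-value, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.6]
adjusted_p = benjamini_hochberg(raw_p)

strict = sum(p <= 0.05 for p in adjusted_p)
relaxed = sum(p <= 0.30 for p in adjusted_p)
print(strict, relaxed)  # 2 genes at FDR 0.05, 4 genes at FDR 0.30
```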

Network construction

Hetergeous (spelling)
Please confirm if you are using normalized data for this step as well

Looking at: https://github.com/STAT540-UBC/team_Undecided/blob/master/Arjun_Scripts/WGCNA.Rmd

“At this point, I can make the modules, but I am not sure how to interpret them or even if the parameters I used were appropriate.” – WGCNA tutorials provide comprehensive details on this
“I am also confused about how WGCNA output will be appropriate with Eric’s DiNA part.” – Currently Eric is using differential expression analysis to feed into his differential network analysis. Suppose you apply WGCNA to the normal subjects only and give Eric the list of genes that fall into each of the modules you identify. Eric can then determine which modules have the most significant differential connectivity by comparing HC with the TH2 low and TH2 high groups per module, and rank the modules by how much their connectivity differs. (just one idea)
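One hypothetical way to score this idea: for each module, compare the gene-gene correlations in two groups and average the absolute differences, then rank the modules by that score. The Python sketch below does this for toy data. The scoring function, module names, gene names and values are all assumptions for illustration; this is not a WGCNA function or Eric's method, and it does not assess significance.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def module_differential_connectivity(genes, expr_a, expr_b):
    """Mean absolute change in pairwise correlation between two groups."""
    diffs = [
        abs(pearson(expr_a[g1], expr_a[g2]) - pearson(expr_b[g1], expr_b[g2]))
        for g1, g2 in combinations(genes, 2)
    ]
    return sum(diffs) / len(diffs)

# Hypothetical modules (as if found by WGCNA in healthy controls) and toy expression data.
modules = {"module_1": ["G1", "G2"], "module_2": ["G3", "G4"]}
healthy = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [1, 3, 2, 4], "G4": [2, 6, 4, 8]}
th2_high = {"G1": [1, 2, 3, 4], "G2": [8, 6, 4, 2], "G3": [1, 3, 2, 4], "G4": [2, 6, 4, 8]}

scores = {
    name: module_differential_connectivity(genes, healthy, th2_high)
    for name, genes in modules.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In the toy data, the G1-G2 correlation flips sign between groups, so module_1 ranks first, while module_2 is unchanged.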

Differential expression analysis using Limma

Looking at: https://github.com/STAT540-UBC/team_Undecided/blob/master/Eric_Scripts/preliminary_dna_analysis.md

The usual limma pipeline does not apply to count data --> look into limma voom
"Differentially expressed genes are much more likely to also exhibit differential connectivity!" What do you mean by this? Differential expression and differential connectivity are two different things.
