santina / team_Undecided

Progress Report for Team_Undecided #4

Open pan-chu opened 7 years ago

pan-chu commented 7 years ago

9de029f9d00e2957e001042724053c9fbbfbe400

Progress Report

@ppavlidis @singha53 @farnushfarhadi @santina

santina commented 7 years ago

Hi @STAT540-UBC/team-undecided,

Thanks for the detailed progress report. Here are a few comments:

The exact normalization methods the data deposited on GEO has undergone are not clear, so we are clarifying with the authors.

That's awesome. Be sure to mention how it was done in the poster.

... results in an imbalanced dataset. We will likely perform bootstrapping to resolve this issue.

I don't understand how bootstrapping helps with an imbalanced dataset ... How are those two concepts related?

However, our preliminary investigation using limma suggests that very few genes are differentially expressed; at a significance threshold of p-value = 0.05, only 6 genes were found. This may not give us enough power to detect differences. A more promising approach is to look at genes that are associated with relevant biological pathways. These pathways may be identified by quickly checking the Gene Ontology for the differentially expressed genes. ... We will attempt to maximize the number of genes we can look at given our resources, maybe ~100 genes or so.

Since your GitHub was organized by people's names rather than the steps in the analysis, it's difficult to find the code for the analysis on RNAseq. It'd be interesting to see how that part was done. Here's a practice on RNAseq that I made that might be helpful, in addition to seminar 7.
It also sounds like you might go with the genes in relevant biological pathways based on the literature rather than your analysis? Could you clarify what you meant here? Do you mean you'll go with the 6 genes and all their related genes, so there will be more genes (~100) you can look at, even though they might not be differentially expressed? I'm kind of confused here.

It looks like the current step is at understanding WGCNA better. Here's a tutorial that might help (though maybe you already have this).

Another thing that wasn't clear to me is how you plan on integrating the results you get from the methylation and RNAseq data.

Overall I think this is a pretty detailed report. It's clear that you put a lot of thought into it. I haven't been monitoring your project as closely as @singha53, so he will probably have more/better comments about your progress report. 😄

pan-chu commented 7 years ago

Hi Santina!

Thanks for the prompt feedback.

I might be mistaken, correct me if so! By bootstrapping we really mean resampling. Maybe we can synthesize additional samples for the under-represented group by sampling the existing ones? I admit these results are super new (as of yesterday), so we haven't had the chance to give enough thought to how to handle the imbalanced data. Do you have more resources / suggestions for us to consider?
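One way to read "sampling the existing ones" is random oversampling: drawing from the smaller group with replacement until the groups are the same size. The snippet below is only a rough sketch of that idea, not something from our repo; the function name `oversample_minority`, the sample IDs, and the group sizes are all made up for illustration. The duplicated samples carry no new information, so this only rebalances the groups.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Resample every smaller class with replacement up to the largest class size."""
    rng = random.Random(seed)

    # Group the samples by their class label.
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)

    target = max(len(members) for members in by_class.values())

    out_samples, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra copies (with replacement) only for classes below the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for sample in members + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: 6 control samples and 2 asthma samples.
samples = [f"s{i}" for i in range(8)]
labels = ["control"] * 6 + ["asthma"] * 2

balanced_samples, balanced_labels = oversample_minority(samples, labels)
print(Counter(balanced_labels))
```

If we go this route, the resampling would probably need to happen only within training data, so that copies of the same sample never end up on both sides of a validation split.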

So far, we have each been investigating our own individual components, hence the directory organization. We are looking to have something more integrated and intuitive by the end. Sorry about the messiness! Hopefully, we’ll have an automated pipeline that can perform the entire analysis in a single call.

I didn’t get a chance last night to prepare the markdown file for the preliminary differential network analysis & differential expression analysis using limma. I have pushed the .md file now, please take a look. Preliminary dina analysis

My mistake: the 6 genes were identified using a stringent FDR threshold of 0.05, which is unlikely to yield a large number of genes anyway. We could potentially raise the FDR threshold to obtain more genes for analysis. But our plan, as it stands, is to identify relevant pathways and use the genes that are associated with these pathways. The pathways would be identified by running a Gene Ontology (GO) function analysis on the most likely differentially expressed genes. So, to sum up:


  1. identify the differentially expressed genes
  2. identify the pathways associated with the most strongly differentially expressed genes (low p-value / high effect size)
  3. validate that these pathways are important in asthma by doing literature review
  4. use the genes associated with these pathways for differential network analysis
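
To make the effect of the FDR cutoff in step 1 concrete, here is a small sketch of the Benjamini-Hochberg adjustment (the default adjustment method in limma's `topTable`, as far as I know). It is only an illustration: the gene names and p-values are invented, and `benjamini_hochberg` and `select_genes` are hypothetical helpers, not code from our analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_genes(p_by_gene, fdr):
    """Keep the genes whose adjusted p-value is at or below the FDR cutoff."""
    genes = list(p_by_gene)
    adjusted = benjamini_hochberg([p_by_gene[g] for g in genes])
    return [g for g, q in zip(genes, adjusted) if q <= fdr]


# Invented raw p-values for five hypothetical genes.
p_by_gene = {"geneA": 0.001, "geneB": 0.008, "geneC": 0.039,
             "geneD": 0.041, "geneE": 0.5}

print(select_genes(p_by_gene, fdr=0.05))  # stricter cutoff
print(select_genes(p_by_gene, fdr=0.10))  # relaxed cutoff
```

With these made-up numbers, relaxing the cutoff from 0.05 to 0.10 doubles the gene list, at the cost of tolerating a higher expected share of false positives among the selected genes.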

For the integration of RNA-seq and methylation data, we will:

  1. map methylation sites to genes
  2. for each gene, we will have 2 vectors (expression level across the samples & methylation level across the samples)
  3. networks will be built using these 2 different data types
    1. each “vector” consisting of either expression values or methylation values will be a node in the network
    2. connections between nodes will simply be Pearson's correlation coefficient
    3. so in the network, some nodes will correspond to expression data and some to methylation data
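
As a rough sketch of the network described above: each gene contributes an expression node and a methylation node, and every pair of nodes is connected with a weight equal to Pearson's correlation across samples. The values, the gene names, and the `build_network` helper are all hypothetical, and this assumes methylation sites have already been mapped to genes and summarized to one value per gene per sample.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    # Note: a constant vector (zero variance) would cause a division by zero here.
    return cov / (sd_x * sd_y)


def build_network(expression, methylation):
    """Return {(node_a, node_b): correlation} over all pairs of nodes.

    A node is a (gene, data_type) tuple, so expression and methylation
    vectors for the same gene are separate nodes.
    """
    nodes = {}
    for gene, values in expression.items():
        nodes[(gene, "expr")] = values
    for gene, values in methylation.items():
        nodes[(gene, "meth")] = values
    return {(a, b): pearson(nodes[a], nodes[b]) for a, b in combinations(nodes, 2)}


# Invented values for two genes measured across the same four samples.
expression = {"geneA": [1.0, 2.0, 3.0, 4.0], "geneB": [2.0, 1.0, 4.0, 3.0]}
methylation = {"geneA": [4.0, 3.0, 2.0, 1.0], "geneB": [0.2, 0.4, 0.1, 0.9]}

edges = build_network(expression, methylation)
for (a, b), r in edges.items():
    print(a, b, round(r, 3))
```

In this toy example the expression and methylation nodes for `geneA` are perfectly anti-correlated, which is the kind of cross-data-type edge the combined network is meant to expose.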

Thanks again for your feedback!! :)

singha53-zz commented 7 years ago

@STAT540-UBC/team-undecided some comments and suggestions

Aim 1: Data preprocessing

Implement the Hidden Covariates with Prior (HCP) algorithm (Mostafavi et al., written in MATLAB) to adjust for known (GC bias, age, sex) and unknown (comorbidities, batch effects) covariates.

Aim 2: Patient clustering

I don’t quite agree with the clustering analysis:

Aim 3: Hypothesis testing

Gene filtering

Network construction

Looking at: https://github.com/STAT540-UBC/team_Undecided/blob/master/Arjun_Scripts/WGCNA.Rmd

Differential expression analysis using Limma

Looking at: https://github.com/STAT540-UBC/team_Undecided/blob/master/Eric_Scripts/preliminary_dna_analysis.md