greenelab / core-accessory-interactome

Investigating the functional relationship between P. aeruginosa core and accessory genes.
BSD 3-Clause "New" or "Revised" License

Apply correlation correction #25

Closed ajlee21 closed 3 years ago

ajlee21 commented 3 years ago

Previously we noticed that when we applied clustering to the gene expression correlation matrix, gene pairs tended to cluster into a single large module. This observation is consistent with a previous study, which found that KEGG (a database that contains genes or proteins annotated with specific biological processes, as reported in the literature) is biased toward some of the biological processes it represents. Figure 1C demonstrates that a large fraction of gene pairs are ribosomal relationships: in the top 0.1% most co-expressed genes, 99% belong to the ribosome pathway. Furthermore, protein function prediction based on co-expression drops dramatically after removing the ribosome pathway (Figure 1A, B). This finding is consistent with what we observed when we calculated the correlation of the raw gene expression data: we found one large, highly correlated module that is likely driven by genes related to a single biological process.

Challenge: This very dominant global signal can mask more specific signals in the data. In this PR we tried several approaches for computing correlations between genes that correct for this signal:

  1. Transform the raw data and then apply correlation
  2. Apply dimensionality reduction using PCA or SVD (SPELL) to the raw data and then apply correlation
  3. Correct for high-degree genes by scaling the correlation matrix (SEEK/Hetio)
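
To make approach (2) concrete, here is a minimal sketch of an SVD-based correction in the spirit of SPELL. It is not the code from the notebooks. It assumes a samples x genes expression matrix, and the function name `svd_corrected_correlation` and the `num_components` parameter are hypothetical. The idea is to correlate genes using their loadings on the singular vectors rather than the raw expression values, so that the dominant component no longer carries more weight than the others.

```python
import numpy as np


def svd_corrected_correlation(expression, num_components):
    """Gene-gene correlation computed on SVD gene loadings.

    expression: array of shape (n_samples, n_genes). This layout is an assumption.
    num_components: number of singular vectors to keep (hypothetical parameter).
    """
    # Center each gene (column) so the SVD describes variation around the mean
    centered = expression - expression.mean(axis=0, keepdims=True)

    # Thin SVD: centered = U @ diag(s) @ Vt
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # One row per gene, one column per retained singular vector.
    # The singular values are deliberately not applied, so every retained
    # component contributes on the same scale.
    gene_loadings = vt[:num_components].T

    # Pearson correlation between gene rows
    return np.corrcoef(gene_loadings)


# Toy data standing in for a real expression compendium
rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 20))
corr = svd_corrected_correlation(expression, num_components=10)
```

With a genes x samples layout the gene loadings would come from `U` instead of `Vt`. How many components to keep is a judgment call that this sketch does not address.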

Main changes:

  1. Performed exploration of different correlation correction methods listed above. The notebooks can be found in archive and the results can be found here: https://docs.google.com/presentation/d/1mLdLk6j3C-XyoxsKvJ0Db7_cbNlMZi7eaNrCH6O8LcM/edit#slide=id.p

  2. Final correlation correction analysis using SPELL method: 1_correlation_analysis.ipynb

There is a lot happening in (1), but most of that code is redundant with what is in (2). I wanted to keep the (1) notebooks for reference, but they shouldn't need much of a review. I would focus on the changes in (2).