markziemann / Gene-function-imputation

Gene function imputation by coexpression
3 stars 0 forks source link

Table of Contents

Project Files

  1. Data Preparation
    1A. RNASeq Data a. Quality Control b. Aggregating SRX to SRR c. Normalisation d. Clustering

    1B. GO Annotation Data a. Binary clasiffication b. Training and Testing Data Split

  2. Imputation
    2A. Correlation per cluster
    2B. Assigning weight to GO terms
    2C. Assigning a threshold
    2D. Imputation process
  3. Optimisation
    3A. Parameters to be optimised
    3B. Kfold validation
    3C. Prediction Scores
    a. Confusion Matrix
  4. Functions
    4A. Data preparation functions
    4B. Imputation functions
    4C. Optimisation functions



Project Files

How to run the algorithm:

  1. Create a folder named "Data" (case sensitive)
  2. Open and run YeastDataPrep.Rmd
  3. Open and run YeastImputation-Blinded.Rmd

Note that the optimise_impute function which performs the coarse sweep and kfold validation on Chunk 1 of YeastImputation-Blinded.Rmd have run times that last up to 6 hours because of Cluster 2 having ~2000 gene members.

For a faster runtime, search "runtime" on functions.r and un-comment the line right below and run the source("functions.R") line.

1. Data Preparation

Data preparation is divided into two main processes: RNASeq counts data processing and Gene Ontology Annotation data processing. The processed data are saved as .rds files to be used in the downstream processes.

1A. RNASeq Data

Saccharomyces cerevisiae count and quality metrics data were downloaded from DEE2

Downloaded count data in long format is converted to wide format for easier matrix manipulation.

1Aa. Quality Control

The quality data allows for filtering high quality datasets. Databases were filtered and quantified according to QC_SUMMARY = PASS, WARN, or FAIL. For all the downstream processes, only datasets tagged "PASSED" are used.

1Ab. Aggregating SRR to SRX

Sequence Read Archive (SRA) Accession Codes are also provided in the quality metrics data. These codes describes the source and type of the dataset.
Accession code format: aRbx (Examples: SRR#, SRX#, ERR#, DRR#, etc.)
Where:
First letter (represented by 'a' above) Represents the source database to which the sample was originally uploaded, before being synchronised with the other 2 databases:

 S – NCBI’s SRA database</br>
 E – EBI’s database</br>
 D – DDBJ database</br>

Second Letter

Always the letter R</br>

Third letter (represented by 'b' above) Type of data represented:

R – Run</br>
X – Experiment</br>
S – Sample</br>
P – Project / study</br>

x - unique accession code
Some datasets have multiple runs (SRR) of the same experiment (SRX). Multiple runs, if any, were aggregated to its corresponding experiment using the srx_agg function.

1Ac. Normalisation

Library size bias was accounted for via TMM normalisation using the DESeq2 library and then scaled by R's base function scale().

1Ad. Clustering

After data quality control and normalisation processes, the counts data is clustered using the hierarchical agglomerative cluster analysis (HAC) with the hclust function

hclust uses Agglomerative clustering also known as AGNES (Agglomerative Nesting) that outputs a tree which can be plotted as a dendrogram. It is good at identifying small clusters.

Spearman correlation was used to compute the distance matrix

The complete (maximum) linkage clustering was used to measure the dissimilarity between two clusters of observations. This is also referred to as the linkage method.

Note: Maximum or complete linkage clustering computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters which is ideal in determining close relationships between gene expressions. (Reference is the link above).

Note2: The more tight, dense are clusters inside and the less density is outside of them (or the wider apart are the clusters) - the greater is the internal validity.

Reference

A series of cluster sizes (total clusters or K) from 50 to 2000 is passed to the cl_lengthCut function which outputs a list all the gene IDs tagged to belong to a cluster.

1B. GO Annotation Data

The Gene Ontology Annotation File (GAF) format of yeast was downloaded from geneontology.org.

1Ba. Binary clasiffication

A binary matrix was constructed with the Entrez Gene IDs (NCBI gene IDs) as rownames and GO IDs as the column names (wide format). If a GO ID is associated with a Gene ID, the cell will equal to 1, otherwise it will be zero.
There were Gene IDs present in the GAF that is not present in the counts data. These were removed.

1Bb. Training and Testing Data Split

Ten per cent of the gene IDs from the binary GO matrix will be set aside as testing data and the remaining 90 per cent will be used as training data for the algorithm.

2. Imputation

Cluster Level Analysis

This explanation focuses in one cluster. This algorithm will be repeated for the rest of the clusters.

2A. Correlation per cluster

The correlation of the count data for all Gene IDs belonging to cluster 1 is computed using the Spearman method. This produces an n x n matrix with rows and columns as Gene IDs.

2B. Assigning weight to GO terms in the cluster

The correlation matrix is turned into an edge list using the igraph library. An edge list contains pairs of genes with their corresponding correlation value. Each gene ID is filtered from this list along with the gene IDs it has paired with. The correlation value will be treated as the weight of the gene to gene connection resulting in a network for each cluster.

The GO terms for the cluster is filtered and these values will be multiplied across the 1s and 0s of each GO ID column.

2C. Assigning a threshold

After assigning the weights, the values will averaged (still figuring out the optimal method) and a threshold will determine a cutoff value for the imputation process.

2D. Imputation process

All values euqal to and above the threshold value will be assigned as 1 (meaning the gene is annotated with the GO ID) and all values below will be assigned 0 (meaning the gene is does not belong to the GO ID). This will be repeated with all the gene IDs until a new GO annotation matrix with the same dimensions as the original one but with imputed values is created.

3. Optimisation

The imputation process will then be optimised using the techniques below.

3A. Parameters to be optimised

A series of cluster sizes and the threshold values is used to find the optimal parameters for imputation.

3B. Kfold validation

Kfold cross validation at K = 10 is used to see if the parameters used yield optimal values. The GO annotation for a cluster is divided into 10 parts. Part 1 of the gene IDs will have their annotations set to zero and this dataset will be used to run the first fold of the cross validation. Parts 2 to 10 will be used in the same manner and the process is repeated until it reaches 10 folds.

3C. Prediction Scores and the Confusion Matrix

To measure the performance of each parameter pair, Accuracy, Precision, Recall, and the F1 score will be measured using a confusion matrix.

Confusion Matrix

4. Functions

4A. Data preparation functions

srx_agg - Multiple runs, if any, are aggregated to its corresponding experiment. Function from https://github.com/markziemann/getDEE2/issues/7#issuecomment-639290203


cl_lengthCut - This function's output is a list grouped by total clusters which gives (1) a table of GeneIDs assigned to a cluster number, (2) the cutree denominator to achieve that total cluster, and (3) the tally of the total Gene IDs belonging to a cluster number.

4B. Imputation functions

wcorr_cluster - assigns "weights" to the GO matrix of a cluster by multiplying all the correlation values of a gene from the edge_list() output; it then takes a weighted average of the matrix to be imputed later with a threshold value. This function's output is a nested list of genes belonging to a cluster with a tally of how many correlation values passed the threshold that connects the gene to a GO Term.


impute - takes the output from wcorr_cluster(), applies the threshold, and assigns 1 to values equal to or above the cut and zero if otherwise. This produces a binary matrix similar to the original GO annotation matrix but the assignment of 1s and 0s is created by the algorithm: genes that belongs to a GO ID is marked 1 and 0 if not. This function's output is a nested list of binary matrices for (1) the original GO annotations of genes for a cluster, (2) imputed GO binary matrix, (3) blinded GO matrix from cross_val(), and (4) a matrix from matrix 4 subtracted by matrix 1

4C. Optimisation functions

optimise_impute - takes a series of cluster sizes and thresholds and applies each pair to a cross validation process. This function outputs a nested list of prediction scores from a 10-fold validation process. Scores are grouped by a parameter pair consisting of a total cluster and a threshold value. Each threshold value will be applied to each total cluster value.


corr_per_clust - takes the counts data of all genes in a cluster to make a correlation matrix. This function's output is a nested list of the genes grouped per cluster and their corresponding correlation matrices.


GO_per_cl - subsets the bigger GO binary annotation matrix into a smaller matrix that only contains gene IDs and corresponding GO IDs of a cluster. This function's output is a list of data frames containing all GO terms associated with the genes belonging to a particular cluster.


GO_per_cl_list - lists down all the GO IDs of a cluster; filters out all columns (GO IDs) that sums up to zero to reduce computing load. This function's output is a nested list of GO terms (column names) grouped per cluster from GO_per_cl()


mean_Csweep - This function's output is a dataframe of the average value of a specific measure of performance from all cross validation folds


cross_val - performs a kfold cross validation. This function's output is a list of the overall performance measures of all folds from stats_all() and a dataframe containing the genes assigned to each fold.


GO_per_cl_blinded - a derivative of GO_per_cl, this function also subsets from the a bigger blinded GO annotation matrix during cross validation. Instead of taking out zero sum columns, this function makes sure that the column names of the original GO matrix subset is equal to the blinded subset for dimensional consistency. This function's output is a list of data frames containing all GO terms (from the blinded db) associated with the genes belonging to a cluster.


edge_list - transforms all the correlation martrices of each cluster into a an edge list. This function's output is a list that contains a dataframe with the pair of genes that makes up an edge (listed from and to) with a corresponding correlation value treated as the edge weight. This list is created using the igraph library.


stats_cl - calculates the performance per fold during cross validation (see cross_val). This function's output is a nested list of performance scores grouped per cluster.


stats_all - This function's output is a list of values measuring the performance of the whole imputation using the blinded and original data frames.