Data Preparation

1A. RNASeq Data
  a. Quality Control
  b. Aggregating SRR to SRX
  c. Normalisation
  d. Clustering

1B. GO Annotation Data
  a. Binary classification
  b. Training and Testing Data Split
How to run the algorithm:
Note that the optimise_impute function, which performs the coarse sweep and k-fold validation in Chunk 1 of YeastImputation-Blinded.Rmd, has a run time of up to 6 hours because Cluster 2 has ~2000 gene members. For a faster runtime, search for "runtime" in functions.R, un-comment the line right below it, and run the source("functions.R") line.
Data preparation is divided into two main processes: RNASeq counts data processing and Gene Ontology Annotation data processing. The processed data are saved as .rds files to be used in the downstream processes.
Saccharomyces cerevisiae count and quality metrics data were downloaded from DEE2. The downloaded count data, in long format, are converted to wide format for easier matrix manipulation.
The quality data allows for filtering high-quality datasets. Databases were filtered and quantified according to QC_SUMMARY = PASS, WARN, or FAIL. For all the downstream processes, only datasets tagged PASS are used.
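This filter can be sketched in a few lines; the column names `SRR_accession` and `QC_summary` are assumptions for illustration, based on the DEE2 metadata description in this document.

```r
# Sketch of the QC filter: keep only datasets whose QC summary is PASS.
# Column names are illustrative, not the verified DEE2 schema.
qc <- data.frame(
  SRR_accession = c("SRR0001", "SRR0002", "SRR0003"),
  QC_summary    = c("PASS", "WARN", "FAIL")
)
passed <- qc$SRR_accession[qc$QC_summary == "PASS"]
```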
Sequence Read Archive (SRA) accession codes are also provided in the quality metrics data. These codes describe the source and type of each dataset.

Accession code format: aRbx (examples: SRR#, SRX#, ERR#, DRR#)

First letter (represented by 'a' above) — the source database to which the sample was originally uploaded, before being synchronised with the other two databases:
- S – NCBI's SRA database
- E – EBI's database
- D – DDBJ database

Second letter — always the letter R.

Third letter (represented by 'b' above) — the type of data represented:
- R – Run
- X – Experiment
- S – Sample
- P – Project / study

x – a unique accession number.

Some datasets have multiple runs (SRR) of the same experiment (SRX). Multiple runs, if any, were aggregated to their corresponding experiment using the srx_agg function.
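The aggregation idea can be sketched as follows. The actual srx_agg comes from the getDEE2 issue linked under the function descriptions; the argument names here are assumptions for illustration.

```r
# Sketch of run-to-experiment aggregation: counts for runs (SRR) belonging
# to the same experiment (SRX) are summed. Argument names are illustrative.
srx_agg_sketch <- function(counts, metadata) {
  # counts: gene x run matrix; metadata: data.frame of SRR/SRX accessions
  sapply(unique(metadata$SRX_accession), function(srx) {
    runs <- metadata$SRR_accession[metadata$SRX_accession == srx]
    runs <- intersect(runs, colnames(counts))
    if (length(runs) > 1) rowSums(counts[, runs]) else counts[, runs]
  })
}

counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("g1", "g2"), c("SRR1", "SRR2", "SRR3")))
metadata <- data.frame(SRR_accession = c("SRR1", "SRR2", "SRR3"),
                       SRX_accession = c("SRX1", "SRX1", "SRX2"))
srx_agg_sketch(counts, metadata)  # gene x SRX matrix; SRX1 = SRR1 + SRR2
```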
Library size bias was accounted for via TMM normalisation using the DESeq2 library, and the normalised counts were then scaled with R's base scale() function.
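A minimal, dependency-free sketch of this step, assuming median-of-ratios size factors (the method DESeq2 implements via estimateSizeFactors(); TMM, strictly speaking, is edgeR's calcNormFactors() method) followed by base scale():

```r
# Sketch of normalisation + scaling under the assumptions stated above.
# Size factors follow the median-of-ratios idea; scale() standardises each
# column (the original analysis may scale genes instead of samples).
normalise_counts <- function(counts) {
  gm <- rowMeans(log(counts))          # per-gene geometric mean (log scale)
  keep <- is.finite(gm)                # drop genes with any zero count
  size_factors <- apply(counts[keep, , drop = FALSE], 2, function(col)
    exp(median(log(col) - gm[keep])))
  norm <- sweep(counts, 2, size_factors, "/")  # divide each sample by its factor
  scale(norm)
}
```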
After the data quality control and normalisation processes, the counts data are clustered using hierarchical agglomerative cluster analysis (HAC) with the hclust function. hclust performs agglomerative clustering, also known as AGNES (Agglomerative Nesting), which outputs a tree that can be plotted as a dendrogram. It is good at identifying small clusters.
Spearman correlation was used to compute the distance matrix. Complete (maximum) linkage clustering was used to measure the dissimilarity between two clusters of observations; this is also referred to as the linkage method.
Note: Maximum or complete linkage clustering computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and takes the largest (maximum) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters, which is ideal for determining close relationships between gene expressions. (Reference is the link above.)
Note 2: The tighter and denser the clusters are internally, and the sparser the space outside them (or the farther apart the clusters are), the greater the internal validity.
A series of cluster sizes (total clusters, or K) from 50 to 2000 is passed to the cl_lengthCut function, which outputs a list of all the gene IDs tagged as belonging to a cluster.
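The distance and linkage choices above can be sketched with base R. The expression matrix and K value are illustrative, and cutree() stands in for the repo's cl_lengthCut:

```r
# Sketch of the clustering step: 1 - Spearman correlation as the
# dissimilarity, complete (maximum) linkage, then cut into K clusters.
set.seed(1)
expr <- matrix(rnorm(200), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))

# Spearman correlation between genes, converted to a dissimilarity
d <- as.dist(1 - cor(t(expr), method = "spearman"))

# Hierarchical agglomerative clustering with complete linkage
hc <- hclust(d, method = "complete")

# Cut the tree into K clusters: a named vector of gene ID -> cluster number
K <- 5
clusters <- cutree(hc, k = K)
```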
The Gene Ontology Annotation File (GAF) format of yeast was downloaded from geneontology.org.
A binary matrix was constructed with the Entrez Gene IDs (NCBI Gene IDs) as the row names and GO IDs as the column names (wide format). If a GO ID is associated with a Gene ID, the cell equals 1; otherwise it is 0. Gene IDs present in the GAF but not in the counts data were removed.
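The construction can be sketched as follows; the Gene IDs, GO IDs, and the `counts_genes` vector are illustrative, not taken from the actual GAF or counts data.

```r
# Sketch of the binary gene x GO-term matrix from annotation pairs.
annotations <- data.frame(
  GeneID = c("851262", "851262", "853506", "856002"),
  GO_ID  = c("GO:0016310", "GO:0005634", "GO:0016310", "GO:0006412")
)
go_matrix <- unclass(table(annotations$GeneID, annotations$GO_ID))
go_matrix[go_matrix > 0] <- 1  # 1 if the GO ID is annotated to the gene

# drop Gene IDs that are absent from the counts data (illustrative vector)
counts_genes <- c("851262", "853506")
go_matrix <- go_matrix[rownames(go_matrix) %in% counts_genes, , drop = FALSE]
```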
Ten per cent of the gene IDs from the binary GO matrix will be set aside as testing data and the remaining 90 per cent will be used as training data for the algorithm.
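The split can be sketched in base R (the gene IDs here are illustrative):

```r
# Sketch of the 90/10 train/test split over gene IDs.
set.seed(42)
gene_ids  <- paste0("gene", 1:100)
test_ids  <- sample(gene_ids, length(gene_ids) %/% 10)  # 10% held out
train_ids <- setdiff(gene_ids, test_ids)                # remaining 90%
```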
Cluster Level Analysis
This explanation focuses on one cluster; the algorithm is repeated for the rest of the clusters.
The correlation of the count data for all Gene IDs belonging to cluster 1 is computed using the Spearman method. This produces an n x n matrix with rows and columns as Gene IDs.
The correlation matrix is turned into an edge list using the igraph library. An edge list contains pairs of genes with their corresponding correlation value. Each gene ID is filtered from this list along with the gene IDs it is paired with. The correlation value is treated as the weight of the gene-to-gene connection, resulting in a network for each cluster.
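A dependency-free sketch of the same pairs-plus-weights structure (the pipeline's edge_list function uses igraph; base R is used here so the example stands alone):

```r
# Sketch: correlation matrix -> edge list (one row per unordered gene pair,
# with the correlation value as the edge weight).
set.seed(1)
expr <- matrix(rnorm(50), nrow = 5,
               dimnames = list(paste0("gene", 1:5), NULL))
cmat <- cor(t(expr), method = "spearman")

idx <- which(upper.tri(cmat), arr.ind = TRUE)
edge_list <- data.frame(
  from   = rownames(cmat)[idx[, 1]],
  to     = colnames(cmat)[idx[, 2]],
  weight = cmat[idx]
)
```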
The GO terms for the cluster are filtered, and the edge weights are multiplied across the 1s and 0s of each GO ID column.
After assigning the weights, the values are averaged (the optimal method is still being worked out), and a threshold determines the cutoff value for the imputation process.
All values equal to or above the threshold are assigned 1 (meaning the gene is annotated with the GO ID) and all values below are assigned 0 (meaning the gene does not belong to the GO ID). This is repeated for all gene IDs until a new GO annotation matrix, with the same dimensions as the original but with imputed values, is created.
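For a single gene and GO term, the weight-then-threshold idea can be sketched as below. The neighbour names, weights, and 0.5 threshold are illustrative, and a plain weighted mean stands in for the averaging method, which the text notes is still being finalised.

```r
# Sketch of the weighted vote for one gene / one GO term.
# correlation of the target gene with its cluster neighbours (edge weights)
weights <- c(neigh1 = 0.9, neigh2 = 0.7, neigh3 = 0.4)
# neighbours' annotations for the GO term (1 = annotated, 0 = not)
go_col  <- c(neigh1 = 1, neigh2 = 0, neigh3 = 1)

score <- sum(weights * go_col) / sum(weights)  # weighted average
threshold <- 0.5
imputed <- as.integer(score >= threshold)      # 1 at/above the cut, else 0
```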
The imputation process will then be optimised using the techniques below.
A series of cluster sizes and the threshold values is used to find the optimal parameters for imputation.
K-fold cross validation at K = 10 is used to check whether the chosen parameters yield optimal values. The GO annotation for a cluster is divided into 10 parts. Part 1 of the gene IDs has its annotations set to zero, and this dataset is used to run the first fold of the cross validation. Parts 2 to 10 are used in the same manner until all 10 folds are complete.
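The fold construction can be sketched as follows; the gene IDs and GO matrix are illustrative, and blind_fold is a hypothetical helper, not a repo function:

```r
# Sketch of the 10-fold blinding scheme: gene IDs are split into 10 parts,
# and in each fold one part has its GO annotations zeroed out.
set.seed(7)
gene_ids <- paste0("gene", 1:50)
folds <- split(sample(gene_ids), rep(1:10, length.out = length(gene_ids)))

blind_fold <- function(go_matrix, fold_genes) {
  go_matrix[rownames(go_matrix) %in% fold_genes, ] <- 0
  go_matrix
}
```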
To measure the performance of each parameter pair, Accuracy, Precision, Recall, and the F1 score are computed from a confusion matrix.
Predicted values = the GO matrix with the imputed values after running the algorithm.
 | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | True Positive (TP) | False Negative (FN) |
Actual 0 | False Positive (FP) | True Negative (TN) |
To solve for TP and TN the following equation is used: TP or TN = Actual + Predicted
Addition | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | $\color{red}{\text{2}}$ | 1 |
Actual 0 | 1 | $\color{red}{\text{0}}$ |
To solve for FP and FN the following equation is used: FP or FN = Actual - Predicted
Subtraction | Predicted 1 | Predicted 0 |
---|---|---|
Actual 1 | 0 | $\color{red}{\text{1}}$ |
Actual 0 | $\color{red}{\text{-1}}$ | 0 |
Therefore, via matrix addition and subtraction:
diff <- Actual - Predicted
sum <- Actual + Predicted
TP <- length(which(sum == 2)) #True positive
TN <- length(which(sum == 0)) #True negative
FP <- length(which(diff == -1)) #False positive
FN <- length(which(diff == 1)) #False negative
# Precision/Positive Predictive Value (PPV)
PPV <- TP/(TP+FP)
# Accuracy (ACC)
ACC <- (TP+TN)/(TP+TN+FP+FN)
# F1 score (is the harmonic mean of precision and sensitivity)
F1 <- (2*TP)/((2*TP)+FP+FN)
# Recall
Recall <- TP/(TP+FN)
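A worked run of the arithmetic above on tiny illustrative matrices, mirroring the snippet's variable names:

```r
# Tiny worked example of the matrix addition/subtraction trick.
Actual    <- matrix(c(1, 0, 1, 1, 0, 0), nrow = 2)
Predicted <- matrix(c(1, 0, 0, 1, 1, 0), nrow = 2)

diff <- Actual - Predicted     # 1 marks a FN, -1 marks a FP
sum  <- Actual + Predicted     # 2 marks a TP, 0 marks a TN
TP <- length(which(sum == 2))
TN <- length(which(sum == 0))
FP <- length(which(diff == -1))
FN <- length(which(diff == 1))
ACC <- (TP + TN) / (TP + TN + FP + FN)
```

Here TP = TN = 2 and FP = FN = 1, giving an accuracy of 2/3.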
Functions

srx_agg - Multiple runs, if any, are aggregated to their corresponding experiments. Function from https://github.com/markziemann/getDEE2/issues/7#issuecomment-639290203
cl_lengthCut - This function's output is a list grouped by total clusters which gives (1) a table of GeneIDs assigned to a cluster number, (2) the cutree denominator used to achieve that total cluster count, and (3) the tally of the total Gene IDs belonging to each cluster number.
Dependencies: hclust() of the stats library

wcorr_cluster - assigns "weights" to the GO matrix of a cluster by multiplying all the correlation values of a gene from the edge_list() output; it then takes a weighted average of the matrix to be imputed later with a threshold value. This function's output is a nested list of genes belonging to a cluster, with a tally of how many correlation values passing the threshold connect the gene to a GO term.
Dependencies: edge_list()

impute - takes the output from wcorr_cluster(), applies the threshold, and assigns 1 to values equal to or above the cut and 0 otherwise. This produces a binary matrix similar to the original GO annotation matrix, but with the 1s and 0s assigned by the algorithm: genes that belong to a GO ID are marked 1, and 0 if not. This function's output is a nested list of binary matrices for (1) the original GO annotations of the genes in a cluster, (2) the imputed GO binary matrix, (3) the blinded GO matrix from cross_val(), and (4) a matrix from matrix 4 subtracted by matrix 1.
Dependencies: GO_per_cl(), GO_per_cl_blinded(), wcorr_cluster
optimise_impute - takes a series of cluster sizes and thresholds and applies each pair to a cross-validation process. This function outputs a nested list of prediction scores from a 10-fold validation process. Scores are grouped by parameter pair, each consisting of a total cluster count and a threshold value. Each threshold value is applied to each total cluster value.
Dependencies: cl_lengthCut() (which gives a table of GeneIDs assigned to clusters), corr_per_clust, GO_per_cl, GO_per_cl_list, cross_val
corr_per_clust - takes the counts data of all genes in a cluster to make a correlation matrix. This function's output is a nested list of the genes grouped per cluster and their corresponding correlation matrices.
Dependencies: cl_lengthCut()
GO_per_cl - subsets the bigger GO binary annotation matrix into a smaller matrix that only contains the gene IDs and corresponding GO IDs of a cluster. This function's output is a list of data frames containing all GO terms associated with the genes belonging to a particular cluster.
Dependencies: cl_lengthCut()
GO_per_cl_list - lists all the GO IDs of a cluster and filters out all columns (GO IDs) that sum to zero, to reduce computing load. This function's output is a nested list of GO terms (column names) grouped per cluster from GO_per_cl().
Dependencies: GO_per_cl()
mean_Csweep - This function's output is a dataframe of the average value of a specific measure of performance across all cross-validation folds.
Inputs: (1) the cross_val() output; (2) a number from 2 to 11 selecting the measure of performance in the cross_val() output list to be averaged, in the following order: True Positive (TP), True Negative (TN), False Positive (FP), False Negative (FN), Sensitivity (TPR), Specificity (TNR), Precision (PPV), F1 Score (F1), and Recall.
summary_Csweep - gets all the average values from mean_Csweep and creates a summary. This function's output is a dataframe.
Inputs: cross_val() output
Dependencies: mean_Csweep
cross_val - performs a k-fold cross validation. This function's output is a list of the overall performance measures of all folds from stats_all() and a dataframe containing the genes assigned to each fold.
Dependencies: GO_per_cl(), GO_per_cl_blinded, edge_list, impute, stats_cl, stats_all
GO_per_cl_blinded - a derivative of GO_per_cl, this function also subsets from a bigger blinded GO annotation matrix during cross validation. Instead of taking out zero-sum columns, it ensures that the column names of the original GO matrix subset match those of the blinded subset, for dimensional consistency. This function's output is a list of data frames containing all GO terms (from the blinded database) associated with the genes belonging to a cluster.
Dependencies: GO_list_perCl
edge_list - transforms all the correlation matrices of each cluster into an edge list. This function's output is a list containing a dataframe of the gene pairs that make up each edge (listed from and to), with a corresponding correlation value treated as the edge weight. This list is created using the igraph library.
Dependencies: corr_per_clust()
stats_cl - calculates the performance per fold during cross validation (see cross_val). This function's output is a nested list of performance scores grouped per cluster.
Dependencies: impute()
stats_all - This function's output is a list of values measuring the performance of the whole imputation, using the blinded and original data frames.
Dependencies: stats_cl