Closed jaxels20 closed 1 year ago
Link prediction implementation and evaluation https://docs.dgl.ai/en/0.8.x/tutorials/blitz/4_link_predict.html
chatgpt: In general, vertex embeddings are more commonly used for link prediction tasks because they can capture local information about nodes and their relationships, which is often more informative for making link predictions. However, graph embeddings can be useful in some cases, especially when global information about the graph structure is important for making accurate predictions.
simpel approach:
Cannot get access to the paper, but they use layer attention graph convolutional network (LAGCN) for the drug–disease association prediction. They use both drug-diseaase, drug-drug, and disease-disease associations to predict drug-disease links. They improved the state-of-art methods by using LAGCN (ROC=0.8750), however it is from October 2020. https://academic.oup.com/bib/article-abstract/22/4/bbaa243/5918381?login=false#no-access-message
This book Graph Embedding for Pattern Analysis is a available through AUB online: https://kbdk-aub.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma9920605871405762&context=L&vid=45KBDK_AUB:AUB&lang=da&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,Graph%20Embedding%20for%20Pattern%20Analysis&offset=0
The First chapter covers pattern detection and an explanation about Graph embedding. Furthermore it cover how explicit graph embedding has been approach by the following families of algorithms:
Moreover it covers multilevel analysis for explicit graph embedding which is a method of explicit graph embedding for attributed graphs with many symbolic as well as numeric attributes on both nodes and edges. Cite from the book: "Fuzzy Multilevel Graph Embedding (FMGE) is a straightforward, simple, and computational efficient solution for facilitating the use of graph-based powerful representations together with learning and computational strengths of state-of-the-art machine learning, classification, and clustering."
A Tutorial on Network Embeddings (Paper): https://arxiv.org/pdf/1808.02590.pdf
Gives a description of graph embedding. Provides eight methods for embedding and the embedding learning method within each method (page 8).
REDIRECTION: Generating drug repurposing hypotheses using link prediction with DISNET data (July 27, 2022): https://www.biorxiv.org/content/10.1101/2022.07.26.501105v1.full
They use two types of nodes: phenotypes (diseases and symptoms) and drugs. They use a two-layer convolutional approach, and gets ROC=0.93. They gives ideas on how to improve the model, such as use other types of nodes such as genes, proteins, metabolic pathways, drug-drug interactions, and so on.
This paper wants to achieve the same goal as us. https://www.frontiersin.org/articles/10.3389/fgene.2021.657182/full#B3
Instead of using Graph Embedding, they use graph convolutional neural network to pre-train and generate the graph. Then they use node2vec as network representation and Random forest classifier to obtain predictions.
I also think we could use this paper as some kind of inspiration for structure of the project.
Getting an overview of the architecture of a gnn: https://theaisummer.com/gnn-architectures/
This Article describes embeddings very well and explain different methods for Vetex embedding. Vertex embedding can be divided into factorization approaches, random walk approaches, and deep approaches.
https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007
This article explains Graph Neural Networks (GNN). Furthermore it explains how Graph Convolutional Networks (GCN) works as building blocks for the Graph Autoencoder (GAE) architecture. It also explores and compare the GAE to the Variational Graph Autoencoder (VGAE). Overall, the article is good as it shows how this is implemented in Python and how to evaluate the method(s).
A simple explanation of GNN: https://theaisummer.com/Graph_Neural_Networks/
Link prediction on graph streams. http://www.charuaggarwal.net/ICDE16_research_307.pdf
Resume of Thesis: Robustness of Drug-Disease-Association Network Embeddings - Tenzen Rabgang
Two tasks are performed:
Two experiments are performed:
Graph Embeddings: https://academic.oup.com/bioinformatics/article/36/4/1241/5581350 (Paper) use several biomedical networks and apply 11 different embedding methods on them. Their research paper is especially important for the thesis as they also use the DDA data set and run different embedding methods on it. It is highly used to determine which embedding methods to test.
NDF-RT: They only use the relation May_treat - "The relations within and between the entities are very similar in their meaning; hence, we can presume a connection between them such as e.g., may treat pairs overlap with may prevent, or may diagnose pairs."
They find it easier to work with the data when converting from XML to OWL format, so they convert all 82 versions.
When they start from the premise of a directed drug-disease graph, drug nodes have on average approximately four outgoing links while disease nodes hold around 50 incoming links.
They leave out the MED-RT due to time limitations of making OWL representations.
Embedding Approach and Methods: They select 17 DDA Versions (every 5th) and create embeddings with dimension size 100 for 50 runs. The embedding dimension in graph analysis is typically determined by the user or chosen using heuristics, and it determines the size of the embedding space for the graph. Running the algorithm 50 times likely refers to performing the embedding process multiple times with different random initializations. Running the algorithm multiple times with different random seeds can help to generate more diverse embeddings and reduce the risk of getting stuck in a local optimum. When an embedding algorithm gets stuck in a local optimum, it means that it has converged to a suboptimal solution that may not accurately represent the underlying data.
For the additional experiment, they choose one version (2014.06.02) as the ground truth and select two versions (2009.01.03 and 2018.01.02) as synthetic ones with different noise levels. To compare them, they fist remove all nodes that do not appear in the ground-truth version. The same applies for the synthetic versions, where they remove nodes that exist in the ground-truth but not in the synthetic versions. In the end, the ground-truth and the synthetic versions comprise the same set of nodes but with different edges. We generate ten embeddings of the ground-truth and the synthetic versions.
They choose three different embedding methods:
THE FOLLOWING IS ONLY FOCUSSING ON EVOLUTION-ORRIENTED PERSPECTIVE
Visual Analysis: This step is usually performed to examine how well the embeddings can describe the underlying data. They use PCA to map the embeddings of drugs and diseases into 2D space.
Here they select three versions (2009.01.13, 2014.01.06, and 2018.01.02) of the same seed and compare their representations.
The concept of Explained variance is useful in assessing how important each component is. In general, the larger the variance explained by a principal component, the more important that component is.
Local Neighborhood Similarity: They measure the distance between these vectors with common metrics such as the cosine or euclidean distance. They use the Local Neighborhood similarity to compare embeddings.
The comparison is performed on embeddings of the same version, comparing each one with the remaining embeddings from different runs (10 out of 50 randomly selected runs). They calculate the euclidean and cosine distances between the embeddings of the same version with different seeds. For both distance measures, they obtain a distance matrix for each version instance and set a threshold for the neighborhood such that only embeddings with a distance of less than r are included. The parameter r is defined by the percentiles of each distance distribution. They use the following percentile values: radius = [0:05; 0:1; 1; 10; 20].
With two sets of neighborhoods for each entity, they use the Jaccard index and the overlap coeficient to evaluate their similarity.
They use Welch's t-test and compare the mean of the Jaccard indexes from each version to every other version. As they are running multiple hypothesis tests, they use the Bonferroni method to correct the p values.
Link Prediction: The goal of the prediction model is to differentiate between positive and negative links, which requires training data of both types. However, links are only predicted for known nodes with respectively generated embeddings. Also, while they have access to validated positive samples, in most cases, validated negative samples are unfortunately not provided. Those are often randomly selected from not-linked edges, which enhances the risk of adding unnecessary noise, consequently acting counterproductive.
They extend the link prediction from BioNEV with logical restrictions by
They report the link prediction performance across the evolution where column-wise:
The Paper does not tell how the embedding is applied as the earlier sections did. The figure shows 9 plots: 3 for each embedding method where 0, 3 and 7 step algorithm is performed. There exists 17 box plots on each figure which may represent each version of the NDF-RT, and where AUC-ROC is present on the y-axis. We can probably assume that they are resusing the same embeddings from the Local neighborhood investigation.
They apply Welch's t-test to examine if the mean of the AUC ROC values across versions differs significantly.
This is a Python code defining a model for a recommendation system using a Graph Neural Network (GNN) with heterogeneous nodes and edges. Here's an overview of the code:
The GNN class defines a two-layer GNN, where each layer is a SAGEConv module. The input to the GNN is a feature matrix x and an edge index edge_index that specifies the connectivity between nodes in the graph. The output of the GNN is a new feature matrix that captures the node representations learned by the GNN.
The Classifier class defines a binary classifier that takes as input the node embeddings for a user and a movie, and predicts whether there is an edge between them in the graph. The prediction is based on the dot-product between the two embeddings.
The Model class defines the complete recommendation system model. It consists of three main parts:
a. Two embedding layers for users and movies. These layers are learned during training and map each user and movie node to a vector of hidden_channels dimensions.
b. A GNN module that takes as input the user and movie embeddings and the graph structure, and outputs updated node embeddings for each type of node in the graph.
c. A Classifier module that takes as input the updated user and movie embeddings and makes edge-level predictions based on their dot-product.
The forward method of the Model class takes as input a HeteroData object that contains the graph structure and node features for each node type. It first maps the node IDs to embeddings using the embedding layers, and then passes the embeddings and edge indices to the GNN module to compute updated node embeddings. Finally, it passes the updated node embeddings to the Classifier module to make edge-level predictions.
Overall, this code defines a recommendation system model that can handle graphs with heterogeneous nodes and edges, and uses a GNN to learn node representations that capture the interactions between users and movies in the graph.
Here are some paper that could be relevant and/or interesting to have a lookt at, some of them fairly new. "Computational Drug Reporposing" seems to be the keyword here:
Resume: A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information
This Paper aims to develop a computational pipeline called DTINet. DTI stands for drug-target interactions. The DTINet pipeline is developed in order to predict novel drug-target interactions. Drug-target interactions refer to the interactions between a drug molecule and its target molecule in the body, which can include proteins, enzymes, receptors, or other biomolecules.
The main reason why computational prediction of DTI's has become interesting and important in drug-discovery or drug repurposing is the “guilt-by-association” assumption, which states that similar drugs may share similar targets and vice versa. Thereby DTI prediction is also often formulated as a binary classification task. The therapeutic effect of drugs on diseases may reflect that they have binding activities to the targets, and thereby provide new insights in drug repurposing.
Several computational strategies has been presented in order to integrate heterogeneous data sources to predict DTIs, but the methods generally fails due to lack of expert knowledge and handling bias & noise. The DTINet overcome these problems by learning low-dimensional and informative vector representations of features for both drugs and proteins and integrates information from heterogeneous data sources.
DTINet integrates the information from heterogeneous network by combing Random Walk with restart (RWR) and diffusion component analysis (DCA).
RWR is used to calculate the relative importance of nodes in a graph, with a focus on identifying "seed" nodes that are most closely related to a given query node. RWR is used for Network Difussion which is a process where information are spread through a network, based on the topology of the network and the properties of its nodes and edges.
The DCA algorithm first constructs a graph using the data points and their similarities. It then performs a random walk on the graph, starting from a query node, and calculates the stationary probability distribution of the walker. The resulting probability distribution reflects the importance of each data point in the data space. The algorithm then performs a singular value decomposition (SVD) on the diffusion kernel, which transforms the original high-dimensional dataset into a lower-dimensional space, which can be used for classification tasks. The DCA is algorithm is used for obtaining low-dimensional vector representation.
Drugs with similar topological properties in the heterogeneous network are more likely to be functionally correlated.
They compare DTINet with four state-of-the-art methods for DTI prediction, including BLMNII, NetLapRLS, HNM and CMF. The comparative results shows that DTINet outperforms other existing methods, with 5.9% higher AUROC and 5.7% higher AUPR than the second best method. The originally collected data sets may contain homologous proteins or similar drugs, which is a potential concern. The good predictions might result from easy predictions. The conclusion however is that DTINet outperforms other methods, even without similar drugs and targets.
Some stuff for our discussion: "Although the drug–disease relationships may provide more direct indications of existing drugs, knowing the explicit DTIs can shed light on the underlying pharmacological mechanisms, which are important for understanding both therapeutic and adverse effects of the corresponding drugs."
Here are the three papers that I mentioned yesterday in the meeting. They are all concerned with comparing predictions in a more meaningful manner than just performance. The short answer for getting a number for this comparison is a annotator agreement measure like Cohen's Kappa - this is meant for comparing two predictions (not to a baseline) to understand how these are different from each other, when they have a similar performance.
This paper tries to investigate the if meta paths really are necessary for heterogeneous graph embedding, Meta paths are often used to to do random walks on H-graphs, but this paper argues that it creates more problems than it solves in practice. They argue that meta paths often rely on expert information which in practice can be infeasible. They a proposing the JUST algorithm for embedding nodes with random walks over H-graph. JUST is a heterogeneous graph embedding technique where random walks are performed with JUmp and STay strategies. They conclude that JUST performs better than state of ther art H-graph embedding methods (in particular the hin2vec) and they also observe a faster (3x) embedding
"Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" https://pubmed.ncbi.nlm.nih.gov/27153662/
"Generally speaking, the overall prediction procedure of MBiRW consists of three steps described as follows. Based on the collected dataset, comprehensive similarity measures are firstly implemented to measure similarity for drugs and diseases. Then, a heterogeneous network consisting of drug similarity network, disease similarity network and drug–disease interactions, is constructed. Last, based on the heterogeneous network, BiRW is implemented to rank candidate diseases for drugs."
The method is based on the assumption that similar drugs often indicate similar diseases.
"For this dataset, there are 1933 known drug–disease associations involving 593 drugs registered in DrugBank and 313 diseases listed in the Online Mendelian Inheritance in Man (OMIM)"
Drug similarity measures are computed based on the chemical structure of drugs. They use the "Chemical Development Kit" to calculate the similarity measures between two drugs as their Tanimoto score of their 2D structure. Known drug-disease associations are incorporated in the similarity scores, such that similarity measures lower than a threshold is almost ignored using the logistic function (transform low similarities into some values close to 0). Threshold: "The range of value of $Sim_{r^p}$, which is $[0,1]$, is divided into ten equal subranges. For each subrange, the percentage of drug pairs sharing common diseases is calculated." Do this 10 times using Fisher-Yates shuffle to get randomized drug-similarity measures, and the average is the threshold. They create a weighted network with drugs where edges denote the number of common shared diseases. They cluster the drugs using ClusterONE and cluster statistics is used to calculate the final drug similarity measure of a drug pair. The same process is used for disease similarity, but the initial similarity measures are based similarity between MeSH terms appearing in medical descriptions.
They create a drug and disease similarity network, where the weights on edges between drugs are the comprehensive drug similarity measures and weights on edges between diseases are the comprehensive disease similarity measures. Weights on edges from drug to diesases is initially set to 1 if there exist a known association, and 0 otherwise.
They use a bi-directional random walk (BiRW) simultaneously on the two similarity graphs which outputs a score which represents the probability of an association between a drug-disease pair. BiRW performs a left walk in the drug network and a right walk in the disease network. Each walk returns a value. "The value of element $left_RD_t(i,j)$ and $right_RD_t(i,j)$ denotes the probability that drug $r_i$ associates with disease $d_j$."
They use 10-fold cross validation. AUC=0.917
If we should compare improve State-of-art methods, we could potentially look at the following papers, which is applying different embedding methods and link-prediction methods: https://www.frontiersin.org/articles/10.3389/fgene.2021.657182/full#B3 https://www.merlin.uzh.ch/publication/show/22316 https://deepai.org/publication/graph-embedding-on-biomedical-networks-methods-applications-and-evaluations https://www.mdpi.com/2218-273X/12/10/1497
We have already discussed all the papers before. We can potentially combine different embedding methods and link prediction methods, and see if different combinations of already approved methods, can be improved.
Layer attention graph convolutional network, here it is used in miRNA association predictions: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-022-01807-8
https://www.nature.com/articles/s42256-020-00285-9
Goal Proposition of a mechanism-driven neural network-based method called DeepCE in order to model chemical substructure–gene and gene–gene associations. To demonstrate the value of DeepCE, they apply it to drug repurposing of COVID-19.
Overall architecture of DeepCE The architecture is divided into three parts:
Feed-forward refers to the flow of information through the network in a single direction, from the input layer to the output layer. Multihead-head attention is used in Neural Network to learn contextual relationships between words in a sequence.
Experiments They present several baseline models including linear models, vanilla neural network and KNN, which they will compare performance measures to.
Result
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04655-5
Goal The goal is to devlop a heterogeneous graph neural network model, named as HGDTI, which includes a learning phase of network node embedding and a training phase of Drug Target Interaction (DTI) classifcation.
Methods and Architecture of HGDTI In this work, the dataset is a heterogeneous graph composed of various nodes and edges. They include drug, protein, disease and side-effect as node types. Furthermore, they include interaction, similarity and association as edge types. node_type = v, edge_type = r.
HGDTI consists of the following four main steps:
Experiments They compare the performance of HGDTI with five predictive models, including NeoDTI, DTINet, MSCMF, NetLapRLS and BLMNII.
Results
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04911-8
Goal This paper aims to introduce a novel end-to-end model called Graph convolution network based on a multimodal attention mechanism (GCMM). A Graph Convolution Network encoder is used to learn how diseases and drugs are embedded in various perspectives. It aims to incorporate the following:
All into a heterogeneous network.
Methods and Architecture of GCMM In this paper, the problem of drug–disease prediction is treated as the recommendation task from a HN with drugs and diseases as nodes, and interactions or relationships as edges.
The Architecture of GCMM can be splitted up into five parts:
Experiments Known drug–disease association pairs are taken as the positive samples and other pairs as negative instances. Due to the low density of the dataset, 5 fold cross validation is used to evaluate the prediction performance on all positive samples and randomly selected negative instances of the same size. Furthermore, four deep-learning models, including DeepDR, NeoDTI, LAGCN, and NIMGCN are chosen as baseline approaches in order to demonstrate the superiority of GCMM’s performance. They evaluate on the following metrics: Auc, Aupr, F1, ACC, Recall, Precision and Specifcity.
Results The average AUC score is about 0.90, and the average AUPR score is about 0.91. In general, the GCMM has the best performance according to all metrics and in all positive-negative sample ratioes, with some few exceptions.
Furtheremore, they perform some parameter tuning, and test which is the best. The parameters they test are:
https://academic.oup.com/bioinformatics/article/35/24/5191/5497253
Goal This paper aims to develop network-based deep-learning approach, termed deepDR, for in silico drug repurposing by integrating 10 networks: one drug–disease, one drug-side-effect, one drug–target and seven drug–drug networks.
Here we can add the papers we find