jaxels20 / link-predection-on-dda

MIT License
0 stars 0 forks source link

Papers #1

Closed jaxels20 closed 1 year ago

jaxels20 commented 1 year ago

Here we can add the papers we find

jaxels20 commented 1 year ago

Link prediction implementation and evaluation https://docs.dgl.ai/en/0.8.x/tutorials/blitz/4_link_predict.html

chatgpt: In general, vertex embeddings are more commonly used for link prediction tasks because they can capture local information about nodes and their relationships, which is often more informative for making link predictions. However, graph embeddings can be useful in some cases, especially when global information about the graph structure is important for making accurate predictions.

simpel approach:

Kasper98-png commented 1 year ago

Cannot get access to the paper, but they use layer attention graph convolutional network (LAGCN) for the drug–disease association prediction. They use both drug-diseaase, drug-drug, and disease-disease associations to predict drug-disease links. They improved the state-of-art methods by using LAGCN (ROC=0.8750), however it is from October 2020. https://academic.oup.com/bib/article-abstract/22/4/bbaa243/5918381?login=false#no-access-message

rtni20 commented 1 year ago

This book Graph Embedding for Pattern Analysis is a available through AUB online: https://kbdk-aub.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma9920605871405762&context=L&vid=45KBDK_AUB:AUB&lang=da&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=any,contains,Graph%20Embedding%20for%20Pattern%20Analysis&offset=0

The First chapter covers pattern detection and an explanation about Graph embedding. Furthermore it cover how explicit graph embedding has been approach by the following families of algorithms:

Moreover it covers multilevel analysis for explicit graph embedding which is a method of explicit graph embedding for attributed graphs with many symbolic as well as numeric attributes on both nodes and edges. Cite from the book: "Fuzzy Multilevel Graph Embedding (FMGE) is a straightforward, simple, and computational efficient solution for facilitating the use of graph-based powerful representations together with learning and computational strengths of state-of-the-art machine learning, classification, and clustering."

Kasper98-png commented 1 year ago

A Tutorial on Network Embeddings (Paper): https://arxiv.org/pdf/1808.02590.pdf

Gives a description of graph embedding. Provides eight methods for embedding and the embedding learning method within each method (page 8).

Kasper98-png commented 1 year ago

REDIRECTION: Generating drug repurposing hypotheses using link prediction with DISNET data (July 27, 2022): https://www.biorxiv.org/content/10.1101/2022.07.26.501105v1.full

They use two types of nodes: phenotypes (diseases and symptoms) and drugs. They use a two-layer convolutional approach, and gets ROC=0.93. They gives ideas on how to improve the model, such as use other types of nodes such as genes, proteins, metabolic pathways, drug-drug interactions, and so on.

rtni20 commented 1 year ago

This paper wants to achieve the same goal as us. https://www.frontiersin.org/articles/10.3389/fgene.2021.657182/full#B3

Instead of using Graph Embedding, they use graph convolutional neural network to pre-train and generate the graph. Then they use node2vec as network representation and Random forest classifier to obtain predictions.

I also think we could use this paper as some kind of inspiration for structure of the project.

jaxels20 commented 1 year ago

Getting an overview of the architecture of a gnn: https://theaisummer.com/gnn-architectures/

rtni20 commented 1 year ago

This Article describes embeddings very well and explain different methods for Vetex embedding. Vertex embedding can be divided into factorization approaches, random walk approaches, and deep approaches.

https://towardsdatascience.com/graph-embeddings-the-summary-cc6075aba007

rtni20 commented 1 year ago

This article explains Graph Neural Networks (GNN). Furthermore it explains how Graph Convolutional Networks (GCN) works as building blocks for the Graph Autoencoder (GAE) architecture. It also explores and compare the GAE to the Variational Graph Autoencoder (VGAE). Overall, the article is good as it shows how this is implemented in Python and how to evaluate the method(s).

https://medium.com/stanford-cs224w/disease-gene-interactions-with-graph-neural-networks-and-graph-autoencoders-bfae2c28f5ec

rtni20 commented 1 year ago

A simple explanation of GNN: https://theaisummer.com/Graph_Neural_Networks/

Kasper98-png commented 1 year ago

Link prediction on graph streams. http://www.charuaggarwal.net/ICDE16_research_307.pdf

rtni20 commented 1 year ago

Resume of Thesis: Robustness of Drug-Disease-Association Network Embeddings - Tenzen Rabgang

Two tasks are performed:

  1. Neighborhood similarity comparison
  2. Link prediction (in order to forecast possible associations)

Two experiments are performed:

  1. Investigation of the evolution of the DDA Network
  2. Investigate how the addition of noise affects the results and the robustness of the embeddings.

Graph Embeddings: https://academic.oup.com/bioinformatics/article/36/4/1241/5581350 (Paper) use several biomedical networks and apply 11 different embedding methods on them. Their research paper is especially important for the thesis as they also use the DDA data set and run different embedding methods on it. It is highly used to determine which embedding methods to test.

NDF-RT: They only use the relation May_treat - "The relations within and between the entities are very similar in their meaning; hence, we can presume a connection between them such as e.g., may treat pairs overlap with may prevent, or may diagnose pairs."

They find it easier to work with the data when converting from XML to OWL format, so they convert all 82 versions.

When they start from the premise of a directed drug-disease graph, drug nodes have on average approximately four outgoing links while disease nodes hold around 50 incoming links.

They leave out the MED-RT due to time limitations of making OWL representations.

Embedding Approach and Methods: They select 17 DDA Versions (every 5th) and create embeddings with dimension size 100 for 50 runs. The embedding dimension in graph analysis is typically determined by the user or chosen using heuristics, and it determines the size of the embedding space for the graph. Running the algorithm 50 times likely refers to performing the embedding process multiple times with different random initializations. Running the algorithm multiple times with different random seeds can help to generate more diverse embeddings and reduce the risk of getting stuck in a local optimum. When an embedding algorithm gets stuck in a local optimum, it means that it has converged to a suboptimal solution that may not accurately represent the underlying data.

For the additional experiment, they choose one version (2014.06.02) as the ground truth and select two versions (2009.01.03 and 2018.01.02) as synthetic ones with different noise levels. To compare them, they fist remove all nodes that do not appear in the ground-truth version. The same applies for the synthetic versions, where they remove nodes that exist in the ground-truth but not in the synthetic versions. In the end, the ground-truth and the synthetic versions comprise the same set of nodes but with different edges. We generate ten embeddings of the ground-truth and the synthetic versions.

They choose three different embedding methods:

  1. GraRep (Matrix Factorization method)
  2. LINE (Neural Network method)
  3. BiNE (Random Walk method) - the reason they did not choose the struc2vec method is because of computaionally expensiveness.

THE FOLLOWING IS ONLY FOCUSSING ON EVOLUTION-ORRIENTED PERSPECTIVE

Visual Analysis: This step is usually performed to examine how well the embeddings can describe the underlying data. They use PCA to map the embeddings of drugs and diseases into 2D space.

Here they select three versions (2009.01.13, 2014.01.06, and 2018.01.02) of the same seed and compare their representations.

  1. GraRep - The explained variance ratio of the principal components is in total 13%, which means that PCA is unable to retain most of the embedding information on two dimensions.
  2. LiNE - In addition, the explained variance ratio of the principal components adds up to 15%, which is still very low.
  3. The explained variance ratio makes up less than 2% so that the drug points and disease points contain only little information about the initial embeddings.

The concept of Explained variance is useful in assessing how important each component is. In general, the larger the variance explained by a principal component, the more important that component is.

Local Neighborhood Similarity: They measure the distance between these vectors with common metrics such as the cosine or euclidean distance. They use the Local Neighborhood similarity to compare embeddings.

The comparison is performed on embeddings of the same version, comparing each one with the remaining embeddings from different runs (10 out of 50 randomly selected runs). They calculate the euclidean and cosine distances between the embeddings of the same version with different seeds. For both distance measures, they obtain a distance matrix for each version instance and set a threshold for the neighborhood such that only embeddings with a distance of less than r are included. The parameter r is defined by the percentiles of each distance distribution. They use the following percentile values: radius = [0:05; 0:1; 1; 10; 20].

With two sets of neighborhoods for each entity, they use the Jaccard index and the overlap coeficient to evaluate their similarity.

They use Welch's t-test and compare the mean of the Jaccard indexes from each version to every other version. As they are running multiple hypothesis tests, they use the Bonferroni method to correct the p values.

Link Prediction: The goal of the prediction model is to differentiate between positive and negative links, which requires training data of both types. However, links are only predicted for known nodes with respectively generated embeddings. Also, while they have access to validated positive samples, in most cases, validated negative samples are unfortunately not provided. Those are often randomly selected from not-linked edges, which enhances the risk of adding unnecessary noise, consequently acting counterproductive.

They extend the link prediction from BioNEV with logical restrictions by

  1. allowing only drug-disease links for negative samples
  2. providing a reduced selection of reliablennegative edges. The second restriction is inspired by Wu and Liu (Paper) - they introduce a method that is called Reliable Negative Samples Selection. Their algorithm is based on the widely used assumption that similar drugs may treat similar diseases. The idea is simple: nodes with an (indirect) relation to each other are not included in the selection.

They report the link prediction performance across the evolution where column-wise:

  1. no reliable negative samples exist
  2. 3-step neighbors are excluded from the negative sample selection
  3. 7-step neighbors are excluded

The Paper does not tell how the embedding is applied as the earlier sections did. The figure shows 9 plots: 3 for each embedding method where 0, 3 and 7 step algorithm is performed. There exists 17 box plots on each figure which may represent each version of the NDF-RT, and where AUC-ROC is present on the y-axis. We can probably assume that they are resusing the same embeddings from the Local neighborhood investigation.

They apply Welch's t-test to examine if the mean of the AUC ROC values across versions differs significantly.

jaxels20 commented 1 year ago

This is a Python code defining a model for a recommendation system using a Graph Neural Network (GNN) with heterogeneous nodes and edges. Here's an overview of the code:

The GNN class defines a two-layer GNN, where each layer is a SAGEConv module. The input to the GNN is a feature matrix x and an edge index edge_index that specifies the connectivity between nodes in the graph. The output of the GNN is a new feature matrix that captures the node representations learned by the GNN.

The Classifier class defines a binary classifier that takes as input the node embeddings for a user and a movie, and predicts whether there is an edge between them in the graph. The prediction is based on the dot-product between the two embeddings.

The Model class defines the complete recommendation system model. It consists of three main parts:

a. Two embedding layers for users and movies. These layers are learned during training and map each user and movie node to a vector of hidden_channels dimensions.

b. A GNN module that takes as input the user and movie embeddings and the graph structure, and outputs updated node embeddings for each type of node in the graph.

c. A Classifier module that takes as input the updated user and movie embeddings and makes edge-level predictions based on their dot-product.

The forward method of the Model class takes as input a HeteroData object that contains the graph structure and node features for each node type. It first maps the node IDs to embeddings using the embedding layers, and then passes the embeddings and edge indices to the GNN module to compute updated node embeddings. Finally, it passes the updated node embeddings to the Classifier module to make edge-level predictions.

Overall, this code defines a recommendation system model that can handle graphs with heterogeneous nodes and edges, and uses a GNN to learn node representations that capture the interactions between users and movies in the graph.

pernisch commented 1 year ago

Here are some paper that could be relevant and/or interesting to have a lookt at, some of them fairly new. "Computational Drug Reporposing" seems to be the keyword here:

jaxels20 commented 1 year ago

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03682-4

rtni20 commented 1 year ago

Resume: A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information

This Paper aims to develop a computational pipeline called DTINet. DTI stands for drug-target interactions. The DTINet pipeline is developed in order to predict novel drug-target interactions. Drug-target interactions refer to the interactions between a drug molecule and its target molecule in the body, which can include proteins, enzymes, receptors, or other biomolecules.

The main reason why computational prediction of DTI's has become interesting and important in drug-discovery or drug repurposing is the “guilt-by-association” assumption, which states that similar drugs may share similar targets and vice versa. Thereby DTI prediction is also often formulated as a binary classification task. The therapeutic effect of drugs on diseases may reflect that they have binding activities to the targets, and thereby provide new insights in drug repurposing.

Several computational strategies has been presented in order to integrate heterogeneous data sources to predict DTIs, but the methods generally fails due to lack of expert knowledge and handling bias & noise. The DTINet overcome these problems by learning low-dimensional and informative vector representations of features for both drugs and proteins and integrates information from heterogeneous data sources.

DTINet integrates the information from heterogeneous network by combing Random Walk with restart (RWR) and diffusion component analysis (DCA).

RWR is used to calculate the relative importance of nodes in a graph, with a focus on identifying "seed" nodes that are most closely related to a given query node. RWR is used for Network Difussion which is a process where information are spread through a network, based on the topology of the network and the properties of its nodes and edges.

The DCA algorithm first constructs a graph using the data points and their similarities. It then performs a random walk on the graph, starting from a query node, and calculates the stationary probability distribution of the walker. The resulting probability distribution reflects the importance of each data point in the data space. The algorithm then performs a singular value decomposition (SVD) on the diffusion kernel, which transforms the original high-dimensional dataset into a lower-dimensional space, which can be used for classification tasks. The DCA is algorithm is used for obtaining low-dimensional vector representation.

Drugs with similar topological properties in the heterogeneous network are more likely to be functionally correlated.

They compare DTINet with four state-of-the-art methods for DTI prediction, including BLMNII, NetLapRLS, HNM and CMF. The comparative results shows that DTINet outperforms other existing methods, with 5.9% higher AUROC and 5.7% higher AUPR than the second best method. The originally collected data sets may contain homologous proteins or similar drugs, which is a potential concern. The good predictions might result from easy predictions. The conclusion however is that DTINet outperforms other methods, even without similar drugs and targets.

Some stuff for our discussion: "Although the drug–disease relationships may provide more direct indications of existing drugs, knowing the explicit DTIs can shed light on the underlying pharmacological mechanisms, which are important for understanding both therapeutic and adverse effects of the corresponding drugs."

pernisch commented 1 year ago

Here are the three papers that I mentioned yesterday in the meeting. They are all concerned with comparing predictions in a more meaningful manner than just performance. The short answer for getting a number for this comparison is a annotator agreement measure like Cohen's Kappa - this is meant for comparing two predictions (not to a baseline) to understand how these are different from each other, when they have a similar performance.

jaxels20 commented 1 year ago

http://shichuan.org/hin/time/2018.CIKM%202018%20Are%20Meta-Paths%20Necessary_%20Revisiting%20Heterogeneous%20Graph%20Embeddings.pdf

This paper tries to investigate the if meta paths really are necessary for heterogeneous graph embedding, Meta paths are often used to to do random walks on H-graphs, but this paper argues that it creates more problems than it solves in practice. They argue that meta paths often rely on expert information which in practice can be infeasible. They a proposing the JUST algorithm for embedding nodes with random walks over H-graph. JUST is a heterogeneous graph embedding technique where random walks are performed with JUmp and STay strategies. They conclude that JUST performs better than state of ther art H-graph embedding methods (in particular the hin2vec) and they also observe a faster (3x) embedding

Kasper98-png commented 1 year ago

"Drug repositioning based on comprehensive similarity measures and Bi-Random walk algorithm" https://pubmed.ncbi.nlm.nih.gov/27153662/

"Generally speaking, the overall prediction procedure of MBiRW consists of three steps described as follows. Based on the collected dataset, comprehensive similarity measures are firstly implemented to measure similarity for drugs and diseases. Then, a heterogeneous network consisting of drug similarity network, disease similarity network and drug–disease interactions, is constructed. Last, based on the heterogeneous network, BiRW is implemented to rank candidate diseases for drugs."

The method is based on the assumption that similar drugs often indicate similar diseases.

"For this dataset, there are 1933 known drug–disease associations involving 593 drugs registered in DrugBank and 313 diseases listed in the Online Mendelian Inheritance in Man (OMIM)"

Drug similarity measures are computed based on the chemical structure of drugs. They use the "Chemical Development Kit" to calculate the similarity measures between two drugs as their Tanimoto score of their 2D structure. Known drug-disease associations are incorporated in the similarity scores, such that similarity measures lower than a threshold is almost ignored using the logistic function (transform low similarities into some values close to 0). Threshold: "The range of value of $Sim_{r^p}$, which is $[0,1]$, is divided into ten equal subranges. For each subrange, the percentage of drug pairs sharing common diseases is calculated." Do this 10 times using Fisher-Yates shuffle to get randomized drug-similarity measures, and the average is the threshold. They create a weighted network with drugs where edges denote the number of common shared diseases. They cluster the drugs using ClusterONE and cluster statistics is used to calculate the final drug similarity measure of a drug pair. The same process is used for disease similarity, but the initial similarity measures are based similarity between MeSH terms appearing in medical descriptions.

They create a drug and disease similarity network, where the weights on edges between drugs are the comprehensive drug similarity measures and weights on edges between diseases are the comprehensive disease similarity measures. Weights on edges from drug to diesases is initially set to 1 if there exist a known association, and 0 otherwise.

They use a bi-directional random walk (BiRW) simultaneously on the two similarity graphs which outputs a score which represents the probability of an association between a drug-disease pair. BiRW performs a left walk in the drug network and a right walk in the disease network. Each walk returns a value. "The value of element $left_RD_t(i,j)$ and $right_RD_t(i,j)$ denotes the probability that drug $r_i$ associates with disease $d_j$."

They use 10-fold cross validation. AUC=0.917

rtni20 commented 1 year ago

If we should compare improve State-of-art methods, we could potentially look at the following papers, which is applying different embedding methods and link-prediction methods: https://www.frontiersin.org/articles/10.3389/fgene.2021.657182/full#B3 https://www.merlin.uzh.ch/publication/show/22316 https://deepai.org/publication/graph-embedding-on-biomedical-networks-methods-applications-and-evaluations https://www.mdpi.com/2218-273X/12/10/1497

We have already discussed all the papers before. We can potentially combine different embedding methods and link prediction methods, and see if different combinations of already approved methods, can be improved.

Kasper98-png commented 1 year ago

Layer attention graph convolutional network, here it is used in miRNA association predictions: https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-022-01807-8

rtni20 commented 1 year ago

Examples of papers using GCN of embeddings

A deep learning framework for high-throughput mechanism-driven phenotype compound screening and its application to COVID-19 drug repurposing

https://www.nature.com/articles/s42256-020-00285-9

Goal Proposition of a mechanism-driven neural network-based method called DeepCE in order to model chemical substructure–gene and gene–gene associations. To demonstrate the value of DeepCE, they apply it to drug repurposing of COVID-19.

Overall architecture of DeepCE The architecture is divided into three parts:

  1. The feature transformation component that uses GCN to generate features for chemical compound, pre-trained information to represent L1000 genes, and feed-forward neural network to generate features for cell and dosage.
  2. The interaction network that learns high-level feature associations (which conists of two layers, where the two layers shares architectures). The components included in both layers are: multihead-attention, concat, sum, normalization and feed-forward.
  3. The prediction network that predicts gene expression profiles from high-level features. The multi-output prediction is a two-layer feed-forward neural network with a rectified linear unit (ReLU) activation function—takes input as the concatenation of the chemical neural fingerprint, the gene feature generated by the interaction component, the cell line and chemical dosage features, to predict gene expression values.

Feed-forward refers to the flow of information through the network in a single direction, from the input layer to the output layer. Multihead-head attention is used in Neural Network to learn contextual relationships between words in a sequence.

Experiments They present several baseline models including linear models, vanilla neural network and KNN, which they will compare performance measures to.

Result

HGDTI: predicting drug–target interaction by using information aggregation based on heterogeneous graph neural network

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04655-5

Goal The goal is to devlop a heterogeneous graph neural network model, named as HGDTI, which includes a learning phase of network node embedding and a training phase of Drug Target Interaction (DTI) classifcation.

Methods and Architecture of HGDTI In this work, the dataset is a heterogeneous graph composed of various nodes and edges. They include drug, protein, disease and side-effect as node types. Furthermore, they include interaction, similarity and association as edge types. node_type = v, edge_type = r.

HGDTI consists of the following four main steps:

  1. Node features encoding. In this step, they defne a submodule based on bi-directional LSTM (Bi-LSTM) to capture “deep” feature interactions and obtain more abstract nonlinear expressions. They use Bi-LSTM to extract the general content embedding of v.
  2. Homogeneous neighbors aggregation. In this step, they design a submodule that aggregates heterogeneous adjacent node features. r-type aggregated embedding for node v is summed by same type neighbors feature to multiply by ratio which is the normalized weight with respect to edges of type r.
  3. Heterogeneous neighbors aggregation. Taking into account that heterogeneous nodes have diferent degrees of impact on the final embeddings, they employ the attention mechanism to incorporate the aggregated embedding with the initial feature of node v.
  4. Predictor training process. This process is a deep neural network classifer, which is used to predict DTIs by training the node embedding to obtain a 0-1 threshold. The Neural network is formed by two layers. FC1 and FC2 form a two-layer fully connected neural network that performs a linear transformation on embeddings.

Experiments They compare the performance of HGDTI with five predictive models, including NeoDTI, DTINet, MSCMF, NetLapRLS and BLMNII.

Results

GCMM: graph convolution network based on multimodal attention mechanism for drug repurposing

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-022-04911-8

Goal This paper aims to introduce a novel end-to-end model called Graph convolution network based on a multimodal attention mechanism (GCMM). A Graph Convolution Network encoder is used to learn how diseases and drugs are embedded in various perspectives. It aims to incorporate the following:

All into a heterogeneous network.

Methods and Architecture of GCMM In this paper, the problem of drug–disease prediction is treated as the recommendation task from a HN with drugs and diseases as nodes, and interactions or relationships as edges.

The Architecture of GCMM can be splitted up into five parts:

  1. Construction of Heterogenous Network - Here the hetrogenous network is build between the different nodes and edge types mentioned in the goal section.
  2. 2-layers multi-view GCN encoder - In GCMM, a multi-view GCN encoder on four similarity networks is used to learn drug and disease low-dimensional representations, with Relu as activation function.
  3. Multidimensional based attention mechanism - The more parameters a model has, the more expressive. But it can lead to information overload. The issue of information overload can be resolved, and the efectiveness and accuracy of task processing can be enhanced, by introducing attention mechanisms to focus on the information that is more important to the current task, reduce attention to other information, and flter out irrelevant information.
  4. Fully Connected feature extractor - In this case, it is utilized to integrate multiple view information and generate final embedding.
  5. Matrix complete decoder - The learned drug and disease embeddings from the encoder are input into the matrix completion module, and the preference prediction problem is treated as a recommendation task.

Experiments Known drug–disease association pairs are taken as the positive samples and other pairs as negative instances. Due to the low density of the dataset, 5 fold cross validation is used to evaluate the prediction performance on all positive samples and randomly selected negative instances of the same size. Furthermore, four deep-learning models, including DeepDR, NeoDTI, LAGCN, and NIMGCN are chosen as baseline approaches in order to demonstrate the superiority of GCMM’s performance. They evaluate on the following metrics: Auc, Aupr, F1, ACC, Recall, Precision and Specifcity.

Results The average AUC score is about 0.90, and the average AUPR score is about 0.91. In general, the GCMM has the best performance according to all metrics and in all positive-negative sample ratioes, with some few exceptions.

Furtheremore, they perform some parameter tuning, and test which is the best. The parameters they test are:

deepDR: a network-based deep learning approach to in silico drug repositioning

https://academic.oup.com/bioinformatics/article/35/24/5191/5497253

Goal This paper aims to develop network-based deep-learning approach, termed deepDR, for in silico drug repurposing by integrating 10 networks: one drug–disease, one drug-side-effect, one drug–target and seven drug–drug networks.