jaxels20 / link-predection-on-dda


Understanding graph neural networks #3

Closed Kasper98-png closed 1 year ago

Kasper98-png commented 1 year ago

What is a graph neural network?

A graph neural network (GNN) is a class of neural networks that takes a graph (nodes and edges) and feature vectors representing the nodes and edges as input, and uses layers to generate embeddings. Most GNNs use message passing to capture information from local neighborhoods and propagate it through the graph to generate the node embeddings. Possible approach for link prediction: one GNN can be used for generating the node embeddings, which can then be used in another GNN that acts as a classifier. Specifically for a graph convolutional network: "GCNs achieve this by using a message passing scheme to aggregate information from a node's neighbors. Specifically, the node features are first transformed using a weight matrix, and then aggregated with the transformed features of the node's neighbors. This information is then passed through a non-linear activation function and used to update the node embeddings." (ChatGPT)
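A minimal sketch of the two-stage idea for link prediction (assuming PyTorch and torch_geometric; the feature sizes, node counts and helper names are illustrative, not taken from this project): a GCN encoder produces node embeddings, and a simple dot-product decoder scores candidate edges.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNEncoder(torch.nn.Module):
    """Two GCN layers that turn raw node features into node embeddings."""
    def __init__(self, in_channels, hidden_channels, out_channels):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))   # first message pass
        return self.conv2(x, edge_index)        # second message pass -> embeddings

def decode(z, edge_label_index):
    """Score candidate edges with a dot product between the two node embeddings."""
    src, dst = edge_label_index
    return (z[src] * z[dst]).sum(dim=-1)        # higher score = more likely link

# Illustrative usage with random data: 4 nodes with 10-dimensional features.
x = torch.randn(4, 10)
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])   # existing edges
candidate_edges = torch.tensor([[0, 2], [3, 1]])    # node pairs to score
z = GCNEncoder(10, 16, 8)(x, edge_index)
logits = decode(z, candidate_edges)                 # pass through sigmoid for probabilities
```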

What is message passing?

Message passing is where the network aggregates the feature vectors of all neighbors of a given node. This can be thought of as feature smoothing, where each node's feature vector becomes an aggregate of its neighbors' feature vectors. This way the information about the neighborhood is captured. Each layer in a graph neural network corresponds to one message pass between all nodes and their neighbors. Most GNNs use some form of message passing, e.g. graph convolutional networks.
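A toy sketch of a single message-passing step with mean aggregation (plain PyTorch; the adjacency matrix and feature sizes are made up). Each node's new representation is the average of its neighbors' feature vectors, which is the "feature smoothing" described above.

```python
import torch

def message_pass_mean(x, adj):
    """One message pass: average each node's neighbors' features (x: [N, F], adj: [N, N])."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)   # avoid division by zero for isolated nodes
    return (adj @ x) / deg                            # row i = mean of node i's neighbor features

x = torch.randn(4, 3)                                 # 4 nodes, 3 features each
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 0.],
                    [0., 1., 0., 1.],
                    [0., 0., 1., 0.]])
smoothed = message_pass_mean(x, adj)                  # neighborhood information after one "layer"
```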

How do layers work and how many layers are typically used?

A layer corresponds to a message pass, and each layer generates a hidden representation that is used by the next layer. A layer aggregates the feature vectors of the neighbors of a given node, multiplies the aggregate by a weight matrix and applies an activation function (i.e. passes the aggregation through a dense neural network layer). The output is the new representation of the given node. A layer has an input size and an output size. For example, if the feature vectors have size 10, the first layer has input size 10. That layer can have output size 5, which is then the input size of the next layer. 2-3 layers are shown to achieve the best performance on classification problems (https://arxiv.org/pdf/1609.02907.pdf). There is a risk of overfitting when using many layers, because the number of parameters increases with the model depth (number of layers). In the paper, for testing the model depth, they use the following setup on each cross-validation split: 400 epochs (without early stopping) using the Adam optimizer with a learning rate of 0.01. Other hyperparameters are chosen as follows: 0.5 (dropout rate, first and last layer), $5\cdot 10^{-4}$ (L2 regularization, first layer) and 16 (number of units for each hidden layer).
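As an illustration of layer sizes and the hyperparameters quoted above (a sketch assuming torch_geometric; the 10-dimensional input and 5 output classes are made up, and the L2 penalty is applied to all parameters here for simplicity, whereas the paper applies it to the first layer only):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class TwoLayerGCN(torch.nn.Module):
    def __init__(self, in_channels=10, hidden_channels=16, num_classes=5):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)   # input size 10 -> output size 16
        self.conv2 = GCNConv(hidden_channels, num_classes)   # input size 16 -> output size 5

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.5, training=self.training)      # dropout on the first layer's input
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)      # dropout before the last layer
        return self.conv2(x, edge_index)

model = TwoLayerGCN()
# Adam with learning rate 0.01; weight_decay adds the 5e-4 L2 penalty.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)
```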

Does different types of layers exist and which is the most popular?

Many different types of layers/message passing mechanisms exist. For example, SAGEConv in torch_geometric (an implementation of GraphSAGE, a general, inductive framework), which is a convolutional layer where, "instead of training individual embeddings for each node, we learn a function that generates embeddings by sampling and aggregating features from a node's local neighborhood" (https://arxiv.org/abs/1706.02216). The most popular types of layers to use in GNNs are graph convolutional layers and graph attention layers (GAT). "Graph attention layers dynamically weigh the importance of different nodes in the graph. GATs use self-attention to calculate attention coefficients that determine how much information to propagate from each neighbor node to the current node." (ChatGPT)
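A small sketch of how the different layer types can be swapped in (assuming torch_geometric; channel sizes are illustrative). SAGEConv takes an aggregator argument (e.g. mean), and GATConv learns attention coefficients over the neighbors.

```python
import torch
from torch_geometric.nn import SAGEConv, GATConv

x = torch.randn(4, 10)                             # 4 nodes, 10 features each
edge_index = torch.tensor([[0, 1, 2], [1, 2, 3]])

sage = SAGEConv(10, 16, aggr='mean')               # GraphSAGE-style layer with mean aggregation
gat = GATConv(10, 16, heads=2, concat=True)        # attention layer with 2 heads -> 32-dim output

h_sage = sage(x, edge_index)   # shape [4, 16]
h_gat = gat(x, edge_index)     # shape [4, 32] (head outputs concatenated)
```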

How does input/output size affect the network?

The bigger the input size, the more information the layer can capture; however, it also increases the computational complexity. ChatGPT says that the size of the input and output of each convolutional layer should be the same as the size of the initial input feature vector, but I cannot find support for that anywhere.

Which aggregators are mostly used?

To the best of my knowledge: mean.

What is the most used activation function?

To the best of my knowledge: ReLU is the most common ($f(x)=\max(0,x)$). It can suffer from "dying ReLU", where neurons "die" and stop learning. Leaky ReLU attempts to solve this issue ($f(x) = x$ if $x > 0$ and $f(x) = \alpha x$ if $x \le 0$, where $\alpha$ is a small positive constant).
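A small sketch of the two activations in PyTorch (the negative slope value of 0.01 is just an example):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])
F.relu(x)                              # tensor([0.0, 0.0, 0.0, 1.5])   -- negatives clipped to zero
F.leaky_relu(x, negative_slope=0.01)   # tensor([-0.02, -0.005, 0.0, 1.5]) -- small slope for x <= 0
```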

I have seen another paper that uses PyTorch, and they also use Adam as the optimizer: (https://www.sciencedirect.com/science/article/pii/S1532046422001496?casa_token=LkCGDLzpTc0AAAAA:--M5N6O2sYPbn06q2u2-5cthEFnzGufplVMrKENQRdyhhxAvuA_b9kfsZBYMkl48sZ0UqEVUb_Y)

jaxels20 commented 1 year ago

Things to investigate:

GNN

Pooling:

Chat: "The main purpose of a pooling layer is to reduce the spatial size (dimensionality) of the input while retaining its most important features. In a GNN, a pooling layer can be used to aggregate node features across the graph. For example, global pooling aggregates node features into a single feature vector by taking the maximum, minimum, or average value of the node features. This operation helps to summarize the information from the entire graph and can be useful for graph classification or node-level prediction tasks." Helps to prevent overfitting and improve computational efficiency. Use carefully, it can cause information loss, especially when used excesssively. Mostly used to graph classification or when working with large graphs that are computationally expensive to process. A pooling layer can be seen as a combination of three functions: selection, reduction, connection (SRC). "With selection, the operator computes K subsets of nodes, each associated with one node of the output $G'$; we refer to them as supernodes. With reduction, the operator aggregates the node attributes in each supernode to obtain the node attributes of $G'$ . Finally, the connection step computes edges among the $K$ reduced nodes. " (https://arxiv.org/pdf/2110.05292.pdf)

Dropout:

Chat: "Dropout is a regularization technique commonly used in GNNs to prevent overfitting during training. It involves randomly dropping out (i.e., setting to zero) a fraction of the nodes or edges in the graph during each forward pass of the network. Dropout is applied independently to each node or edge with a probability specified by the user. ... By randomly dropping out nodes or edges during training, the network is forced to rely on a more diverse set of features, which can lead to better generalization performance on unseen data. ... It is also important to note that dropout should only be applied during training and turned off during inference, as the goal during inference is to use the full power of the network to make accurate predictions." Dropout can be used in both as a seperate layer or as a function within each layer. For example, use the Dropout class as a layer to dropout neurons or use the dropout parameter in GATConv to drop out attention coefficients. Dropout as a seperate layer is common practice (chat).

Attention mechanisms:

Chat: "An attention mechanism in Graph Neural Networks (GNNs) is a mechanism that allows the network to selectively focus on the most relevant nodes or edges in the graph when computing node or edge representations." Traditional GNNs treat each node equally in the message passing, without considering their relative importance or relevance for the task at hand. "An attention mechanism in GNNs addresses this limitation by allowing the network to learn an attention weight for each neighboring node or edge, which indicates its importance or relevance. These attention weights are typically learned through a trainable function that takes as input the features of the target node or edge and the neighboring nodes or edges." The attention weights are learned. The importance of each neighbor is calculated using three or steps: make a linear transformation of the two nodes, by concatting them and multiplying the vector with a weight matrix. Apply activation function. Normalize the result to be able to compare the importance of each neighbor. This can be done using different headers, which is replicating the three steps multiple times, and aggregating the result of the headers (https://towardsdatascience.com/graph-attention-networks-in-python-975736ac5c0c).

Overfitting:

The network can be trained to perform perfectly on the training data, but perform poorly on new, unseen data. You can detect overfitting by inspecting the performance measures on the validation set; bad performance on the validation set may indicate overfitting. Furthermore, large magnitude and high variance in the model weights may indicate overfitting: if the model weights are very large or have high variance, the model might not generalize to unseen data. This can be detected by monitoring the magnitude of the weight values during training. Several techniques exist to prevent it, including regularization methods such as dropout and L2 regularization, early stopping, increasing the size of the dataset, and reducing the complexity of the model.

NN

Weight-decay:

Chat: "Weight decay is a regularization technique used in neural networks to prevent overfitting. It is also sometimes referred to as L2 regularization or ridge regression. In weight decay, an additional term is added to the loss function of the neural network, which penalizes large weight values. This additional term encourages the weights to be small, which helps to prevent the model from fitting the training data too closely and overfitting. The weight decay term is typically proportional to the L2 norm of the weight vector, which is calculated by summing the squares of all the weights in the network. The regularization strength is controlled by a hyperparameter, often denoted by $\lambda$ or $\alpha$. During training, the loss function is optimized using backpropagation, which involves calculating the gradients of the loss function with respect to the weights of the network. The weight decay term is included in the calculation of these gradients, which causes the weights to be updated in a way that both reduces the training loss and keeps the weights small."

Early stopping

Early stopping is a technique used in neural networks to prevent overfitting and improve generalization performance. The idea behind early stopping is to monitor the performance of the model on a validation set during training, and stop the training process when the performance on the validation set stops improving or starts to degrade.

During training, the neural network is typically trained for a fixed number of epochs or until the validation loss reaches a minimum. However, if the network is trained for too long, it may start to overfit the training data, which means it becomes very good at predicting the training data but performs poorly on new, unseen data. This is undesirable as the ultimate goal of the neural network is to be able to generalize well to new data.

By monitoring the validation loss during training, early stopping can help determine the optimal point to stop training the network, before it starts to overfit the training data. This is done by saving the model parameters at each epoch, and selecting the model with the lowest validation loss as the final model.

The early stopping technique can be implemented in different ways, but the most common approach is to use a separate validation set from the training set, where a portion of the training data is set aside for validation. The training process is then stopped when the validation loss stops decreasing for a specified number of epochs, or when it starts to increase.
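A sketch of the patience-based variant described above (PyTorch-style; train_one_epoch and validation_loss are hypothetical helpers standing in for the actual training and validation code):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=400, patience=10):
    """Stop when the validation loss has not improved for `patience` epochs."""
    best_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)
        val_loss = validation_loss(model)
        if val_loss < best_loss:
            best_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())   # keep the best model so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                        # validation loss stopped improving

    model.load_state_dict(best_state)                        # return the best model, not the last one
    return model
```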

Batch normalization

Batch normalization is a technique used in neural networks to improve the training stability and speed of convergence. It involves normalizing the activations of each layer in the network by subtracting the batch mean and dividing by the batch standard deviation. The normalization is applied to each mini-batch of training samples, hence the name "batch normalization".

The benefits of batch normalization are as follows:

Accelerating training: By normalizing the input to each activation function, batch normalization reduces the covariate shift problem, which helps to improve the training speed of the neural network.

Regularizing the model: Batch normalization also acts as a form of regularization by adding noise to the network. This noise helps to reduce overfitting and improves the generalization ability of the model.

Reducing the sensitivity to initialization: Batch normalization helps to reduce the sensitivity of the neural network to the initialization of the weights, which can sometimes cause training difficulties.

Allowing for higher learning rates: Batch normalization allows for the use of higher learning rates, which can speed up training and improve the final performance of the model.

Batch normalization is typically applied after the activation function and before the next layer in the neural network. It can be applied to convolutional layers, fully connected layers, and recurrent layers in the network.
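A sketch of batch normalization between layers in a small network (plain PyTorch; sizes are illustrative). Whether it goes before or after the activation varies between implementations; here it follows the activation, as described above.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(10, 16),
    torch.nn.ReLU(),
    torch.nn.BatchNorm1d(16),   # normalizes each feature over the mini-batch
    torch.nn.Linear(16, 4),
)

batch = torch.randn(32, 10)     # mini-batch of 32 samples
out = model(batch)              # uses this batch's mean and std during training
```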

Residual connections

Residual connections, also known as skip connections, are a technique used in neural networks to address the vanishing gradient problem that can occur when training very deep networks. The vanishing gradient problem occurs when gradients become very small as they are propagated back through the layers of the network during training, which can make it difficult to update the parameters of early layers in the network.

Residual connections solve this problem by introducing a shortcut connection that allows the input of a layer to bypass one or more layers and be added directly to the output of a later layer. This shortcut connection creates a residual block, which allows the network to learn residual functions, that is, the difference between the input and output of a block.

The addition of the residual connection enables the network to preserve information from earlier layers, making it easier for the network to learn and propagate gradients back through the network. This allows for deeper networks to be trained without encountering the vanishing gradient problem.

The residual connections can be implemented in different ways, such as using identity mappings or projection mappings. In identity mappings, the input is added directly to the output of the layer, while in projection mappings, the input is first transformed by a linear layer before being added to the output of the layer.
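A sketch of a residual block covering both mappings (plain PyTorch; sizes are illustrative): an identity mapping when input and output sizes match, and a linear projection when they do not.

```python
import torch

class ResidualBlock(torch.nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.layer = torch.nn.Linear(in_features, out_features)
        # Identity mapping if the sizes match, otherwise project the input first.
        if in_features == out_features:
            self.shortcut = torch.nn.Identity()
        else:
            self.shortcut = torch.nn.Linear(in_features, out_features)

    def forward(self, x):
        # The block learns the residual; the shortcut carries the input forward.
        return torch.relu(self.layer(x)) + self.shortcut(x)

x = torch.randn(32, 10)
out = ResidualBlock(10, 16)(x)   # projection mapping, since 10 != 16
```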

Residual connections have been shown to be effective in a variety of deep learning architectures, including convolutional neural networks (CNNs), residual networks (ResNets), and long short-term memory (LSTM) networks, among others.

LSTM

Does not really seem relevant here, but: LSTM (Long Short-Term Memory) is a type of recurrent neural network architecture that is particularly useful for processing sequential data, such as time series, speech, and natural language. LSTM networks were first introduced by Hochreiter and Schmidhuber in 1997, and since then, they have become widely used in a variety of applications.

The main advantage of LSTM networks over traditional recurrent neural networks (RNNs) is their ability to learn and remember long-term dependencies in sequential data. LSTMs achieve this by introducing a set of memory cells that are able to store information over multiple time steps, as well as three types of gates (input, output, and forget) that regulate the flow of information into and out of the memory cells.