jaxels20 / link-predection-on-dda

MIT License

Chat #4

Closed jaxels20 closed 1 year ago

jaxels20 commented 1 year ago

When dividing a graph into train-test-validation sets for link prediction, there are a few considerations to keep in mind:

Preserve the graph structure: The validation and test edges must be removed from the graph the model is trained on, so that the model never sees them during training. Note that standard link prediction is a transductive task: all nodes are visible during training, but the held-out edges are hidden, and the model must predict those missing links between nodes it has already seen.

Balanced distribution: The train-test-validation split should ensure a balanced distribution of positive and negative edges in each set. Because a graph only records positive edges, negative examples (node pairs with no edge) must be sampled explicitly, and the ratio of positive to negative edges should be roughly the same in each set to prevent bias in the model.
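One way to sketch such negative sampling in plain Python (the function name and signature are illustrative, not from this repo; a real pipeline would typically use a library routine such as the negative samplers in NetworkX or PyTorch Geometric):

```python
import random

def sample_negative_edges(nodes, edges, num_samples, seed=0):
    """Sample node pairs that are NOT edges in the graph (negative examples).

    Assumes an undirected graph given as a node list and an edge list.
    Caveat: rejection sampling like this can loop for a long time on very
    dense graphs, where few non-edges exist.
    """
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}  # undirected: (u, v) == (v, u)
    negatives = set()
    while len(negatives) < num_samples:
        u, v = rng.sample(nodes, 2)  # two distinct nodes
        pair = frozenset((u, v))
        if pair not in edge_set and pair not in negatives:
            negatives.add(pair)
    return [tuple(p) for p in negatives]
```

Sampling one negative per positive edge in each split gives the roughly balanced positive/negative ratio described above.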

Randomness: The train-test-validation split should be random to ensure that the sets are representative of the overall graph and prevent overfitting to specific parts of the graph.

One common approach to dividing a graph into train-test-validation sets for link prediction is to randomly sample a percentage of the edges to use as the test set, and then randomly sample a percentage of the remaining edges to use as the validation set. The remaining edges are used as the training set.

Here are the steps to implement this approach:

1. Shuffle the edges in the graph randomly.
2. Split off a percentage of the edges (e.g. 10-20%) as the test set.
3. Split off another percentage of the remaining edges (e.g. 10-20%) as the validation set.
4. Use the remaining edges as the training set.
5. Ensure that each set has a balanced distribution of positive and negative edges.

Note that the exact percentages for the test and validation sets can vary depending on the size and complexity of the graph, as well as the specific task at hand. The percentages should be chosen based on empirical evaluation and experimentation.
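The steps above can be sketched as follows (a minimal pure-Python illustration; the function name and default fractions are my own choices, and a production split should additionally check that the training graph remains connected):

```python
import random

def train_val_test_split(edges, val_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split positive edges into train/validation/test sets.

    Shuffles the edge list, then carves off the test and validation
    fractions; everything left over becomes the training set.
    """
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n = len(edges)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = edges[:n_test]
    val = edges[n_test:n_test + n_val]
    train = edges[n_test + n_val:]
    return train, val, test
```

For example, splitting 100 edges with the defaults yields 80 training, 10 validation, and 10 test edges; negative edges would then be sampled separately for each set to keep the classes balanced.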