@article{aiyappa2024implicit,
title={Implicit degree bias in the link prediction task},
author={Rachith Aiyappa and Xin Wang and Munjung Kim and Ozgur Can Seckin and Jisung Yoon and Yong-Yeol Ahn and Sadamori Kojaku},
journal={arxiv: 2405.14985},
year={2024}
}
This repository provides the code to generate the degree-corrected link prediction task.
pip install "git+https://git@github.com/skojaku/degree-corrected-link-prediction.git#subdirectory=libs/dclinkpred&egg=dclinkpred"
or
git clone https://github.com/skojaku/degree-corrected-link-prediction.git
cd degree-corrected-link-prediction/libs/dclinkpred
pip install -e .
from dclinkpred import LinkPredictionDataset
import networkx as nx
# Create a karate club graph
G = nx.karate_club_graph()
# A networkx graph is accepted, but an adjacency matrix is recommended for efficiency
G = nx.adjacency_matrix(G)
lpdata = LinkPredictionDataset(
testEdgeFraction=0.2, # 20% of the edges will be used for testing
degree_correction=True, # degree correction will be applied
negatives_per_positive=10, # 10 negative samples will be generated for each positive sample
allow_duplicatd_negatives=False, # Do not allow duplicate negative edges
)
lpdata.fit(G) # Fit the dataset
train_net, src_test, trg_test, y_test = lpdata.transform() # Transform the dataset
train_net # The network for training
src_test # The source nodes of the test edges
trg_test # The destination nodes of the test edges
y_test # The labels of the test edges, where 1 means positive and 0 means negative
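The test arrays above can be fed to any link prediction scorer. The following is a minimal sketch of such an evaluation, not part of dclinkpred: the helper names `auc_roc` and `preferential_attachment_scores` are hypothetical, the toy arrays stand in for the outputs of `lpdata.transform()`, and preferential attachment is used only as an illustrative baseline scorer.

```python
from typing import List, Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative edge")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def preferential_attachment_scores(
    degree: Sequence[int], src: Sequence[int], trg: Sequence[int]
) -> List[int]:
    """Score each candidate edge by the product of its endpoint degrees."""
    return [degree[s] * degree[t] for s, t in zip(src, trg)]


# Toy stand-ins (assumption): in practice, `degree` would be computed from
# train_net, and src_test, trg_test, y_test would come from lpdata.transform().
degree = [3, 2, 2, 1]
src_test = [0, 0, 1, 2]
trg_test = [1, 2, 3, 3]
y_test = [1, 1, 0, 0]

scores = preferential_attachment_scores(degree, src_test, trg_test)
print(auc_roc(scores, y_test))
```

The pairwise loop is quadratic in the number of test edges, so it suits small checks only; for full-size test sets a library routine is the better choice.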
We provide all source code and data to reproduce the results in the paper. We tested the workflow under the following environment.
All code is provided in the reproduction/ directory. The expected execution time depends on the available computational resources. On our machine, equipped with 8 NVIDIA V100 GPUs and 64 CPUs, the entire workflow, including the robustness analysis, takes approximately one week.
We provide the source of the network data in the edge list format at FigShare.
The edge list is a CSV file with 2 columns representing the source and destination nodes of the network.
Download the data and place it in the reproduction/data/raw directory.
We recommend using Miniforge mamba to manage the packages.
Specifically, we build the conda environment with the following commands.
mamba create -n linkpred -c bioconda -c nvidia -c pytorch -c pyg python=3.11 cuda-version=12.1 pytorch torchvision torchaudio pytorch-cuda=12.1 snakemake graph-tool scikit-learn numpy==1.23.5 numba scipy==1.10.1 pandas polars networkx seaborn matplotlib gensim ipykernel tqdm black faiss-gpu pyg pytorch-sparse python-igraph -y
pip install adabelief-pytorch==0.2.0
pip install GPUtil powerlaw
You can also use the environment.yml file to create the conda environment.
mamba env create -f environment.yml
Additionally, we need the following custom packages to run the experiments.
These packages can be installed via pip as follows:
pip install git+https://github.com/skojaku/gnn-tools.git@v1.0
pip install git+https://github.com/skojaku/embcom.git@v1.01
And to install the LFR benchmark package:
git clone https://github.com/skojaku/LFR-benchmark
cd LFR-benchmark
python setup.py build
pip install -e .
We provide a Snakemake file to run the experiments. Before running Snakemake, you must create a config.yaml file under the reproduction/workflow/ directory.
data_dir: "data/"
small_networks: False
where data_dir is the directory where all data is located, and small_networks is a boolean value indicating whether to run the experiments on the small networks to test the code.
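The config file can also be generated from a script. The sketch below is an assumption-laden convenience, not part of the repository: `write_config` is a hypothetical helper that writes only the two keys described above, and the example runs in a temporary directory so that it works without a checkout of the repository.

```python
import tempfile
from pathlib import Path


def write_config(workflow_dir, data_dir="data/", small_networks=False):
    """Write a two-key config.yaml into workflow_dir and return its path."""
    path = Path(workflow_dir) / "config.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    flag = "True" if small_networks else "False"
    path.write_text(f'data_dir: "{data_dir}"\nsmall_networks: {flag}\n')
    return path


# Demonstrated in a temporary directory; inside the repository, the target
# would be "reproduction/workflow" instead.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_config(Path(tmp) / "workflow", small_networks=True)
    config_text = config_path.read_text()

print(config_text)
```

Setting `small_networks=True` here mirrors the testing mode described above; leave it at `False` for the full run.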
Once you have created the config.yaml file, move to the reproduction/ directory and run Snakemake as follows:
snakemake --cores <number of cores> all
or, to run it in the background and capture the output in a log file,
nohup snakemake --cores <number of cores> all >log &
Snakemake will preprocess the data, run the experiments, and generate the figures in the reproduction/figs/ directory.
New networks can be added to the experiment by adding a new file to the reproduction/data/raw directory.
The file should be in the edge list format with 2 columns representing the source and destination nodes of the network, e.g.,
1 2
1 3
1 4
where each row forms an edge between the source and destination nodes, and the node IDs should start from 1.
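Before adding a new network, it may help to check that the file follows this format. The following is a minimal sketch of such a check, not part of the workflow: `parse_edge_list` is a hypothetical helper, and it assumes that the file has no header row and that columns are separated by whitespace or commas.

```python
from typing import Iterable, List, Tuple


def parse_edge_list(lines: Iterable[str]) -> List[Tuple[int, int]]:
    """Parse two-column rows into (source, destination) integer pairs."""
    edges = []
    for lineno, line in enumerate(lines, start=1):
        # Accept either whitespace or comma as the column separator (assumption).
        fields = line.replace(",", " ").split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        edges.append((int(fields[0]), int(fields[1])))
    # The format above requires node IDs to start from 1.
    if edges and min(min(edge) for edge in edges) != 1:
        raise ValueError("node IDs should start from 1")
    return edges


edges = parse_edge_list(["1 2", "1 3", "1 4"])
print(edges)
```

To check a file on disk, pass an open file handle, e.g. `parse_edge_list(open(path))`, where `path` points to the new file in reproduction/data/raw.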