
MIT License

Boosted Graph Neural Networks

The code and data for the ICLR 2021 paper: Boost then Convolve: Gradient Boosting Meets Graph Neural Networks

This code contains implementations of the following models for graphs: CatBoost, LightGBM, GNN, ResGNN, and BGNN.

Installation

To run the models, you need to download the repo, install the requirements, and extract the datasets.

First, let's create a Python environment:

mkdir envs
cd envs
python -m venv bgnn_env
source bgnn_env/bin/activate
cd ..

Second, let's download the code and install the requirements:

git clone https://github.com/nd7141/bgnn.git 
cd bgnn
unzip datasets.zip
make install

Next, we need to install versions of PyTorch and DGL that match the CUDA version of your machine. We strongly encourage you to use a GPU-enabled build of DGL (training can be up to 100x faster).

First, determine your CUDA version with nvcc --version. Then, check the PyTorch installation instructions. For example, for CUDA 9.2, install it as follows:

pip install torch==1.7.1+cu92 torchvision==0.8.2+cu92 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

If you don't have GPU, use the following:

pip install torch==1.7.1+cpu torchvision==0.8.2+cpu torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html

Similarly, you need to install the DGL library. For example, for CUDA 9.2:

pip install dgl-cu92

For the CPU version of DGL:

pip install dgl

The versions of torch (1.7.1) and dgl shown above are the ones the code was tested with.

Running

The starting point is the file scripts/run.py:

python scripts/run.py dataset models
    (optional)
        --save_folder: str = None
        --task: str = 'regression'
        --repeat_exp: int = 1
        --max_seeds: int = 5
        --dataset_dir: str = None
        --config_dir: str = None

Available options for dataset include house, county, vk, and slap (see the datasets/ folder for the full list).

Available options for models are catboost, lightgbm, gnn, resgnn, bgnn, all.

Each model is specified by its config. Check the configs/ folder to set the parameters of each model and run.

Upon completion, the results will be saved in the specified folder (default: results/{dataset}/day_month/). This folder will contain aggregated_results.json with aggregated results for each model. Each model is reported with 4 numbers, in this order: mean metric (RMSE or accuracy), std of the metric, mean runtime, and std of the runtime. The file seed_results.json holds the results for each experiment and each seed. Additional folders contain loss values recorded during training.
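The aggregated results can be inspected with a few lines of Python. The file written below is a synthetic stand-in with a shape assumed from the description above (model name mapped to the four numbers); check your own aggregated_results.json for the exact layout:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

# Hypothetical sample of aggregated_results.json; the real file lives in
# results/{dataset}/day_month/. Assumed shape: model name -> [mean metric,
# std metric, mean runtime, std runtime].
sample = {
    "catboost": [0.63, 0.01, 12.4, 0.8],
    "bgnn":     [0.52, 0.02, 48.1, 3.5],
}

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "aggregated_results.json"
    path.write_text(json.dumps(sample))

    # Parse and print one summary line per model.
    results = json.loads(path.read_text())
    for model, (metric, metric_std, runtime, runtime_std) in results.items():
        print(f"{model:10s} metric={metric:.2f}±{metric_std:.2f} "
              f"runtime={runtime:.1f}s±{runtime_std:.1f}s")
```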


Examples

The following script will launch all models on the House dataset.

python scripts/run.py house all

The following script will launch the CatBoost and GNN models on the SLAP classification dataset.

python scripts/run.py slap catboost gnn --task classification

The following script will launch the LightGBM model on 5 splits of the data, repeating each experiment 3 times.

python scripts/run.py vk lightgbm --repeat_exp 3 --max_seeds 5

The following script will launch the resgnn and bgnn models, saving the results to a custom folder.

python scripts/run.py county resgnn bgnn --save_folder ./county_resgnn_bgnn
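If you want to script several of the runs above, the commands can be built programmatically and launched with subprocess. This is a hypothetical batch driver, not part of the repo; the command lists mirror the examples above:

```python
import subprocess  # used when actually launching the runs

# (dataset, models, extra CLI options) for each experiment.
experiments = [
    ("house", ["all"], {}),
    ("slap", ["catboost", "gnn"], {"--task": "classification"}),
    ("vk", ["lightgbm"], {"--repeat_exp": "3", "--max_seeds": "5"}),
]

commands = []
for dataset, models, options in experiments:
    cmd = ["python", "scripts/run.py", dataset, *models]
    for flag, value in options.items():
        cmd += [flag, value]
    commands.append(cmd)
    # To actually launch: subprocess.run(cmd, check=True)

print(commands[1])
```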

Running on your dataset

To run the code on your own dataset, you need to prepare the files in the right format.

You can check examples in datasets/ folder.

There should be at least X.csv (node features), y.csv (target labels), and graph.graphml (the graph in GraphML format).

Make sure to keep these filenames for your dataset.

You can also have cat_features.txt specifying names of categorical columns.

You can also have masks.json specifying train/val/test splits.
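The expected layout can be sketched with a toy dataset built from the standard library alone. The column names, node ids, and the exact masks.json shape below are assumptions for illustration; compare against the examples in the datasets/ folder before relying on them:

```python
import json
from pathlib import Path

out = Path("my_dataset")
out.mkdir(exist_ok=True)

# X.csv: one row of features per node (row order must match the graph nodes).
(out / "X.csv").write_text("age,city\n34,moscow\n51,paris\n29,moscow\n")

# y.csv: one target per node.
(out / "y.csv").write_text("target\n1.5\n0.2\n0.9\n")

# graph.graphml: the graph in GraphML format (3 nodes, 2 edges).
(out / "graph.graphml").write_text(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<graphml xmlns="http://graphml.graphdrawing.org/xmlns">\n'
    '  <graph id="G" edgedefault="undirected">\n'
    '    <node id="0"/><node id="1"/><node id="2"/>\n'
    '    <edge source="0" target="1"/>\n'
    '    <edge source="1" target="2"/>\n'
    '  </graph>\n'
    '</graphml>\n'
)

# Optional cat_features.txt: names of categorical columns, one per line.
(out / "cat_features.txt").write_text("city\n")

# Optional masks.json: train/val/test node indices per seed
# (shape assumed here; check datasets/ for the exact format).
masks = {"0": {"train": [0], "val": [1], "test": [2]}}
(out / "masks.json").write_text(json.dumps(masks))
```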

After that, run the script as usual:

python scripts/run.py path/to/your/dataset gnn catboost 

Citation

@inproceedings{
ivanov2021boost,
title={Boost then Convolve: Gradient Boosting Meets Graph Neural Networks},
author={Sergei Ivanov and Liudmila Prokhorenkova},
booktitle={International Conference on Learning Representations (ICLR)},
year={2021},
url={https://openreview.net/forum?id=ebS5NUfoMKL}
}