This is code to reproduce the results in the paper *Supervised Learning on Relational Databases with Graph Neural Networks*.
The file `docker/whole_project/environment.yml` lists all dependencies you need to install to run this code.
You can use conda to automatically create an environment from this file.
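A minimal sketch, assuming conda is installed (the environment's name is whatever the file specifies):

```bash
# Create a conda environment containing all dependencies
conda env create -f docker/whole_project/environment.yml
```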
You can also build a docker container that contains all dependencies. You'll need docker (or nvidia-docker, if you want to use a GPU) installed to do this.
The file `docker/whole_project/Dockerfile` builds a container that can run all experiments.
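For example, run from the root of this repo (the `rdb` image tag and the repo root as build context are assumptions, not something this repo prescribes):

```bash
# Build an image that can run all experiments
docker build -t rdb -f docker/whole_project/Dockerfile .
```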
I would love to have a link here where you could just download the prepared datasets. But unfortunately that would violate the Kaggle terms of service.
So you either need to follow the instructions below and build them yourself, or reach out to me by email and I may be able to provide them to you.
1) Set the `data_root` variable in `/__init__.py` to the location where you'd like to install the datasets. The default is `<HOME>/RDB_data`.
2) Download raw dataset files from Kaggle. You need a Kaggle account to do this. You only need to download the datasets you're interested in.
a) Put the [Acquire Valued Shoppers Challenge](https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data) data in `data_root/raw_data/acquirevaluedshopperschallenge`. Extract any compressed files.
b) Put the [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk/data) data in `data_root/raw_data/homecreditdefaultrisk`. Extract any compressed files.
c) Put the [KDD Cup 2014](https://www.kaggle.com/c/kdd-cup-2014-predicting-excitement-at-donors-choose/data) data in `data_root/raw_data/kddcup2014`. Extract any compressed files.
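If you use Kaggle's official command-line tool, downloading one of the datasets might look like this (assuming `data_root` is the default `<HOME>/RDB_data` and the `kaggle` CLI is already configured with your credentials):

```bash
# Download and extract the Home Credit Default Risk data
mkdir -p "$HOME/RDB_data/raw_data/homecreditdefaultrisk"
cd "$HOME/RDB_data/raw_data/homecreditdefaultrisk"
kaggle competitions download -c home-credit-default-risk
unzip '*.zip'
```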
3) Build the docker container specified in `docker/neo4j/Dockerfile`. This creates a container with the neo4j graph database installed, which is used to build the datasets.
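For example (the `rdb-neo4j` tag matches the image name used in the next step; using `docker/neo4j` as the build context is an assumption):

```bash
# Build the neo4j image used for dataset construction
docker build -t rdb-neo4j docker/neo4j
```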
4) Start the database server(s) for the datasets you want to build:
```bash
docker run -d -e "NEO4J_dbms_active__database=<db_name>.graph.db" --publish=7474:<port_for_browser> --publish=7687:<port_for_db> --mount type=bind,source=<path_to_code>/data/datasets/<db_name>,target=/data rdb-neo4j
```

where `<path_to_code>` is the location of this repo on your system, `<port_for_browser>` is an optional port for using the built-in neo4j data viewer (you can set it to `7474` if you don't care), and (`<db_name>`, `<port_for_db>`) is (`acquirevaluedshopperschallenge`, `9687`), (`homecreditdefaultrisk`, `10687`), or (`kddcup2014`, `7687`), respectively.
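For instance, to start the server for `homecreditdefaultrisk` with the browser port left at `7474` (`/path/to/repo` is a stand-in for wherever you cloned this repo):

```bash
docker run -d -e "NEO4J_dbms_active__database=homecreditdefaultrisk.graph.db" \
  --publish=7474:7474 --publish=7687:10687 \
  --mount type=bind,source=/path/to/repo/data/datasets/homecreditdefaultrisk,target=/data \
  rdb-neo4j
```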
5) Run `python -m data.<db_name>.build_database_from_kaggle_files` from the root directory of this repo.
6) (optional) To view the dataset in the built-in neo4j data viewer, navigate to `<your_machine's_ip_address>:7474` in a web browser, run `:server disconnect` to log off whatever your web browser thinks is the default neo4j server, and log into the right one by specifying `<port_for_browser>` in the web interface.
7) Run `python -m data.<db_name>.build_dataset_from_database` from the root directory of this repo.
8) (optional) Run `python -m data.<db_name>.build_db_info` from the root directory of this repo.
9) (optional) To create the tabular and DFS datasets used in the experiments, run `python -m data.<db_name>.build_DFS_features` from the root directory of this repo, then run `python -m data.<db_name>.build_tabular_datasets`, also from the root directory.
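Putting steps 5 through 9 together for a single dataset (here `homecreditdefaultrisk`), run from the root of this repo with that dataset's neo4j container already running:

```bash
# Build the neo4j database from the raw Kaggle files, then derive the datasets
python -m data.homecreditdefaultrisk.build_database_from_kaggle_files
python -m data.homecreditdefaultrisk.build_dataset_from_database
python -m data.homecreditdefaultrisk.build_db_info           # optional
python -m data.homecreditdefaultrisk.build_DFS_features      # optional, for DFS/tabular experiments
python -m data.homecreditdefaultrisk.build_tabular_datasets  # optional, for DFS/tabular experiments
```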
If you have your own relational dataset you'd like to use this system with, you can copy and modify the code in one of the `data/acquirevaluedshopperschallenge`, `data/homecreditdefaultrisk`, or `data/kddcup2014` directories to suit your purposes.
The main thing you have to do is create the `.cypher` script to get your data into a neo4j database. Once you've done that, nearly all the dataset-building code is reusable.
You'll also have to add your dataset's name in a few places in the codebase, e.g. in the `__init__` method of the `DatabaseDataset` class.
All experiments are started with the scripts in the `experiments` directory.
For example, to recreate the `PoolMLP` row in Tables 3 and 4 of the paper, you would run `python -m experiments.GNN.PoolMLP` from the root directory of this repo to start training, then run `python -m experiments.evaluate_experiments` when training is finished, and finally run `python -m experiments.GNN.print_and_plot_results`.
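In shell form (the same three commands, each run from the repo root, in order):

```bash
python -m experiments.GNN.PoolMLP                 # start training
python -m experiments.evaluate_experiments        # once training is finished
python -m experiments.GNN.print_and_plot_results  # print and plot the results
```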
By default, experiments run in tmux windows on your local machine, but you can also change the arguments in the `run_script_with_kwargs` command at the bottom of each experiment script to run them in a local docker container.
Or you can export the docker image built with `docker/whole_project/Dockerfile` to AWS ECR and modify the arguments in `experiments/utils/run_script_with_kwargs` to run all experiments on AWS Batch.
The content of the notes linked above is licensed under the Creative Commons Attribution 3.0 license, and the code in this repo is licensed under the MIT license.