hyrise / rl_index_selection

Paper repository for "SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning" (EDBT 2022)

SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning

This repository provides some additional experimental data for the EDBT 2022 paper SWIRL: Selection of Workload-aware Indexes using Reinforcement Learning and the source code for SWIRL. The repository is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.

If you have any questions, feel free to contact the authors, e.g., Jan Kossmann via jan.kossmann@hpi.de

Setup

The provided setup was tested only with Python 3.7.9 (due to TensorFlow version dependencies) and PostgreSQL 12.5. The implementation requires several Python libraries, which are listed in requirements.txt. Furthermore, there are two submodules:

  1. StableBaselines v2 for the RL algorithms. The submodule includes a modified version to enable invalid action masking.
  2. The index selection evaluation platform, in a slightly modified version that simplifies RL experiments. The platform handles hypothetical indexes as well as data generation and loading (adding -O2 to the Makefiles of the tpch-kit and tpcds-kit might speed up this process, see Miscellaneous below).
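As a rough illustration of what invalid action masking means in general, the sketch below zeroes out the probability of actions that are not allowed in the current state before an action is chosen. It is not the code from the modified StableBaselines submodule; the function name `masked_softmax` and the example values are hypothetical.

```python
import math


def masked_softmax(logits, valid_mask):
    """Return action probabilities with invalid actions forced to zero.

    Hypothetical helper for illustration only; not part of StableBaselines.
    """
    if len(logits) != len(valid_mask):
        raise ValueError("logits and valid_mask must have the same length")
    if not any(valid_mask):
        raise ValueError("at least one action must be valid")

    # Subtract the largest valid logit for numerical stability.
    max_valid = max(l for l, ok in zip(logits, valid_mask) if ok)

    # Invalid actions contribute nothing to the distribution.
    exps = [math.exp(l - max_valid) if ok else 0.0
            for l, ok in zip(logits, valid_mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Three candidate actions; the second one is masked out as invalid.
probs = masked_softmax([2.0, 1.0, 0.5], [True, False, True])
```

The masked action receives probability zero, and the remaining probabilities are renormalized over the valid actions only.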

Please refer to the install script and the README of the index selection evaluation platform before proceeding.

Example workflow and model training

git submodule update --init --recursive # Fetch submodules
python3.7 -m venv venv                  # Create virtualenv
source venv/bin/activate                # Activate virtualenv
pip install -r requirements.txt         # Install requirements with pip
python -m swirl experiments/tpch.json   # Run TPC-H example experiment

Experiments are controlled via a (mostly self-explanatory) JSON file. There is another example file in the experiments folder. Results are written to a configurable folder; for the test experiments, it is set to experiment_results. If you want to use TensorBoard for logging, create the necessary folder first: mkdir tensor_log.
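To get a quick overview of an experiment file before running it, a small helper along these lines may be useful. It is a minimal sketch that assumes nothing about the configuration keys and only reports the top-level keys it finds; the function name `summarize_config` is hypothetical and not part of SWIRL.

```python
import json
from pathlib import Path


def summarize_config(path):
    """Load a JSON experiment file and map each top-level key to its value type.

    Hypothetical helper for illustration; it makes no assumption about
    which keys a SWIRL experiment file contains.
    """
    with Path(path).open(encoding="utf-8") as f:
        config = json.load(f)
    if not isinstance(config, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in config.items()}
```

For example, `summarize_config("experiments/tpch.json")` would return a dictionary of the top-level options in the TPC-H example file and the Python type of each value.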

The index selection evaluation platform allows generating and loading TPC-H and TPC-DS benchmark data. It is recommended to populate a PostgreSQL instance with the benchmark data via the platform before executing the experiments. However, if the requested data (benchmark and scale factor) is not already present, the experiment should generate and load it, but this functionality is not well tested.

For descriptions of the components and functioning, consult our EDBT paper. Query files were reduced to 10 queries per template for efficiency reasons.

DRLinda as an RL-based Competitor

This repository will also contain a reimplementation of the reinforcement learning index selection approach DRLinda based on Sadri et al.'s publications shown below. The reimplementation consists of the following classes: DRLindaActionManager in action_manager.py, DRLindaObservationManager in observation_manager.py, DRLindaReward in reward_calculator.py, and a specialized environment in /gym_db/envs/db_env_v3.py. Results of comparisons with DRLinda are presented in the paper.

We describe our attempt to combine DRLinda with Lan et al.'s solution to achieve multi-attribute index support in experiments/drlinda_multi_attribute/.

Referenced publications

JSON Configuration files

The experiments and models are configured via JSON files. For examples, check the .json files in the experiments folder. In the following, we explain the different configuration options:

To be continued...

Miscellaneous

For the index_selection_evaluation platform's submodules: Before loading TPC-DS data, build the TPC-DS kit with optimization to speed up table generation:

@@ -56,7 +56,7 @@ CC            = $($(OS)_CC)
 # CFLAGS
 AIX_CFLAGS             = -q64 -O3 -D_LARGE_FILES
 HPUX_CFLAGS            = -O3 -Wall
-LINUX_CFLAGS   = -g -Wall
+LINUX_CFLAGS   = -O2 -g -Wall

This is similar for the TPC-H kit.
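If you prefer to apply this change with a script instead of editing the Makefile by hand, the following sketch performs the same edit on the Makefile text. It assumes the flags are set in a `LINUX_CFLAGS` assignment as in the diff above; the TPC-H kit's Makefile may use a different variable name, so check it before reusing this. The function name `add_o2_to_linux_cflags` is hypothetical.

```python
import re


def add_o2_to_linux_cflags(makefile_text):
    """Insert -O2 into the LINUX_CFLAGS assignment unless an -O flag is present.

    Hypothetical helper; operates on the Makefile content as a string.
    """
    def patch(match):
        prefix, flags = match.group(1), match.group(2)
        if re.search(r"(^|\s)-O\w*", flags):
            return match.group(0)  # an optimization level is already set
        return f"{prefix}-O2 {flags}"

    # Only lines that start with LINUX_CFLAGS are rewritten.
    return re.sub(
        r"^(LINUX_CFLAGS[ \t]*=[ \t]*)(.*)$",
        patch,
        makefile_text,
        flags=re.MULTILINE,
    )
```

To use it, read the kit's Makefile into a string, pass it through the function, and write the result back. Running it twice leaves the file unchanged, because the second pass detects the existing -O2 flag.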