This project is a stand-alone implementation of all the current adaptive and progressive multidimensional indexing algorithms.
Building the project requires CMake and a C++11-compliant compiler. Python 3.7 and Jupyter Notebook are necessary to run the scripted experiments and produce their respective plots.
We use three real datasets, each with its own workload, and one synthetic dataset following a uniform random distribution. For the synthetic dataset, we also provide eight different synthetic workloads.
All real datasets and workloads are publicly available here.
The Sloan Digital Sky Survey (SDSS) is a project to map the universe. Its data and queries are publicly available on its website. The dataset we use consists of two columns, ra and dec, from the photoobjall table, with approximately 69 million tuples. The workload consists of 100,000 real range queries executed over these two attributes.
The power benchmark consists of sensor data collected from a manufacturing installation, obtained from the DEBS 2012 challenge. The data set has three dimensions and 10 million tuples. The workload consists of random close-range queries on each dimension.
The 1000 Genomes Project collects data on human genomes. The dataset consists of 10 million tuples described by 19 dimensions. The workload consists of queries performed by bioinformaticians.
The synthetic dataset follows a uniform random distribution for each attribute in the table, consisting of 4-byte floating-point numbers in the range [0, 3 * 10^7), where 3 * 10^7 is the number of tuples. We use eight different synthetic workloads in our performance comparison. All workloads consist of queries with filters on all available dimensions and follow the form below:
SELECT SUM(R.A) FROM R WHERE (R.A BETWEEN a_low AND a_high) AND (R.B BETWEEN b_low AND b_high) ...
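Purely as an illustration of this setup (this is not the repository's data generator), the sketch below builds one uniform column in the value range described above and evaluates one random range query of this form; the names, seed, and selectivity are assumptions made only for the example.

```python
# Illustrative sketch of the synthetic setup (not the repo's generator):
# uniform 4-byte floats in [0, N) and one random range predicate on column A.
import numpy as np

N = 30_000_000                                   # 3 * 10^7 tuples
rng = np.random.default_rng(seed=42)
column_a = rng.uniform(0, N, size=N).astype(np.float32)

# One random predicate of the form "A BETWEEN a_low AND a_high";
# the width (here ~0.1% of the domain) depends on the workload pattern.
width = 0.001 * N
a_low = rng.uniform(0, N - width)
a_high = a_low + width

answer = column_a[(column_a >= a_low) & (column_a <= a_high)].sum()
print(answer)
```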
The values for low and high are chosen based on the workload pattern. The different workload patterns and their mathematical descriptions are depicted below.
First, we compile the code in release (-O3) mode:
cmake -DCMAKE_BUILD_TYPE=Release . && make
For the real data, execute the download_datasets.py script in the root directory to automatically download all datasets and workloads. You need approximately 50 GB of free disk space.
python3 download_datasets.py
All experiments are inside the experiments folder. There are 3 main sets of experiments. Note that unless stated otherwise, all algorithms run with min_partition_size = 1024, and all progressive indexing algorithms run with delta = 0.2.
cd experiments/real-data-workload
python3 run.py
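For context, delta bounds how much indexing work a progressive algorithm performs per query (a fixed fraction of the data), and min_partition_size is the partition size at which refinement stops. The toy single-column sketch below illustrates only the delta idea; it is a simplified conceptual illustration, not the repository's C++ implementation.

```python
# Conceptual sketch of a per-query indexing budget "delta": each query
# advances an incremental pivoting pass by at most DELTA * N elements,
# so index construction cost is spread over the first queries.
import random

N = 10_000
DELTA = 0.2                      # fraction of the column refined per query
data = [random.random() for _ in range(N)]

pivot = 0.5                      # split point of the (eventual) first partition
lo, hi = [], []                  # progressively built partitions
cursor = 0                       # how far the pivoting pass has advanced

def query(low, high):
    """Answer a range count while doing at most DELTA * N pivoting steps."""
    global cursor
    budget = int(DELTA * N)
    for value in data[cursor:cursor + budget]:
        (lo if value < pivot else hi).append(value)
    cursor += budget
    # Scan the not-yet-pivoted tail plus whichever partitions overlap the range.
    candidates = data[cursor:]
    if low < pivot:
        candidates = candidates + lo
    if high >= pivot:
        candidates = candidates + hi
    return sum(1 for v in candidates if low <= v <= high)

print(query(0.1, 0.2))   # early queries also pay part of the indexing cost
print(query(0.4, 0.6))   # after about 1/DELTA queries the pivoting pass is complete
```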
Before plotting, make sure the necessary Python packages are installed. First, check that you have pip:
pip --version
This should print pip's version information.
Next, install the necessary packages to plot the figures and tables:
pip install jupyterlab numpy pandas plotly matplotlib
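Optionally, a quick import check confirms the packages installed correctly:

```python
# Quick sanity check that the plotting dependencies are importable.
import numpy, pandas, plotly, matplotlib
print(numpy.__version__, pandas.__version__, plotly.__version__, matplotlib.__version__)
```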
After running an experiment, its folder contains a Python notebook you can use to plot and analyze the results. To do so, start the Jupyter notebook server in the desired folder:
jupyter notebook
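If you prefer to inspect results outside the notebooks, a minimal pandas/matplotlib sketch like the one below also works; the file name and column names here are assumptions, so adjust them to whatever the experiment script in your folder actually writes.

```python
# Minimal sketch for inspecting experiment output outside the notebooks.
# "results.csv", "algorithm", and "query_time" are assumed names; replace them
# with the files/columns produced by the experiment in your folder.
import pandas as pd
import matplotlib.pyplot as plt

results = pd.read_csv("results.csv")
for algorithm, group in results.groupby("algorithm"):
    plt.plot(range(len(group)), group["query_time"], label=algorithm)

plt.xlabel("query number")
plt.ylabel("per-query time (s)")
plt.legend()
plt.show()
```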