ALECE: An Attention-based Learned Cardinality Estimator for SPJ Queries on Dynamic Workloads

We propose an Attention-based LEarned Cardinality Estimator (ALECE for short), mainly for SPJ queries. The core idea is to discover the implicit relationships between queries and the underlying data by applying attention mechanisms in ALECE's two modules, which are built on top of carefully designed featurizations for data and queries. In particular, the data-encoder module aggregates information across all attributes in the database and learns their joint distribution, whereas the query-analyzer module builds a bridge between the query and data featurizations.
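For intuition, here is a minimal TensorFlow sketch of the two-module idea: self-attention over per-attribute data featurizations in a data encoder, and cross-attention from query featurizations to the encoded data states in a query analyzer. The layer sizes, pooling, and featurization shapes are illustrative assumptions, not the actual ALECE implementation.

```python
# Illustrative sketch of ALECE's two-module structure (not the authors' exact code).
# Assumptions: feature dimensions, number of heads, and mean pooling are made up.
import tensorflow as tf

D_MODEL, NUM_HEADS = 64, 4

class DataEncoder(tf.keras.layers.Layer):
    """Self-attention over per-attribute data featurizations."""
    def __init__(self):
        super().__init__()
        self.proj = tf.keras.layers.Dense(D_MODEL)
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=D_MODEL)
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, data_feats):                    # (batch, n_attrs, feat_dim)
        x = self.proj(data_feats)
        return self.norm(x + self.attn(x, x))      # aggregated attribute encodings

class QueryAnalyzer(tf.keras.layers.Layer):
    """Cross-attention from query featurizations to the encoded data states."""
    def __init__(self):
        super().__init__()
        self.proj = tf.keras.layers.Dense(D_MODEL)
        self.attn = tf.keras.layers.MultiHeadAttention(num_heads=NUM_HEADS, key_dim=D_MODEL)
        self.head = tf.keras.layers.Dense(1)       # predicts (log-)cardinality

    def call(self, query_feats, data_states):      # (batch, n_query_feats, feat_dim)
        q = self.proj(query_feats)
        bridged = self.attn(q, data_states)        # query attends to data encodings
        return self.head(tf.reduce_mean(bridged, axis=1))

# Toy forward pass with random featurizations.
data_feats = tf.random.normal((2, 10, 32))         # 10 attributes, 32-dim features each
query_feats = tf.random.normal((2, 6, 24))         # 6 query-feature slots
pred = QueryAnalyzer()(query_feats, DataEncoder()(data_feats))
print(pred.shape)                                  # (2, 1)
```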

Requirements

- Python 3.8+
- TensorFlow 2.10, numpy, scipy, psycopg2, argparse
- gcc 9+, openssl, cmake, readline

Dataset

- cd ALECE && mkdir data
- Download the STATS dataset (https://drive.google.com/file/d/1XR-xzQ1Ablqv5-LInRPp67PeXbeX_s4c/view?usp=sharing)
- Put STATS.tar.gz in data/
- cd data && tar zxvf STATS.tar.gz

Before Using the Code

Benchmark Preparation

Install PostgreSQL (on Linux):

- cd ALECE
- wget https://ftp.postgresql.org/pub/source/v13.1/postgresql-13.1.tar.bz2
- tar xvf postgresql-13.1.tar.bz2 && cd postgresql-13.1
- patch -s -p1 < ../pg_modify.patch
- ./configure --prefix=/usr/local/pgsql/13.1 --enable-depend --enable-cassert --enable-debug CFLAGS="-ggdb -O0"
- make -j 64 && sudo make install
- echo 'export PATH=/usr/local/pgsql/13.1/bin:$PATH' >> ~/.bashrc
- echo 'export LD_LIBRARY_PATH=/usr/local/pgsql/13.1/lib/:$LD_LIBRARY_PATH' >> ~/.bashrc
- source ~/.bashrc
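Once the server is initialized and running, a quick connectivity check with psycopg2 (listed in the requirements) might look like the snippet below; the database name, user, and port are placeholders for your local setup.

```python
# Quick sanity check that the freshly built PostgreSQL 13.1 accepts connections.
# "postgres"/"postgres"/5432 are placeholders; adjust to your local setup.
import psycopg2

conn = psycopg2.connect(dbname="postgres", user="postgres", host="localhost", port=5432)
with conn.cursor() as cur:
    cur.execute("SELECT version();")
    print(cur.fetchone()[0])   # should report PostgreSQL 13.1
conn.close()
```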

How to Generate Sub-Queries and Workloads?

- In data/STATS/workload/static, you will find two sql files: train_queries.sql and test_queries.sql.
- python benchmark/sub_queries_generator.py --data STATS --wl_data_type train (for train_queries.sql)
- python benchmark/sub_queries_generator.py --data STATS --wl_data_type test (for test_queries.sql)
- We provide three dynamic workloads: Insert-heavy, Update-heavy, and Dist-shift.
- Each workload is a mix of the training (sub-)queries, the testing (sub-)queries, and insert/delete/update statements.
- You can also randomly mix the training/testing (sub-)queries with data manipulation statements to build your own dynamic workload, as sketched below.
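A possible way to build such a custom workload is to randomly interleave the (sub-)query statements with DML statements. The file names below are hypothetical and the interleaving strategy is only an illustration, not the procedure used to produce the provided workloads.

```python
# Illustrative workload mixer: interleaves (sub-)queries with DML statements at random.
# Input/output file names are hypothetical; ALECE's own workload files may use a different format.
import random

def mix_workload(query_file, dml_file, out_file, seed=42):
    with open(query_file) as f:
        queries = [line.strip() for line in f if line.strip()]
    with open(dml_file) as f:
        dml = [line.strip() for line in f if line.strip()]
    statements = queries + dml
    random.Random(seed).shuffle(statements)        # random interleaving of queries and DML
    with open(out_file, "w") as f:
        f.write("\n".join(statements) + "\n")

# Example: mix training sub-queries with insert statements into a custom dynamic workload.
# mix_workload("train_queries.sql", "inserts.sql", "my_dynamic_workload.sql")
```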

Citation

@article{li2023alece,
  author  = {Pengfei Li and
             Wenqing Wei and
             Rong Zhu and
             Bolin Ding and
             Jingren Zhou and
             Hua Lu},
  title   = {{ALECE}: An Attention-based Learned Cardinality Estimator for {SPJ}
             Queries on Dynamic Workloads},
  journal = {Proc. {VLDB} Endow.},
  volume  = {17},
  number  = {2},
  pages   = {197--210},
  year    = {2023}
}

ALECE Training

- cd ALECE/src;
- python train.py --model ALECE --data STATS --wl_type ins_heavy --test_wl_type dist_shift

Then, in the directory ALECE/exp/STATS/e2e, you will find a file ALECE_STATS_ins_dist.txt containing the estimated cardinalities of the testing sub-queries in the Dist-shift workload. Note that the training data comes from a different workload (Insert-heavy).
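To get a quick sense of estimation quality, one could compute q-errors from the produced estimates. The snippet below assumes one numeric estimate per line and a parallel file of true cardinalities; this is an assumed layout, not necessarily the exact output format of train.py.

```python
# Hypothetical q-error computation over the estimates in ALECE_STATS_ins_dist.txt.
# Assumes one numeric estimate per line and a parallel ground-truth file; adjust to the real layout.
import numpy as np

est = np.loadtxt("ALECE_STATS_ins_dist.txt")
true = np.loadtxt("true_cardinalities.txt")         # hypothetical ground-truth file
q_err = np.maximum(est / true, true / est)          # q-error = max(est/true, true/est)
print(f"median={np.median(q_err):.2f}, p95={np.percentile(q_err, 95):.2f}")
```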

End-to-end Evaluation with Our Benchmark

- cd ALECE/src;
- python benchmark/e2e_eval.py --model ALECE --data STATS --wl_type ins_heavy