This repository contains the code for our VLDB2024 paper “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”.
src/datasets/spider
.
src/datasets/kaggledbqa
.src/datasets/drspider
.mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file
Please refer to requirements.txt
to download the relevant toolkits.
Prepare the following folders:
cd ZeroNL2SQL
mkdir logs
mkdir experimental_outputs/train/template_generator
mkdir experimental_outputs/train/aligner
experimental_outputs/train/template_generator
.experimental_outputs/train/aligner
.Use the following script to directly infer on the text-to-sql test set. This script will take four steps: 1. generate SQL template; 2. align (SELECT, STRUCTURE) with the user question; 3. prepare data for LLM inference; 4. text2sql using LLM.
CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key
kaggledbqa
, DB_DBcontent_equivalence
, DB_schema_abbreviation
, DB_schema_synonym
, NLQ_keyword_synonym
, NLQ_keyword_carrier
, NLQ_column_synonym
, NLQ_column_carrier
, NLQ_column_attribute
, NLQ_column_value
, NLQ_value_synonym
, NLQ_multitype
, NLQ_others
, SQL_comparison
, SQL_sort_order
, SQL_NonDB_number
, SQL_DB_text
, SQL_DB_number
. Note that we evaluate the text-to-SQL results using the test_suite_evaluation, and the evaluation results are presented in eval.output
.
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"
The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/
.
CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"
The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl
.
If our code is helpful to you, please cite our work:
@misc{gu2023interleaving,
title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation},
author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
year={2023},
eprint={2306.08891},
archivePrefix={arXiv},
primaryClass={cs.CL}
}