
ZeroNL2SQL

:thought_balloon: Introduction

This repository contains the code for our VLDB2024 paper “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”.

:open_file_folder: Data Preparation

Test data

mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip files
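
After extraction, data/ should contain one folder per benchmark. As a quick sanity check (the folder names are an assumption based on the zip archive names):

ls data/
# expected: drspider/  kaggledbqa/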

:computer: Environment Preparation


Please refer to requirements.txt to install the required packages.
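
With pip, the installation is a single step:

pip install -r requirements.txt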

Prepare the following folders:

cd ZeroNL2SQL
mkdir -p logs
mkdir -p experimental_outputs/train/template_generator
mkdir -p experimental_outputs/train/aligner

:zap: Quick Start

Download models

Text-to-SQL inference

Use the following script to run inference directly on a text-to-SQL test set. The script performs four steps:

1. Generate a SQL template.
2. Align (SELECT, STRUCTURE) with the user question.
3. Prepare the data for LLM inference.
4. Generate the final SQL with the LLM.

CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key
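
For example, using GPU 0 and assuming the dataset folder names above double as valid test-set identifiers (an assumption; check scripts/infer_LLM_with_template.sh for the accepted names):

CUDA_VISIBLE_DEVICES=0 bash scripts/infer_LLM_with_template.sh drspider $OPENAI_API_KEY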

Note that we evaluate the text-to-SQL results with test-suite evaluation (test_suite_evaluation); the evaluation results are written to eval.output.
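
For reference, a typical invocation of the upstream test-suite evaluation script (from the test-suite-sql-eval project) is sketched below; the gold, prediction, and database paths are placeholders, not paths produced by this repository:

python evaluation.py \
    --gold gold.sql \
    --pred pred.sql \
    --db database/ \
    --table tables.json \
    --etype exec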

:open_hands: Train From Scratch

Train template generator

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"

The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/.

Train aligner

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"

The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl.
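
Before running the Quick Start with your own checkpoints, you can verify that both training artifacts from the steps above are in place:

ls experimental_outputs/train/template_generator/BEST_MODEL/
ls -lh experimental_outputs/train/aligner/checkpoint_best.pkl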

:speech_balloon: Citation

If our code is helpful to you, please cite our work:

@misc{gu2023interleaving,
      title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation}, 
      author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
      year={2023},
      eprint={2306.08891},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}