ZeroNL2SQL

:thought_balloon: Introduction

This repository contains the code for our VLDB2024 paper “Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL”.

:open_file_folder: Data Preparation

Train data

Spider: Put it under src/datasets/spider.
Test data
KaggleDBQA: Put it under src/datasets/kaggledbqa.
Dr.Spider: Put it under src/datasets/drspider.

mkdir data/
unzip src/datasets/kaggledbqa/kaggledbqa.zip -d data/
unzip src/datasets/drspider/drspider.zip -d data/
# Don't delete the original .zip file

:computer: Environment Preparation

Please refer to requirements.txt to download the relevant toolkits.

Prepare the following folders:

cd ZeroNL2SQL
mkdir logs
mkdir experimental_outputs/train/template_generator
mkdir experimental_outputs/train/aligner

:zap: Quick Start

Download models

Template Generator: Put it under experimental_outputs/train/template_generator.
Aligner: Put it under experimental_outputs/train/aligner.

Text-to-SQL inference

Use the following script to directly infer on the text-to-sql test set. This script will take four steps: 1. generate SQL template; 2. align (SELECT, STRUCTURE) with the user question; 3. prepare data for LLM inference; 4. text2sql using LLM.

CUDA_VISIBLE_DEVICES={gpu_id} bash scripts/infer_LLM_with_template.sh test_set_name your_openai_key

The first argument is the name of the test set, which can be selected from kaggledbqa, DB_DBcontent_equivalence, DB_schema_abbreviation, DB_schema_synonym, NLQ_keyword_synonym, NLQ_keyword_carrier, NLQ_column_synonym, NLQ_column_carrier, NLQ_column_attribute, NLQ_column_value, NLQ_value_synonym, NLQ_multitype, NLQ_others, SQL_comparison, SQL_sort_order, SQL_NonDB_number, SQL_DB_text, SQL_DB_number.
The second argument is your openai key, which you can obtain from the official website.

Note that we evaluate the text-to-SQL results using the test_suite_evaluation, and the evaluation results are presented in eval.output.

:open_hands: Train From Scratch

Train template generator

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run.py configs/train_template_generator.json"

The best model will be saved at experimental_outputs/train/template_generator/BEST_MODEL/.

Train aligner

CUDA_VISIBLE_DEVICES={gpu_id} bash -c "python src/run_aligner.py configs/train_aligner.json"

The best model will be saved at experimental_outputs/train/aligner/checkpoint_best.pkl.

:speech_balloon:Citation

If our code is helpful to you, please cite our work:

@misc{gu2023interleaving,
      title={Interleaving Pre-Trained Language Models and Large Language Models for Zero-Shot NL2SQL Generation}, 
      author={Zihui Gu and Ju Fan and Nan Tang and Songyue Zhang and Yuxin Zhang and Zui Chen and Lei Cao and Guoliang Li and Sam Madden and Xiaoyong Du},
      year={2023},
      eprint={2306.08891},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

ruc-datalab / ZeroNL2SQL

readme