CSpider is a large Chinese dataset for the complex and cross-domain semantic parsing and text-to-SQL task (building natural language interfaces for relational databases). It was released with our EMNLP 2019 paper: *A Pilot Study for Chinese SQL Semantic Parsing*. This repo contains all code for evaluation, preprocessing, and all baselines used in our paper. Please refer to the task site for a more general introduction and the leaderboard.
10/2019

We started a Chinese text-to-SQL task with the full dataset translated from Spider. The submission tutorial and our dataset can be found at our task site. Please follow it to get your results on the unreleased test data. Thanks to Tao Yu for sharing the test set with us.

9/2019

The dataset used in our EMNLP 2019 paper is re-divided based on the training and development sets from Spider. The dataset can be downloaded from here. This dataset is released only to reproduce the results in our paper. To join the CSpider leaderboard and better compare with the original English results, please refer to our task site for the full dataset.

When you use the CSpider dataset, we would appreciate it if you cite the following:
```
@inproceedings{min2019pilot,
  title={A Pilot Study for Chinese SQL Semantic Parsing},
  author={Min, Qingkai and Shi, Yuefeng and Zhang, Yue},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
  pages={3643--3649},
  year={2019}
}
```
Our dataset is based on Spider; please cite it as well.
To set up the environment:

```
conda install pytorch=0.2.0 -c pytorch
pip install -r requirements.txt
```
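A quick sanity check that the pinned PyTorch version was installed correctly (this check is our suggestion, not part of the original setup scripts):

```bash
# The code targets the old 0.2.0 API, so verify the version before training.
python -c "import torch; print(torch.__version__)"  # expected: 0.2.0
```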
1. Download the data files:
   - `train.json` and `dev.json`, to be put under the `chisp/data/char/` directory. To use word-based methods, please do the word segmentation first (see the sketch after this list) and put the json files under the `chisp/data/word/` directory.
   - `char_emb.txt`, to be put under the `chisp/embedding/` directory. This is generated from the Tencent multilingual embeddings for the cross-lingual word embedding schema. To use the monolingual embedding schema, step 2 is also necessary.
   - the `database` directory, to be put under the `chisp/` directory.
   - `train_gold.sql` and `dev_gold.sql`, to be put under the `chisp/data/` directory.

   Alternatively, download the whole `data`, `database` and `embedding` directories and put them under the `chisp/` directory; you can then run all the experiments shown in our paper (step 2 is still necessary). The `models` directory contains all the pretrained models.

2. Download the pretrained GloVe embeddings and put them as `chisp/embedding/glove.%dB.%dd.txt`.
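For the word segmentation mentioned in step 1, any off-the-shelf Chinese segmenter should work; the paper does not mandate a specific tool. A minimal sketch using jieba (the file names here are hypothetical placeholders):

```bash
pip install jieba
# Segment a plain-text dump of the questions, one per line,
# inserting a space between words.
python -m jieba -d ' ' questions.txt > questions_seg.txt
```

In practice you would apply the same segmentation to the question text of each entry in `train.json` and `dev.json` before writing the results to `chisp/data/word/`.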
Then run the preprocessing script with either the character-based or the word-based setting:

```
python preprocess_data.py -s char|word
```
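For example, to build both versions used in the paper (assuming the flag takes one mode at a time, as the `char|word` notation suggests):

```bash
python preprocess_data.py -s char   # character-based setting
python preprocess_data.py -s word   # word-based setting (requires segmented data)
```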
`data/` contains:

- `char/` for the character-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at `char/generated_datasets`.
- `word/` for the word-based raw train/dev/test data; the corresponding processed dataset and saved models can be found at `word/generated_datasets`.
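Putting the download and preprocessing steps together, the resulting layout should look roughly like this (a sketch reconstructed from the paths above, not an authoritative listing):

```
chisp/
├── data/
│   ├── char/                  # train.json, dev.json, generated_datasets/
│   ├── word/                  # segmented json files, generated_datasets/
│   ├── train_gold.sql
│   └── dev_gold.sql
├── database/
├── embedding/
│   ├── char_emb.txt
│   └── glove.%dB.%dd.txt      # for the monolingual embedding schema
└── models/                    # pretrained models
```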
- `train.py` is the main file for training. Use `train_all.sh` to train all the modules (see below).
- `test.py` is the main file for testing. It uses `supermodel.py` to call the trained modules and generate SQL queries. In practice, use `test_gen.sh` to generate SQL queries.
- `evaluation.py` is for evaluation. It uses `process_sql.py`. In practice, use `evaluation.sh` to evaluate the generated SQL queries.

Run `train_all.sh` to train all the modules.
It looks like:
```
python train.py \
    --data_root path/to/char/or/word/based/generated_data \
    --save_dir path/to/save/trained/module \
    --train_component <module_name> \
    --emb_path path/to/embeddings \
    --col_emb_path path/to/corresponding/embeddings/for/column
```
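For instance, to train a single module on the character-based data (all paths here are illustrative, and the module name `multi_sql` follows the SyntaxSQLNet component naming; check `train_all.sh` for the exact names used in this repo):

```bash
python train.py \
    --data_root data/char/generated_datasets \
    --save_dir saved_models/char \
    --train_component multi_sql \
    --emb_path embedding/char_emb.txt \
    --col_emb_path embedding/char_emb.txt
```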
Run `test_gen.sh` to generate SQL queries. It looks like:
```
python test.py \
    --test_data_path path/to/char/or/word/based/raw/dev/or/test/data \
    --models path/to/trained/module \
    --output_path path/to/print/generated/SQL \
    --emb_path path/to/embeddings \
    --col_emb_path path/to/corresponding/embeddings/for/column
```
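For instance, to generate queries for the character-based dev set (again with illustrative paths):

```bash
python test.py \
    --test_data_path data/char/dev.json \
    --models saved_models/char \
    --output_path predicted_sql.txt \
    --emb_path embedding/char_emb.txt \
    --col_emb_path embedding/char_emb.txt
```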
Run `evaluation.sh` to evaluate the generated SQL queries. It looks like:
```
python evaluation.py \
    --gold path/to/gold/dev/or/test/queries \
    --pred path/to/predicted/dev/or/test/queries \
    --etype evaluation/metric \
    --db path/to/database \
    --table path/to/tables
```
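Since the evaluation script comes from Spider, `--etype` should accept the Spider metrics (`match` for exact set match, `exec` for execution accuracy, `all` for both). An illustrative run on the dev set, assuming the paths used in the earlier examples:

```bash
python evaluation.py \
    --gold data/dev_gold.sql \
    --pred predicted_sql.txt \
    --etype match \
    --db database \
    --table data/tables.json
```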
`evaluation.py` follows the general evaluation process from the Spider GitHub page.
The implementation is based on SyntaxSQLNet. Please cite it too if you use this code.