taoyds / sparc

scripts and baselines for SParC: Yale & Salesforce Semantic Parsing and Text-to-SQL in Context Challenge
https://yale-lily.github.io/sparc
73 stars 12 forks source link

SParC: Cross-Domain Semantic Parsing in Context

SParC is a large dataset for complex, cross-domain, and context-dependent(multi-turn) semantic parsing and text-to-SQL task (interactive natural language interfaces for relational databases) built upon our Spider task. It is released along with our ACL 2019 paper: SParC: Cross-Domain Semantic Parsing in Context. This repo contains all code for evaluation and baselines used in our paper. Please refer to the SParC task page for more general introduction and the leaderboard.

:+1: 03/20/2022: We open-sourced a simple but SOTA model (just T5) for the task! Please check out our code in the UnifiedSKG repo!!

Changelog

Citation

The dataset is annotated by 14 college students. When you use the SParC dataset, we would appreciate it if you cite the following:

@InProceedings{Yu&al.19,
  title     = {SParC: Cross-Domain Semantic Parsing in Context},
  author    = {Tao Yu and Rui Zhang and Michihiro Yasunaga and Yi Chern Tan and Xi Victoria Lin and Suyi Li and Heyang Er, Irene Li and Bo Pang and Tao Chen and Emily Ji and Shreya Dixit and David Proctor and Sungrok Shim and Jonathan Kraft, Vincent Zhang and Caiming Xiong and Richard Socher and Dragomir Radev},
  booktitle = {Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics},
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics}
}

@inproceedings{Yu&al.18c,
  title     = {Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task},
  author    = {Tao Yu and Rui Zhang and Kai Yang and Michihiro Yasunaga and Dongxu Wang and Zifan Li and James Ma and Irene Li and Qingning Yao and Shanelle Roman and Zilin Zhang and Dragomir Radev}
  booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  address   = "Brussels, Belgium",
  publisher = "Association for Computational Linguistics",
  year      = 2018
}

Installation

evaluation.py and process_sql.py were tested in Python 2.7.

Data Content and Format

Question, SQL, and Parsed SQL

Each file intrain.json and dev.json contains the following fields:

{
        "database_id": "car_1", 
        "interaction": [
            {
                "query": "Select T1.id, T2.model from cars_data as T1 join car_names as T2 on T1.id = T2.MakeId where T1.year = '1970';", 
                "utterance_toks": [
                    "What", 
                    "are", 
                    "the", 
                    ...
                    "?"
                ], 
                "utterance": "What are the ids, and models of the cars were made in 1970?", 
                "sql": {
                    "orderBy": [], 
                    "from": {
                        "table_units": [
                            [
                                "table_unit", 
                                5
                            ...
                    }, 
                    "union": null, 
                    "except": null, 

            }, 
            {
                "query": "Select T1.horsepower, T1.mpg, T1.id, T2.model from cars_data as T1 join car_names as T2 on T1.id = T2.MakeId where T1.year = '1970';", 
                "utterance_toks": [
                    "Show", 
                    ...
                    "well", 
                    "?"
                ], 
                "utterance": "Show their horsepower and MPG as well?", 
                "sql": {
                    "orderBy": [], 
                    "from": {
                        ...
                    ]
                }
            }, 
            {
                "query": "SELECT T1.Maker, T4.horsepower, T4.mpg, T4.id, T2.model FROM CAR_MAKERS AS T1 JOIN MODEL_LIST AS T2 ON T1.Id  =  T2.Maker JOIN CAR_NAMES AS T3 ON T2.model  =  T3.model JOIN CARS_DATA AS T4 ON T3.MakeId  =  T4.id WHERE T4.year  =  '1970';", 
                "utterance_toks": [
                    "Also", 
                    ...
                    "makers", 
                    "!"
                ], 
                "utterance": "Also provide the names of their makers!", 
                "sql": {
                    "orderBy": [], 
                    "from": {
                        ...
                }
            }, 
            {
                "query": "SELECT DISTINCT T1.Maker FROM CAR_MAKERS AS T1 JOIN MODEL_LIST AS T2 ON T1.Id  =  T2.Maker JOIN CAR_NAMES AS T3 ON T2.model  =  T3.model JOIN CARS_DATA AS T4 ON T3.MakeId  =  T4.id WHERE T4.year  =  '1970';", 
                "utterance_toks": [
                    "Just", 
                    "show", 
                    ...
                    "makers", 
                    "."
                ], 
                "utterance": "Just show a unique list of all these different makers.", 
                "sql": {
                    ...
                    "union": null, 
                    "except": null, 
                    "having": [], 
                    "limit": null, 
                    "intersect": null, 
                    "where": [
                        [
                            false, 
                           ...
            }
        ], 
        "final": {
            "query": "SELECT DISTINCT T1.Maker FROM CAR_MAKERS AS T1 JOIN MODEL_LIST AS T2 ON T1.Id  =  T2.Maker JOIN CAR_NAMES AS T3 ON T2.model  =  T3.model JOIN CARS_DATA AS T4 ON T3.MakeId  =  T4.id WHERE T4.year  =  '1970';", 
            "utterance": "Find the name of the makers that produced some cars in the year of 1970?"
        }, 
}

Tables

tables.json contains the following information for each database:

{
    "column_names": [
      [
        0,
        "id"
      ],
      [
        0,
        "name"
      ],
      [
        0,
        "country code"
      ],
      [
        0,
        "district"
      ],
      .
      .
      .
    ],
    "column_names_original": [
      [
        0,
        "ID"
      ],
      [
        0,
        "Name"
      ],
      [
        0,
        "CountryCode"
      ],
      [
        0,
        "District"
      ],
      .
      .
      .
    ],
    "column_types": [
      "number",
      "text",
      "text",
      "text",
         .
         .
         .
    ],
    "db_id": "world_1",
    "foreign_keys": [
      [
        3,
        8
      ],
      [
        23,
        8
      ]
    ],
    "primary_keys": [
      1,
      8,
      23
    ],
    "table_names": [
      "city",
      "sqlite sequence",
      "country",
      "country language"
    ],
    "table_names_original": [
      "city",
      "sqlite_sequence",
      "country",
      "countrylanguage"
    ]
  }

Databases

All table contents are contained in corresponding SQLite3 database files.

Data Process

Please refer to the Spider Github page for parsing SQL queries .

Evaluation

We follow the Spider evaluation methods to compute Component Matching, Exact Set Matching, and Execution Accuracies. Check out more details at the Spider Github page.

You can find some evaluation examples here.

python evaluation.py --gold [gold file] --pred [predicted file] --etype [evaluation type] --db [database dir] --table [table file]

arguments:
  [gold file]        gold.txt file where each line is `a gold SQL \t db_id`, and interactions are seperated by one empty line.
  [predicted file]   predicted sql file where each line is a predicted SQL, and interactions are seperated by one empty line.
  [evaluation type]  "match" for exact set matching score, "exec" for execution score, and "all" for both.
  [database dir]     directory which contains sub-directories where each SQLite3 database is stored.
  [table file]       table.json file which includes foreign key info of each database.

Baseline Models

Please go to this Github page for the Pytorch implementation of the CD-Seq2Seq baseline that can run on both SParC and ATIS tasks.

FAQ