Spider: Yale Semantic Parsing and Text-to-SQL Challenge
DESCRIPTION:
Spider 1.0
Yale Semantic Parsing and Text-to-SQL Challenge
What is Spider?
Feb. 5th, 2024: We will no longer accept submissions for Spider 1.0 evaluations or update its leaderboard. Look forward to the release of Spider 2.0, a more realistic and challenging benchmark in the era of LLMs, expected this March. Stay tuned!
Spider is a large-scale complex and cross-domain semantic parsing and text-to-SQL dataset annotated by 11 Yale students. The goal of the Spider challenge is to develop natural language interfaces to cross-domain databases. It consists of 10,181 questions and 5,693 unique complex SQL queries on 200 databases with multiple tables covering 138 different domains. In Spider 1.0, different complex SQL queries and databases appear in train and test sets. To do well on it, systems must generalize well to not only new SQL queries but also new database schemas.
Why do we call it "Spider"?
Because our dataset is complex and cross-domain, like a spider crawling across many complex nests (databases with many foreign keys).
Related works: DS-1000, Binder, UnifiedSKG, multi-turn SParC, and conversational CoSQL text-to-SQL tasks.
News
02/05/2024
We will no longer accept submissions for Spider 1.0 evaluations or update its leaderboard. The test set of Spider 1.0 has already been released (check the Spider dataset link below). Look forward to the release of Spider 2.0, a more realistic and challenging benchmark in the era of LLMs, expected this March. Stay tuned!
08/10/2023 Please check out XLang language model agents!
05/27/2023 Please check out Dr.Spider, a robustness evaluation benchmark based on Spider, from AWS AI Lab for studying robustness in semantic parsing!
11/20/2022 Please check out our recent work DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation. Examples, data, and code are available on the DS-1000 project site!
10/18/2022 Please check out our recent work Binder: a simple but state-of-the-art neural-symbolic framework built on GPT-3 Codex and a SQL/Python interpreter. It injects GPT-3 Codex API calls into programming languages! The Binder demo, code, paper, and video are available on the Binder project site!
01/18/2022 Please check out our recent work UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models. We open-sourced simple but strong (SOTA) models for 21 tasks, including text-to-SQL! Please check out our code in the UnifiedSKG repo!
03/11/2021 Please check out a nice work from Google Research (including new Spider splits) for studying compositional generalization in semantic parsing!
11/15/2020 We will use Test Suite Accuracy as our official evaluation metric for Spider, SParC, and CoSQL. Please find the evaluation code here. Also, note that test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
08/03/2020 Corrected "column_name" and "column_name_original" mismatches in 2 dbs ("scholar" and "formula_1") in tables.json, and reparsed SQL queries (this only affects some models (e.g. RATSQL) which use our parsed SQL as the SQL input). Please download the Spider dataset from this page again.
06/07/2020 We corrected some annotation errors and label mismatches (not errors) in Spider dev and test sets (~4% of dev examples updated, click here for more details). Please download the Spider dataset from this page again.
01/16/2020 For value prediction (in order to compute the execution accuracy), your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (database content is available), or 3) generate numbers (e.g. 3 in "LIMIT 3").
9/24/2019 (Min et al., EMNLP 2019) translated Spider to Chinese! Check out the Chinese challenge page.
5/17/2019 Our paper SParC: Cross-Domain Semantic Parsing in Context with Salesforce Research was accepted to ACL 2019! It introduces the context-dependent version of the Spider challenge: SParC!
5/17/2019 Please report any annotation errors here; we really appreciate your help and will update the data release this summer!
1/14/2019 The submission tutorial is out!
12/17/2018 We updated 7 sqlite database files (issue 14). Please download the Spider dataset from this page again.
10/25/2018 The evaluation script and results were updated (issue 5). Please download the latest versions of the script and papers. Also, please follow the instructions in issue 3 to generate the latest SQL parsing results (a bug was fixed).
Why Spider?
As the above spider chart shows, Spider 1.0 is distinct from most of the previous semantic parsing tasks because:
ATIS, Geo, Academic: Each contains only a single database with a limited number of SQL queries, and the exact same SQL queries appear in the train and test splits.
WikiSQL: The numbers of SQL queries and tables are significantly larger, but all SQL queries are simple, and each database is just a single table with no foreign keys.
Spider 1.0 spans the largest area in the chart, making it the first complex and cross-domain semantic parsing and text-to-SQL dataset! Read more on the blog post.
Getting Started
The data is split into training, development, and test sets. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
Details of baseline models and evaluation script can be found on the following GitHub site:
Data Examples
Some examples look like the following:
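The original page shows the examples as screenshots. As a stand-in, here is a minimal sketch of how the released JSON is typically laid out (field names follow the distributed train_spider.json; the concrete question and query below are illustrative, not quoted from the release):

```python
import json

# Path assumes the unpacked Spider release sits next to this script.
with open("spider/train_spider.json") as f:
    examples = json.load(f)

# Each entry pairs a natural-language question with its gold SQL query
# and the database (db_id) it should be executed against, roughly:
#   {"db_id": "concert_singer",
#    "question": "How many singers do we have?",
#    "query": "SELECT count(*) FROM singer", ...}
for ex in examples[:3]:
    print(f'{ex["db_id"]}: {ex["question"]} -> {ex["query"]}')
```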
Have Questions or Want to Contribute?
Ask us questions on our GitHub issues page or contact Tao Yu, Rui Zhang, or Michihiro Yasunaga.
We expect the dataset to evolve. We would greatly appreciate it if you could donate your non-private databases or SQL queries to the project.
Acknowledgement
We thank Graham Neubig, Tianze Shi, Catherine Finegan-Dollak, and the anonymous reviewers for their valuable comments on this project. We also thank Pranav Rajpurkar for giving us permission to build this website based on SQuAD.
Our team at the summit of East Rock Park in New Haven (the pose is "NLseq2SQL"):
Leaderboard - Execution with Values
Our current models do not predict any values in SQL conditions, so we did not originally report execution accuracies. However, we encourage you to provide them in future submissions. For value prediction, your model should be able to 1) copy from the question inputs, 2) retrieve from the database content (database content is available), or 3) generate numbers (e.g., 3 in "LIMIT 3"). Notice: test results after May 02, 2020 are reported on the new release (which corrected some annotation errors).
| Rank | Model | Test |
|------|-------|------|
| 1 | Nov 2, 2023. MiniSeek, Anonymous. Code and paper coming soon | 91.2 |
| 1 | Aug 20, 2023. DAIL-SQL + GPT-4 + Self-Consistency, Alibaba Group (Gao and Wang et al., 2023), code | 86.6 |
| 2 | Aug 9, 2023. DAIL-SQL + GPT-4, Alibaba Group (Gao and Wang et al., 2023), code | 86.2 |
| 3 | October 17, 2023. DPG-SQL + GPT-4 + Self-Correction, Anonymous. Code and paper coming soon | 85.6 |
| 4 | Apr 21, 2023. DIN-SQL + GPT-4, University of Alberta (Pourreza et al., 2023), code | 85.3 |
| 5 | July 5, 2023. Hindsight Chain of Thought with GPT-4, Anonymous. Code and paper coming soon | 83.9 |
| 6 | Jun 1, 2023. C3 + ChatGPT + Zero-Shot, Zhejiang University & Hundsun (Dong et al., 2023), code | 82.3 |
| 7 | July 5, 2023. Hindsight Chain of Thought with GPT-4 and Instructions, Anonymous. Code and paper coming soon | 80.8 |
| 8 | Feb 7, 2023. RESDSQL-3B + NatSQL (DB content used), Renmin University of China (Li et al., AAAI '23), code | 79.9 |
| 9 | Nov 21, 2022. SeaD + PQL (DB content used), Anonymous | 78.5 |
| 10 | Apr 21, 2023. DIN-SQL + CodeX, University of Alberta (Pourreza et al., 2023), code | 78.2 |
| 11 | August 10, 2023. T5-3B + NatSQL + Token Preprocessing (DB content used), George Mason University & MIT (Rai et al., ACL '23), code | |
546: [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
### Details (similarity score: 0.9)
- [ ] [[2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction](https://arxiv.org/abs/2304.11015)
# [2304.11015] DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
**DESCRIPTION:**
DIN-SQL: Decomposed In-Context Learning of Text-to-SQL with Self-Correction
Mohammadreza Pourreza, Davood Rafiei
There is currently a significant gap between the performance of fine-tuned models and prompting approaches using Large Language Models (LLMs) on the challenging task of text-to-SQL, as evaluated on datasets such as Spider. To improve the performance of LLMs in the reasoning process, we study how decomposing the task into smaller sub-tasks can be effective. In particular, we show that breaking down the generation problem into sub-problems and feeding the solutions of those sub-problems into LLMs can be an effective approach for significantly improving their performance. Our experiments with three LLMs show that this approach consistently improves their simple few-shot performance by roughly 10%, pushing the accuracy of LLMs towards SOTA or surpassing it. On the holdout test set of Spider, the SOTA, in terms of execution accuracy, was 79.9 and the new SOTA at the time of this writing using our approach is 85.3. Our approach with in-context learning beats many heavily fine-tuned models by at least 5%. Additionally, when evaluated on the BIRD benchmark, our approach achieved an execution accuracy of 55.9%, setting a new SOTA on its holdout test set.
URL: [https://arxiv.org/abs/2304.11015](https://arxiv.org/abs/2304.11015)
#### Suggested labels
#### {'label-name': 'Text-to-SQL', 'label-description': 'Focuses on generating SQL queries from natural language text.', 'confidence': 76.74}
649: README.md · PipableAI/pip-sql-1.3b at main
### Details (similarity score: 0.88)
- [ ] [README.md · PipableAI/pip-sql-1.3b at main](https://huggingface.co/PipableAI/pip-sql-1.3b/blob/main/README.md?code=true)
# README.md · PipableAI/pip-sql-1.3b at main
**DESCRIPTION:**
- license: apache-2.0
- datasets:
- PipableAI/pip-txt-to-sql-spider-bird-dataset
- language:
- en
- metrics:
- accuracy
- tags:
- sql
- code
- text2sql
- instruction_tuned
- basemodel
- jax
- pytorch
- tensorflow
- text-generation-inference
- library_name: transformers
- pipeline_tag: text-generation
- widget:
- text: "CREATE TABLE system(JobID: String,GID: String, UID: String, Start:Time(yyyy/mm/dd), End: Time,ElapsedRaw: Time, CPUTimeRAW: Time,NCPUS: Number,NNodes: Number, NodeList: List, State:String, Timelimit: Time);Get UID and job id for Jobs that started on Jan 20 , 2023 ended on feb 14 2023 and has job id 20"
example_title: "example"
---
# pipSQL-1.3b
[pipableAi](https://www.linkedin.com/company/pipable.ai/about/)
[colab_notebook](https://colab.research.google.com/drive/1insSxvc3jjAXe0zmdIjmbG3ttb5mpRgQ?usp=sharing)
## What have we built?
A 1.3B-parameter SQL model that outperforms most SQL expert models and ChatGPT on popular benchmarks.
This is a distilled model built on the deepseek base model.
## How did we build it?
We used softmax cross-entropy and a modified form of policy gradient together with a Q loss, optimized in an EM setup.
Loss behaviour in the setup described above:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/658d8095a2a6a6e0da8bb8a6/I80Ru1r4thoYrLagIWALa.png)
## Benchmarking:
For benchmarking purposes we use Semantic Evaluation for Text-to-SQL with Distilled Test Suites, an officially accepted evaluation framework for Spider, SParC, and CoSQL proposed by a research team from Yale and Berkeley.
The benchmark contains 2,200 test data points.
Here is the link to run the evaluation:
[Test Suite SQL Eval](https://github.com/taoyds/test-suite-sql-eval)
|model|easy|medium|hard|extra|
|-----|----|------|----|-----|
|sqlcoder-7b-2|72.0|58.0|40.6|37.3|
|pipSQL-1.3b|78.5|57.5|42.1|28.3|
|pipSQL-7b|63.0|40.0|30.2|25.0|
|sqlcoder-7b|60.6|48.2|28.3|20.4|
|gpt-3.5|58.8|44.7|31.0|28.4|
We have also benchmarked it on the Defog eval.
It contains 200 test data points handpicked by the Defog team.
Here is the link to it:
[Defog SQL-Eval](https://github.com/defog-ai/sql-eval)
These are the results -
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64d32c6b921678fdc9de3302/fFeLSEYBNpQk_JWjFsF5M.png)
## License
The model is open source under the Apache 2.0 license.
## Usage
### Installation
```bash
pip install transformers
```
### Prompt
```python
prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
```
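For concreteness, here is a small usage sketch of the tag-delimited template shown above; the schema and question are made up for illustration:

```python
# Illustrative schema and question only; substitute your own.
schema = "CREATE TABLE employees (id number, name text, salary number);"
question = "What is the name of the highest paid employee?"

prompt = f"""<schema>{schema}</schema>
<question>{question}</question>
<sql>"""
print(prompt)
```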
### PyTorch
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "cuda"
model = AutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b")
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
# keep only the SQL emitted between the <sql> ... </sql> markers
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
```
### Flax
```python
from transformers import FlaxAutoModelForCausalLM, AutoTokenizer
device = "cuda"
model = FlaxAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b",from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")
inputs = tokenizer(prompt, return_tensors="jax")
outputs = model.generate(**inputs, max_new_tokens=200)
# Flax generate returns an output object; decode the generated sequences
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
```
### TensorFlow
```python
from transformers import TFAutoModelForCausalLM, AutoTokenizer
device = "cuda"
model = TFAutoModelForCausalLM.from_pretrained("PipableAI/pip-sql-1.3b",from_pt=True)
tokenizer = AutoTokenizer.from_pretrained("PipableAI/pip-sql-1.3b")
inputs = tokenizer(prompt, return_tensors="tf")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True).split('<sql>')[1].split('</sql>')[0])
```
## Examples
### Schema
```sql
CREATE TABLE Products (
product_id number,
parent_product_id number,
product_name text,
product_price number,
product_color text,
product_size text,
product_description text);
CREATE TABLE Customers (
customer_id number,
gender_code text,
customer_first_name text,
customer_middle_initial text,
customer_last_name text,
email_address text,
login_name text,
login_password text,
phone_number text,
address_line_1 text,
town_city text,
county text,
country text);
CREATE TABLE Customer_Payment_Methods (
customer_id number,
payment_method_code text);
CREATE TABLE Invoices (
invoice_number number,
invoice_status_code text,
invoice_date time);
CREATE TABLE Orders (
order_id number,
customer_id number,
order_status_code text,
date_order_placed time);
CREATE TABLE Order_Items (
order_item_id number,
product_id number,
order_id number,
order_item_status_code text);
CREATE TABLE Shipments (
shipment_id number,
order_id number,
invoice_number number,
shipment_tracking_number text,
shipment_date time);
CREATE TABLE Shipment_Items (
shipment_id number,
order_item_id number);
```
### Questions
What are the email address, town and county of the customers who are of the least common gender?
```sql
SELECT email_address , town_city , county FROM customers GROUP BY gender_code ORDER BY count(*) ASC LIMIT 1
```
What are the product price and the product size of the products whose price is above average?
```sql
SELECT product_price , product_size FROM products WHERE product_price > (SELECT avg(product_price) FROM products)
```
Which customers did not make any orders? List the first name, middle initial and last name.
```sql
SELECT T1.customer_first_name , T1.customer_middle_initial , T1.customer_last_name FROM Customers AS T1 WHERE T1.customer_id NOT IN (SELECT T2.customer_id FROM Orders AS T2)
```
### Team
Avi Kothari, Pratham Gupta, Ritvik Aryan Kalra, Rohan Bhatial, Soham Acharya
URL: [PipableAI/pip-sql-1.3b](https://huggingface.co/PipableAI/pip-sql-1.3b/blob/main/README.md?code=true)
#### Suggested labels
####
386: SciPhi/AgentSearch-V1 · Datasets at Hugging Face
### Details (similarity score: 0.86)
- [ ] [SciPhi/AgentSearch-V1 · Datasets at Hugging Face](https://huggingface.co/datasets/SciPhi/AgentSearch-V1)
#### Getting Started
The AgentSearch-V1 dataset is a comprehensive collection of over one billion embeddings, produced using jina-v2-base. It includes more than 50 million high-quality documents and over 1 billion passages, covering a vast range of content from sources such as Arxiv, Wikipedia, Project Gutenberg, and includes carefully filtered Creative Commons (CC) data. Our team is dedicated to continuously expanding and enhancing this corpus to improve the search experience. We welcome your thoughts and suggestions – please feel free to reach out with your ideas!
To access and utilize the AgentSearch-V1 dataset, you can stream it via HuggingFace with the following Python code:
```python
from datasets import load_dataset
import json
import numpy as np
# To stream the entire dataset:
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="**/*", split="train", streaming=True)
# Optional, stream just the "arxiv" dataset
# ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)
# To process the entries:
for entry in ds:
embeddings = np.frombuffer(
entry['embeddings'], dtype=np.float32
).reshape(-1, 768)
text_chunks = json.loads(entry['text_chunks'])
metadata = json.loads(entry['metadata'])
print(f'Embeddings:\n{embeddings}\n\nChunks:\n{text_chunks}\n\nMetadata:\n{metadata}')
break
```
A full set of scripts to recreate the dataset from scratch can be found [here](https://github.com/SciPhi-AI/agent-search/tree/main/scripts). Further, you may check the docs for details on how to perform RAG over AgentSearch.
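For a rough idea of how the stored embeddings can be used for retrieval, here is a minimal sketch, not the official RAG pipeline. It assumes the i-th embedding row corresponds to the i-th text chunk, and the query vector here is a random placeholder standing in for a real jina-v2-base query embedding:

```python
import json
import numpy as np
from datasets import load_dataset

# Stream a single subset to keep the example light.
ds = load_dataset("SciPhi/AgentSearch-V1", data_files="arxiv/*", split="train", streaming=True)

# Placeholder query vector; in practice this would come from the same
# jina-v2-base encoder used to build the corpus (not shown here).
query_vec = np.random.rand(768).astype(np.float32)
query_vec /= np.linalg.norm(query_vec)

best_score, best_chunk = -1.0, None
for i, entry in enumerate(ds):
    embeddings = np.frombuffer(entry["embeddings"], dtype=np.float32).reshape(-1, 768)
    chunks = json.loads(entry["text_chunks"])
    # Cosine similarity of every chunk embedding against the query.
    scores = embeddings @ query_vec / (np.linalg.norm(embeddings, axis=1) + 1e-9)
    j = int(scores.argmax())
    if scores[j] > best_score and j < len(chunks):
        best_score, best_chunk = float(scores[j]), chunks[j]
    if i >= 100:  # scan only a small streamed prefix in this sketch
        break

print(best_score)
print(best_chunk[:200] if best_chunk else "no chunk found")
```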
#### Languages
English.
#### Dataset Structure
The raw dataset structure is as follows:
```json
{
"url": ...,
"title": ...,
"metadata": {"url": "...", "timestamp": "...", "source": "...", "language": "..."},
"text_chunks": ...,
"embeddings": ...,
"dataset": "book" | "arxiv" | "wikipedia" | "stack-exchange" | "open-math" | "RedPajama-Data-V2"
}
```
#### Dataset Creation
This dataset was created as a step towards making humanity's most important knowledge openly searchable and LLM optimal. It was created by filtering, cleaning, and augmenting publicly available datasets.
To cite our work, please use the following:
@software{SciPhi2023AgentSearch,
author = {SciPhi},
title = {AgentSearch [ΨΦ]: A Comprehensive Agent-First Framework and Dataset for Webscale Search},
year = {2023},
url = {https://github.com/SciPhi-AI/agent-search}
}
#### Source Data
@ONLINE{wikidump,
author = "Wikimedia Foundation",
title = "Wikimedia Downloads",
url = "https://dumps.wikimedia.org"
}
@misc{paster2023openwebmath,
title={OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text},
author={Keiran Paster and Marco Dos Santos and Zhangir Azerbayev and Jimmy Ba},
year={2023},
eprint={2310.06786},
archivePrefix={arXiv},
primaryClass={cs.AI}
}
@software{together2023redpajama,
author = {Together Computer},
title = {RedPajama: An Open Source Recipe to Reproduce LLaMA training dataset},
month = April,
year = 2023,
url = {https://github.com/togethercomputer/RedPajama-Data}
}
#### License
Please refer to the licenses of the data subsets you use.
- Open-Web ([Common Crawl Foundation Terms of Use](https://commoncrawl.org/terms-of-use/))
- Books: the_pile_books3 license and pg19 license
- ArXiv Terms of Use
- Wikipedia License
- StackExchange license on the Internet Archive
#### Suggested labels
#### { "key": "knowledge-dataset", "value": "A dataset with one billion embeddings from various sources, such as Arxiv, Wikipedia, Project Gutenberg, and carefully filtered Creative Commons data" }
333: Paper Digest: NeurIPS-2023 Highlights (Full List)
### Details (similarity score: 0.86)
- [ ] [Paper Digest: NeurIPS-2023 Highlights (Full List)](https://www.paperdigest.org/data/neurips-2023-full.html)
Paper Digest: NeurIPS 2023 Highlights
https://www.paperdigest.org
1, Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick; Jane Dwivedi-Yu; Roberto Dessi; Roberta Raileanu; Maria Lomeli; Eric Hambro; Luke Zettlemoyer; Nicola Cancedda; Thomas Scialom;
Highlight: In this paper, we show that LMs can teach themselves to *use external tools* via simple APIs and achieve the best of both worlds.
2, Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan; Niket Tandon; Prakhar Gupta; Skyler Hallinan; Luyu Gao; Sarah Wiegreffe; Uri Alon; Nouha Dziri; Shrimai Prabhumoye; Yiming Yang; Shashank Gupta; Bodhisattwa Prasad Majumder; Katherine Hermann; Sean Welleck; Amir Yazdanbakhsh; Peter Clark;
Highlight: Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement.
3, Vicuna Evaluation: Exploring LLM-as-a-Judge and Chatbot Arena
Lianmin Zheng; Wei-Lin Chiang; Ying Sheng; Siyuan Zhuang; Zhanghao Wu; Yonghao Zhuang; Zi Lin; Zhuohan Li; Dacheng Li; Eric Xing; Hao Zhang; Joseph Gonzalez; Ion Stoica;
Highlight: To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them.
#### Suggested labels
#### { "key": "LLM-Applications", "value": "Topics related to practical applications of Large Language Models in various fields" }
309: openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"
### Details (similarity score: 0.85)
- [ ] [openai/human-eval: Code for the paper "Evaluating Large Language Models Trained on Code"](https://github.com/openai/human-eval)
HumanEval: Hand-Written Evaluation Set
This is an evaluation harness for the HumanEval problem solving dataset described in the paper "Evaluating Large Language Models Trained on Code".
Installation
Make sure to use python 3.7 or later:
$ conda create -n codex python=3.7
$ conda activate codex
Check out and install this repository:
$ git clone https://github.com/openai/human-eval
$ pip install -e human-eval
Usage
This program exists to run untrusted model-generated code. Users are strongly encouraged not to do so outside of a robust security sandbox. The execution call in execution.py is deliberately commented out to ensure users read this disclaimer before running code in a potentially unsafe manner. See the comment in execution.py for more information and instructions.
After following the above instructions to enable execution, generate samples and save them in the following JSON Lines (jsonl) format, where each sample is formatted into a single line like so:
{"task_id": "Corresponding HumanEval task ID", "completion": "Completion only without the prompt"}
We provide example_problem.jsonl and example_solutions.jsonl under data to illustrate the format and help with debugging.
Here is nearly functional example code (you just have to provide generate_one_completion to make it work) that saves generated completions to samples.jsonl.
from human_eval.data import write_jsonl, read_problems
problems = read_problems()
num_samples_per_task = 200
samples = [
dict(task_id=task_id, completion=generate_one_completion(problems[task_id]["prompt"]))
for task_id in problems
for _ in range(num_samples_per_task)
]
write_jsonl("samples.jsonl", samples)
To evaluate the samples, run
$ evaluate_functional_correctness samples.jsonl
Reading samples...
32800it [00:01, 23787.50it/s]
Running test suites...
100%|...| 32800/32800 [16:11<00:00, 33.76it/s]
Writing results to samples.jsonl_results.jsonl...
100%|...| 32800/32800 [00:00<00:00, 42876.84it/s]
{'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
This script provides more fine-grained information in a new file ending in _results.jsonl. Each row now contains whether the completion passed along with the execution result which is one of "passed", "timed out", or "failed".
As a quick sanity-check, the example samples should yield 0.5 pass@1.
$ evaluate_functional_correctness data/example_samples.jsonl --problem_file=data/example_problem.jsonl
Reading samples...
6it [00:00, 3397.11it/s]
Running example suites...
100%|...| 6/6 [00:03<00:00, 1.96it/s]
Writing results to data/example_samples.jsonl_results.jsonl...
100%|...| 6/6 [00:00<00:00, 6148.50it/s]
{'pass@1': 0.4999999999999999}
Because there is no unbiased way of estimating pass@k when there are fewer samples than k, the script does not evaluate pass@k for these cases. To evaluate with other k values, pass a comma-separated list via --k. For other options, see
$ evaluate_functional_correctness --help
However, we recommend that you use the default values for the rest.
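For reference, the unbiased pass@k estimator described in the accompanying paper can be sketched in a few lines. The repo ships its own implementation; this standalone version is only for illustration:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n samples of which c are correct."""
    if n - c < k:
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

print(pass_at_k(n=200, c=10, k=1))    # 0.05, i.e. c / n
print(pass_at_k(n=200, c=10, k=100))  # close to 1.0
```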
Known Issues
While evaluation uses very little memory, you might see the following error message when the system is running out of RAM. Since this may cause some correct programs to fail, we recommend that you free some memory and try again.
malloc: can't allocate region
Citation
Please cite using the following bibtex entry:
@article{chen2021codex,
title={Evaluating Large Language Models Trained on Code},
author={Mark Chen and Jerry Tworek and Heewoo Jun and Qiming Yuan and Henrique Ponde de Oliveira Pinto and Jared Kaplan and Harri Edwards and Yuri Burda and Nicholas Joseph and Greg Brockman and Alex Ray and Raul Puri and Gretchen Krueger and Michael Petrov and Heidy Khlaaf and Girish Sastry and Pamela Mishkin and Brooke Chan and Scott Gray and Nick Ryder and Mikhail Pavlov and Alethea Power and Lukasz Kaiser and Mohammad Bavarian and Clemens Winter and Philippe Tillet and Felipe Petroski Such and Dave Cummings and Matthias Plappert and Fotios Chantzis and Elizabeth Barnes and Ariel Herbert-Voss and William Hebgen Guss and Alex Nichol and Alex Paino and Nikolas Tezak and Jie Tang and Igor Babuschkin and Suchir Balaji and Shantanu Jain and William Saunders and Christopher Hesse and Andrew N. Carr and Jan Leike and Josh Achiam and Vedant Misra and Evan Morikawa and Alec Radford and Matthew Knight and Miles Brundage and Mira Murati and Katie Mayer and Peter Welinder and Bob McGrew and Dario Amodei and Sam McCandlish and Ilya Sutskever and Wojciech Zaremba},
year={2021},
eprint={2107.03374},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
#### Suggested labels
#### { "key": "llm-evaluation", "value": "Evaluating Large Language Models performance and behavior through human-written evaluation sets" }
### Details (similarity score: 0.85)
- [ ] [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)
# TITLE: cohereai_classify table | CohereAI plugin | Steampipe Hub
**DESCRIPTION:**
steampipe plugin install mr-destructive/cohereai
**cohereai_classify**
cohereai_detect_language
cohereai_detokenize
cohereai_embed
cohereai_generation
cohereai_summaraize
cohereai_summarize
cohereai_tokenize
**Table: cohereai_classify**
Get classification for a given input strings and examples.
Notes:
- inputs is a list of strings to classify (max 96 strings).
- examples is a list of objects of type Example, each of the form {"text": "apple", "label": "fruit"}.
- A minimum of 2 and a maximum of 2,500 examples should be provided, with each example limited to 512 tokens.
**Examples**
Basic classification with given set of inputs and examples
```sql
select
classification
from
cohereai_classify
where
inputs = '["apple", "blue", "pineapple"]'
and examples = '[{"text": "apple", "label": "fruit"}, {"text": "green", "label": "color"}, {"text": "grapes", "label": "fruit"}, {"text": "purple", "label": "color"}]';
```
Classification with specific settings(model, preset)
```sql
select
classification
from
cohereai_classify
where
settings = '{"model": "embed-multilingual-v2.0"}'
and inputs = '["Help!", "Call me when you can"]'
and examples = '[{"text": "Help!", "label": "urgent"}, {"text": "SOS", "label": "urgent"}, {"text": "Call me when you can", "label": "not urgent"}, {"text": "Talk later?", "label": "not urgent"}]';
```
Email Spam Classification
```sql
select
classification
from
cohereai_classify
where
inputs = '["Confirm your email address", "hey i need u to send some $"]'
and examples = '[{"label": "Spam", "text": "Dermatologists don''t like her!"}, {"label": "Spam", "text": "Hello, open to this?"}, {"label": "Spam", "text": "I need help please wire me $1000 right now"}, {"label": "Spam", "text": "Hot new investment, don''t miss this!"}, {"label": "Spam", "text": "Nice to know you ;)"}, {"label": "Spam", "text": "Please help me?"}, {"label": "Not spam", "text": "Your parcel will be delivered today"}, {"label": "Not spam", "text": "Review changes to our Terms and Conditions"}, {"label": "Not spam", "text": "Weekly sync notes"}, {"label": "Not spam", "text": "Re: Follow up from today''s meeting"}, {"label": "Not spam", "text": "Pre-read for tomorrow"}]';
```
**Schema for cohereai_classify**
| Name | Type | Operators | Description |
|---------------|------------------|-----------|--------------------------------------------------------------|
| _ctx | jsonb | | Steampipe context in JSON form, e.g. connection_name. |
| classification| text | | The classification results for the given input text(s). |
| confidence | double precision | | The confidence score of the classification. |
| examples | text | | The example text classified. |
| id | text | | The ID of the classification. |
| inputs | text | | The input text that was classified. |
| labels | jsonb | | The labels of the classification. |
| settings | jsonb | | Settings is a JSONB object that accepts any of the classify API request parameters. |
**URL:** [cohereai_classify table | CohereAI plugin | Steampipe Hub](https://hub.steampipe.io/plugins/mr-destructive/cohereai/tables/cohereai_classify)
#### Suggested labels
####