sail-sg / symbolic-instruction-tuning

The official repository for the paper "From Zero to Hero: Examining the Power of Symbolic Tasks in Instruction Tuning".
MIT License

Dataset generation script #3

Open imoneoi opened 10 months ago

imoneoi commented 10 months ago

Can you share your dataset generation script for symbolic SQL data? I found some invalid SQL and wanted to improve it.

Some column names contain spaces and are used unquoted in the query, which is invalid SQL, as shown in the example below.

[ header: no. | country | 2009 winter universiade | 2007 wjcc | 2007 wwcc | 2008 wjcc | 2008 wwcc | points row 1 : 1 | canada | 24 | 12 | 9 | 12 | 10 | 67
row 2 : 2 | china | 28 | None | 14 | 4 | 6 | 52
row 3 : 3 | sweden | 10 | 5 | 12 | 14 | 9 | 50
row 4 : 4 | great britain | 16 | 14 | 5 | 1 | 12 | 48
row 5 : 5 | russia | 20 | 8 | 6 | 6 | 5 | 45
row 6 : 6 | united states | 4 | 6 | 4 | 10 | 8 | 32
row 7 : 7 | switzerland | None | 10 | 8 | 8 | 3 | 29
row 8 : 8 | germany | None | None | 7 | 2 | 14 | 23
row 9 : 9 | denmark | None | 3 | 10 | None | 7 | 20
row 10 : 10 | czech republic | 12 | 4 | None | 3 | None | 19
row 11 : 11 | south korea | 8 | None | 3 | None | None | 11
row 12 : 12 | japan | 6 | 1 | None | None | 2 | 9
row 13 : 13 | france | None | 2 | None | 5 | None | 7
row 14 : 14 | norway | None | None | 2 | None | 4 | 6
row 15 : 15 | poland | 2 | None | None | None | None | 2
row 16 : 16 | italy | None | None | 1 | None | None | 1
row 17 : 17 | latvia | None | None | None | None | 1 | 1
row 18 : None | turkey (host) | None | None | None | None | None | 0 ] Execute this SQL based on the above table: select country where 2007 wwcc = ( select min ( 2007 wwcc ) )
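
To make the problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a hypothetical table name `w` (the sample names no table) and loads only a few rows and two columns from the table above. The query from the sample fails to parse, while a variant that double-quotes the column name and adds a `from` clause (which the sample also lacks) runs. This only illustrates the issue; it is not the dataset's generation code.

```python
import sqlite3


def quote_ident(name: str) -> str:
    """Wrap an SQL identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


conn = sqlite3.connect(":memory:")

# Hypothetical table name "w"; only two of the sample's columns are kept.
conn.execute(
    f"create table w ({quote_ident('country')} text, "
    f"{quote_ident('2007 wwcc')} integer)"
)
conn.executemany(
    "insert into w values (?, ?)",
    [("canada", 9), ("china", 14), ("italy", 1), ("poland", None)],
)

# The query exactly as it appears in the sample.
bad_sql = "select country where 2007 wwcc = ( select min ( 2007 wwcc ) )"
try:
    conn.execute(bad_sql)
    bad_ok = True
except sqlite3.Error:
    # SQLite rejects the statement before executing it.
    bad_ok = False

# Same intent, with the column quoted and an explicit FROM clause.
col = quote_ident("2007 wwcc")
good_sql = f"select country from w where {col} = ( select min ( {col} ) from w )"
result = conn.execute(good_sql).fetchall()

print(bad_ok, result)
```

Double-quoted identifiers are standard SQL and are accepted by SQLite; other engines may prefer backticks or brackets, so the quoting helper would need adjusting there.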
SivilTaram commented 10 months ago

Hi @imoneoi, thanks for your interest in our work! Sure, I'd be happy to share the dataset generation script. I used the script at https://github.com/microsoft/Table-Pretraining/tree/main/data_generator to synthesize the dataset. I'm still trying to build a clean repo that synthesizes SQL queries from any table in CSV format - but it may still require some time 😂
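
As a rough idea of what synthesizing queries from a CSV table could look like, here is a small sketch. It is not the Table-Pretraining generator: the table name `w`, the single query template, and the helper names `load_csv` and `fill_template` are all assumptions made for illustration. It loads a CSV into an in-memory SQLite table, quotes every identifier so that column names with spaces stay valid, and executes the filled template to obtain the answer.

```python
import csv
import io
import sqlite3


def quote_ident(name: str) -> str:
    """Wrap an SQL identifier in double quotes, doubling any embedded quote."""
    return '"' + name.replace('"', '""') + '"'


def coerce(value: str):
    """Turn empty cells into NULL and integer-looking cells into ints."""
    if value == "":
        return None
    return int(value) if value.lstrip("-").isdigit() else value


def load_csv(conn: sqlite3.Connection, csv_text: str, table: str = "w") -> list:
    """Create `table` from CSV text (first row is the header); return the header."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    cols = ", ".join(quote_ident(h) for h in header)
    conn.execute(f"create table {quote_ident(table)} ({cols})")
    marks = ", ".join("?" for _ in header)
    conn.executemany(
        f"insert into {quote_ident(table)} values ({marks})",
        [[coerce(v) for v in row] for row in body],
    )
    return header


# A single hypothetical template; a real generator would use many.
TEMPLATE = "select {c1} from {t} where {c2} = ( select min ( {c2} ) from {t} )"


def fill_template(select_col: str, filter_col: str, table: str = "w") -> str:
    """Instantiate the template with quoted identifiers."""
    return TEMPLATE.format(
        c1=quote_ident(select_col), c2=quote_ident(filter_col), t=quote_ident(table)
    )


# A few rows taken from the table quoted in this issue.
csv_text = "country,2007 wwcc,points\ncanada,9,67\nchina,14,52\nitaly,1,1\n"

conn = sqlite3.connect(":memory:")
header = load_csv(conn, csv_text)
sql = fill_template("country", "2007 wwcc")
answer = conn.execute(sql).fetchall()
print(sql)
print(answer)
```

Executing each synthesized query before writing it to the dataset also acts as a validity check, since any query the engine rejects can simply be dropped.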