jkkummerfeld / text2sql-data

A collection of datasets that pair questions with SQL queries.
http://jkk.name/text2sql-data/

data in spider format #54

Closed mousaazari closed 9 months ago

mousaazari commented 1 year ago

Hi, could you please share any tips for constructing the imdb, scholar, and yelp datasets in Spider format (including tables.json, train.json, and dev.json)? Thank you.

jkkummerfeld commented 9 months ago

@mousaazari Just saw you closed this. Were you able to do the conversion / construction? If so, I'd be glad to incorporate it into the repository.

Sorry for not responding sooner!

mousaazari commented 9 months ago

After analyzing the related files, I found that Spider already includes similar datasets. The total number of instances differs slightly, which was negligible in my case. So the required information can be extracted from Spider's JSON files. I used the following script to extract the IMDB subset from the Spider dataset:

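import json

dataset_name = 'imdb'

# Load the Spider examples (a JSON list of entries, each with a 'db_id').
with open('spider.json', 'r') as infile:
    dataset = json.load(infile)

# Keep only the entries whose db_id matches the target database exactly.
new_dataset = [entry for entry in dataset if entry['db_id'] == dataset_name]

# Write the subset to e.g. imdb.json.
with open(dataset_name + '.json', 'w') as outfile:
    json.dump(new_dataset, outfile, indent=4)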
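
In case it helps others, here is a rough sketch of how the same filter could be extended to all three databases (imdb, scholar, yelp) and to the schema file in one pass. It assumes that tables.json is, like the examples file, a JSON list of objects that each carry a `db_id` field. The helper names `split_by_db` and `write_subsets` and the output file names are made up for illustration, and the input paths are placeholders for wherever your Spider files live.

```python
import json
from collections import defaultdict
from pathlib import Path


def split_by_db(entries, db_ids):
    """Group entries by their 'db_id', keeping only the requested databases."""
    wanted = set(db_ids)
    grouped = defaultdict(list)
    for entry in entries:
        # Exact membership test against a set of names, not a substring check.
        if entry["db_id"] in wanted:
            grouped[entry["db_id"]].append(entry)
    return dict(grouped)


def write_subsets(examples_path, tables_path, db_ids, out_dir="."):
    """Write <db_id>.json and <db_id>_tables.json for each requested database.

    Assumption: both input files are JSON lists whose items have a 'db_id' key.
    """
    with open(examples_path, "r") as f:
        examples_by_db = split_by_db(json.load(f), db_ids)
    with open(tables_path, "r") as f:
        tables_by_db = split_by_db(json.load(f), db_ids)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for db_id in db_ids:
        # A database with no matching entries yields an empty list.
        with open(out / f"{db_id}.json", "w") as f:
            json.dump(examples_by_db.get(db_id, []), f, indent=4)
        with open(out / f"{db_id}_tables.json", "w") as f:
            json.dump(tables_by_db.get(db_id, []), f, indent=4)
```

For example, `write_subsets('spider.json', 'tables.json', ['imdb', 'scholar', 'yelp'], 'subsets')` would write six files under `subsets/`. Splitting each subset into train.json and dev.json is left out, since how to divide the examples depends on your setup.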

Thank you for all the helpful information in this repo.