data in spider format

mousaazari commented 1 year ago

Hi, can you please share any tips for constructing the imdb, scholar, and yelp datasets in Spider format (including tables.json, train.json, and dev.json)? Thank you.

jkkummerfeld commented 9 months ago

@mousaazari Just saw you closed this. Were you able to do the conversion / construction? If so, I'd be glad to incorporate it into the repository.

Sorry for not responding sooner!

mousaazari commented 9 months ago

After analyzing the related files, I found that the Spider dataset already contains similar datasets. The total number of instances differs slightly, which was negligible in my case. The required information can therefore be extracted from Spider's JSON files. I used the following script to extract the IMDB subset from the Spider dataset:

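import json

dataset_name = 'imdb'

# Load the Spider examples (a JSON list of entries, each with a 'db_id').
with open('spider.json', 'r') as infile:
    dataset = json.load(infile)

# Compare with == so that only an exact db_id match is kept;
# `in (dataset_name)` would be a substring test on the string.
new_dataset = [entry for entry in dataset if entry['db_id'] == dataset_name]

# Write the filtered entries to imdb.json.
with open(dataset_name + '.json', 'w') as outfile:
    json.dump(new_dataset, outfile, indent=4)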
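
The same filter can probably be applied to the schema file as well, since the entries in Spider's tables.json also carry a `db_id` field. Below is a minimal sketch under that assumption, not a tested conversion pipeline. The function names `filter_by_db_id` and `extract_subset`, the `example_files` mapping, and the output layout are hypothetical and chosen only for illustration.

```python
import json
from pathlib import Path


def filter_by_db_id(entries, db_id):
    """Return the entries whose 'db_id' equals db_id exactly."""
    return [entry for entry in entries if entry.get("db_id") == db_id]


def extract_subset(db_id, example_files, tables_file, out_dir):
    """Write one filtered JSON file per split, plus a filtered tables.json.

    example_files maps a split name (e.g. 'train') to the path of a
    Spider-style JSON list of examples. Returns the number of entries
    written per output file.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    counts = {}

    # Filter every example file and save it as <split>.json.
    for split, path in example_files.items():
        with open(path, "r", encoding="utf-8") as infile:
            subset = filter_by_db_id(json.load(infile), db_id)
        with open(out_dir / f"{split}.json", "w", encoding="utf-8") as outfile:
            json.dump(subset, outfile, indent=4)
        counts[split] = len(subset)

    # Assumption: the schema file is a JSON list whose entries have 'db_id'.
    with open(tables_file, "r", encoding="utf-8") as infile:
        tables = filter_by_db_id(json.load(infile), db_id)
    with open(out_dir / "tables.json", "w", encoding="utf-8") as outfile:
        json.dump(tables, outfile, indent=4)
    counts["tables"] = len(tables)

    return counts
```

For example, `extract_subset('imdb', {'train': 'spider.json'}, 'tables.json', 'imdb_spider')` would write `imdb_spider/train.json` and `imdb_spider/tables.json`. Producing a separate dev.json would still require choosing which examples to hold out, which this sketch does not attempt.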

Thank you for all the helpful information in this repo.

jkkummerfeld / text2sql-data

data in spider format #54