microsoft / IRNet

An algorithm for cross-domain NL2SQL
MIT License

Preprocessing question: keys "col_set" and "names" in spider dataset #38

Closed anshudaur closed 4 years ago

anshudaur commented 4 years ago

Hi all,

I am trying to preprocess the WikiSQL dataset into SemQL format, but I do not understand what "col_set" and "names" correspond to. How do the keys "col_set" and "names" get their values from the Spider dataset? Knowing this will help me produce the same values for the WikiSQL dataset.

Thanks, Anshu

jaydeepb-inexture commented 4 years ago

col_set and names both represent the column names of the database identified by "db_id". There is a slight difference between them: names lists all the column names, while col_set contains only the unique column names.

For example: "col_set": ["*", "stadium id", "location", "name", "capacity", "highest", "lowest", "average", "singer id", "country", "song name", "song release year", "age", "is male", "concert id", "concert name", "theme", "year"],

"names": ["*", "stadium id", "location", "name", "capacity", "highest", "lowest", "average", "singer id", "name", "country", "song name", "song release year", "age", "is male", "concert id", "concert name", "theme", "stadium id", "year", "concert id", "singer id"],

The col_set and names above are for "db_id": "concert_singer". In names, "name", "stadium id", "concert id", and "singer id" each appear twice, whereas every entry in col_set is unique.
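To make the relationship concrete, here is a minimal sketch that derives col_set from names by dropping repeats while keeping first-seen order. The helper name build_col_set is hypothetical, and this is not necessarily how IRNet's preprocessing computes it; it only reproduces the concert_singer example above.

```python
def build_col_set(names):
    """Return the unique column names in first-seen order."""
    seen = set()
    col_set = []
    for name in names:
        if name not in seen:
            seen.add(name)
            col_set.append(name)
    return col_set


# "names" for db_id "concert_singer", copied from the example above.
names = ["*", "stadium id", "location", "name", "capacity", "highest",
         "lowest", "average", "singer id", "name", "country", "song name",
         "song release year", "age", "is male", "concert id", "concert name",
         "theme", "stadium id", "year", "concert id", "singer id"]

col_set = build_col_set(names)
```

Running this on the names list gives the same 18 entries, in the same order, as the col_set shown above.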

In Spider, col_set holds only the column name values. For WikiSQL, you can build col_set from that dataset's column names in the same way. @anshudaur maybe this helps you.

anshudaur commented 4 years ago

@jaydeepb-ddit: Thanks for your reply. I am actually stuck because the columns of every table in WikiSQL are named col1, col2, etc., and there are only two data types (real, text).

I think I will need the actual column names, but WikiSQL provides the results with generic names (col1, col2, ...). My question now is: how should I proceed if the actual column names are not provided in the WikiSQL dataset?

Thanks, Anshu

jaydeepb-inexture commented 4 years ago

The WikiSQL dataset includes a train.tables.json file whose header field lists the column names of each table, so you can take the column names from there.

See the table section at https://github.com/salesforce/WikiSQL: "header: a list of column names in the table."
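As a rough sketch of that suggestion, the snippet below reads table records and builds Spider-style "names" and "col_set" lists from each "header". It assumes the tables file is in JSON Lines form (one JSON object per line, with "id" and "header" fields); the WikiSQL release I am aware of ships it as train.tables.jsonl, so adjust the path to whatever your copy uses. The function names, the lowercasing, and the leading "*" entry are illustrative assumptions that mimic the Spider example earlier in this thread, not IRNet's actual preprocessing.

```python
import json


def load_wikisql_headers(lines):
    """Map each table id to its list of column names (the "header" field)."""
    headers = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        table = json.loads(line)
        headers[table["id"]] = table["header"]
    return headers


def to_spider_style(header):
    """Build hypothetical Spider-like "names" and "col_set" for one table."""
    # Assumption: lowercase names and a leading "*", as in the Spider example.
    names = ["*"] + [h.lower() for h in header]
    # dict.fromkeys keeps first-seen order while dropping duplicates.
    col_set = list(dict.fromkeys(names))
    return {"names": names, "col_set": col_set}


# Made-up record in the assumed format; real data would come from the file.
sample = [
    '{"id": "1-example", "header": ["Player", "No.", "Position"], '
    '"types": ["text", "text", "text"], "rows": []}',
]
headers = load_wikisql_headers(sample)
entry = to_spider_style(headers["1-example"])
```

With the real data, pass an open file handle instead of the sample list, for example `with open(path) as f: headers = load_wikisql_headers(f)`. Because each WikiSQL question refers to a single table, names and col_set will usually be identical unless a header repeats a column name.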

anshudaur commented 4 years ago

@jaydeepb-ddit : Thanks :)