microsoft / IRNet

An algorithm for cross-domain NL2SQL
MIT License
264 stars 81 forks source link

Need help to preprocess wikisql dataset to SemQL format? #36

Closed anshudaur closed 4 years ago

anshudaur commented 4 years ago

HI , I am trying to convert wikisql dataset to SemQL format to test IRNET on other data sets. And test other models on the SEMQL representation of nl and sql query to justify if intermediate format can actually help.

Has anyone tried converting other datasets to SEMQL format to test IRNet on other datasets? If not, can the authors please suggest how can i do so apart from manually mapping the json differences and try to convert to semql ?

Any help is appreciated.

Thanks and Best Regards Anshu

jaydeepb-inexture commented 4 years ago

Please refer this https://github.com/taoyds/spider/blob/master/preprocess/parse_sql_one.py . Follow the instruction mentioned their.You get idea how to do this.

anshudaur commented 4 years ago

HI @jaydeepb-ddit , Thanks. But my question is more related to semantics part(semQL) that gets generated after preprocessing the queries. For example, in spider after preprocessing to semql, we get a kew value pair of type :

"names": ["*", "department id", "name", "creation", "ranking", "budget in billions", "num employees", "head id", "name", "born state", "age", "department id", "head id", "temporary acting"]

since in wikisql, the column name is not meaningful, what do i identify as the names field for a corresponding wikisql query? Some fields are meaningful but others are not so much like names : image

anshudaur commented 4 years ago

HI @jaydeepb-ddit , Thanks. But my question is more related to semantics part(semQL) that gets generated after preprocessing the queries. For example, in spider after preprocessing to semql, we get a kew value pair of type :

"names": ["*", "department id", "name", "creation", "ranking", "budget in billions", "num employees", "head id", "name", "born state", "age", "department id", "head id", "temporary acting"]

since in wikisql, the column name is not meaningful, what do i identify as the names field for a corresponding wikisql query? Some fields are meaningful but others are not so much like names : image

The empty fields are the onse which are present and common for both wikisql and spider

anshudaur commented 4 years ago

Query for wiki sql and spider -

wikisql -tokenized_train.jsonl

json----------- {"phase":1,"question":"Tell me what the notes are for South Australia ","sql":{"sel":5,"conds":[[3,0,"SOUTH AUSTRALIA"]],"agg":0},"table_id":"1-1000181-1","tokenizedquery":["SELECT","notes","FROM","table","WHERE","current","slogan","EQL","south","australia"],"query":"SELECT notes FROM table_ WHERE current slogan EQL south australia","tokenized_question":["tell","me","what","the","notes","are","for","south","australia"]}

tables.jsonl------------ {"caption":null,"header":["State\/territory","Text\/background colour","Format","Current slogan","Current series","Notes"],"id":"1-1000181-1","name":"table_1000181_1","page_id":null,"page_title":null,"rows":[["Australian Capital Territory","blue\/white","Yaa\u00b7nna","ACT \u00b7 CELEBRATION OF A CENTURY 2013","YIL\u00b700A","Slogan screenprinted on plate"],["New South Wales","black\/yellow","aa\u00b7nn\u00b7aa","NEW SOUTH WALES","BX\u00b799\u00b7HI","No slogan on current series"],["New South Wales","black\/white","aaa\u00b7nna","NSW","CPX\u00b712A","Optional white slimline series"],["Northern Territory","ochre\/white","Ca\u00b7nn\u00b7aa","NT \u00b7 OUTBACK AUSTRALIA","CB\u00b706\u00b7ZZ","New series began in June 2011"],["Queensland","maroon\/white","nnn\u00b7aaa","QUEENSLAND \u00b7 SUNSHINE STATE","999\u00b7TLG","Slogan embossed on plate"],["South Australia","black\/white","Snnn\u00b7aaa","SOUTH AUSTRALIA","S000\u00b7AZD","No slogan on current series"],["Victoria","blue\/white","aaa\u00b7nnn","VICTORIA - THE PLACE TO BE","ZZZ\u00b7562","Current series will be exhausted this year"]],"section_title":null,"types":["text","text","text","text","text","text"],"tokenized_header":[["state\/territory"],["text\/background","colour"],["format"],["current","slogan"],["current","series"],["notes"]]}

Missing fields in wikisql : select,col_set, db_id , names, keys, table_names : single table id, sql (fields are different)

spider initial train_spider_processed ------------- {"db_id": "department_management", "query": "SELECT count() FROM head WHERE age > 56", "query_toks": ["SELECT", "count", "(", "", ")", "FROM", "head", "WHERE", "age", ">", "56"], "query_toks_no_value": ["select", "count", "(", "", ")", "from", "head", "where", "age", ">", "value"], "question": "How many heads of the departments are older than 56 ?", "question_toks": ["how", "many", "head", "of", "department", "are", "older", "than", "56", "?"], "sql": {"except": null, "from": {"conds": [], "table_units": [["table_unit", 1]]}, "groupBy": [], "having": [], "intersect": null, "limit": null, "orderBy": [], "select": [false, [[3, [0, [0, 0, false], null]]]], "union": null, "where": [[false, 3, [0, [0, 10, false], null], 56.0, null]]}, "names": ["", "department id", "name", "creation", "ranking", "budget in billions", "num employees", "head id", "name", "born state", "age", "department id", "head id", "temporary acting"], "table_names": ["department", "head", "management"], "col_set": ["*", "department id", "name", "creation", "ranking", "budget in billions", "num employees", "head id", "born state", "age", "temporary acting"], "col_table": [-1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2], "keys": {"12": 7, "7": 7, "11": 11, "1": 1}, "origin_question_toks": ["How", "many", "heads", "of", "the", "departments", "are", "older", "than", "56", "?"], "question_arg": [["how"], ["many"], ["head"], ["of"], ["department"], ["are"], ["older"], ["than"], ["56"], ["?"]], "question_arg_type": [["NONE"], ["NONE"], ["table"], ["NONE"], ["table"], ["NONE"], ["MORE"], ["NONE"], ["value"], ["NONE"]], "nltk_pos": [["how", "WRB"], ["many", "JJ"], ["head", "NN"], ["of", "IN"], ["department", "NN"], ["are", "VBP"], ["older", "JJR"], ["than", "IN"], ["56", "CD"], ["?", "."]]}