salesforce / TabularSemanticParsing

Translating natural language questions to a structured query language
https://arxiv.org/abs/2012.12627
BSD 3-Clause "New" or "Revised" License
223 stars 51 forks

Failed to load other language data. #25

Open leon2milan opened 3 years ago

leon2milan commented 3 years ago

This is a Chinese NL2SQL dataset: https://github.com/ZhuiyiTechnology/TableQA. It has the same format as WikiSQL, with a few small differences.
For example, "sel": [2] wraps the value in a list. When I load this dataset, I get an error. I changed the tokenizer to bert_base_chinese, but it still does not work. So what can I do to fine-tune your model on a Chinese NL2SQL dataset? Thank you very much!

todpole3 commented 3 years ago

Try replacing the BERT model we used with a multilingual LM such as mBERT or XLM-R. Both can be loaded the same way through the Hugging Face transformers library.
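A minimal sketch of what that swap could look like, assuming the public Hugging Face checkpoints `bert-base-multilingual-cased` (mBERT) and `xlm-roberta-base` (XLM-R). The names `MULTILINGUAL_CHECKPOINTS` and `load_encoder` are hypothetical helpers for illustration and are not part of this repository; where the encoder is actually constructed in the codebase is not shown here.

```python
# Hypothetical helper: maps a short name to a public Hugging Face checkpoint.
MULTILINGUAL_CHECKPOINTS = {
    "mbert": "bert-base-multilingual-cased",
    "xlm-r": "xlm-roberta-base",
}


def load_encoder(name):
    """Load tokenizer and encoder for one of the checkpoints above.

    The import is local so this file can be read and tested without
    transformers installed; calling the function downloads the weights.
    """
    from transformers import AutoModel, AutoTokenizer

    checkpoint = MULTILINGUAL_CHECKPOINTS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model


if __name__ == "__main__":
    tokenizer, model = load_encoder("mbert")
    # Chinese text is split into sub-word pieces by the multilingual vocabulary.
    print(tokenizer.tokenize("世茂茂悦府的套均面积是多少?"))
```

Note that XLM-R uses `<s>` and `</s>` as special tokens rather than BERT's `[CLS]` and `[SEP]`, so any code that hard-codes the BERT tokens may need adjusting after the swap.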

leon2milan commented 3 years ago

@todpole3 Thanks. In my dataset, some tables have duplicated header names; I have already fixed this. There are also two differences from WikiSQL. First, the sel and agg items are lists.

{
    "table_id": "a1b2c3d4",  # related table id
    "question": "世茂茂悦府的套均面积是多少?",  # question
    "sql": {  # SQL
        "sel": [7, 8],  # selected columns
        "agg": [0, 1],  # aggregate functions
        "cond_conn_op": 0,  # connector between conditions
        "conds": [
            [1, 2, "世茂茂悦府"]  # condition column, condition type, condition value: col_1 == "世茂茂悦府"
        ]
    }
}

Second, the agg and op codes are represented differently:

op_sql_dict = {0:">", 1:"<", 2:"==", 3:"!="}
agg_sql_dict = {0:"", 1:"AVG", 2:"MAX", 3:"MIN", 4:"COUNT", 5:"SUM"}
conn_sql_dict = {0:"", 1:"and", 2:"or"}
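One possible way to bridge these two differences is to convert each TableQA query into WikiSQL-style indices before loading. The sketch below assumes WikiSQL's usual ordering (aggregations `['', 'MAX', 'MIN', 'COUNT', 'SUM', 'AVG']`, operators `['=', '>', '<']`); `to_wikisql` is a hypothetical helper and not part of this repository. Queries that the single-column, AND-only WikiSQL format cannot express (multiple selected columns, `or` connectors, `!=`) are returned as `None` so they can be filtered out or handled separately.

```python
# TableQA code tables, as listed above.
TABLEQA_AGG = {0: "", 1: "AVG", 2: "MAX", 3: "MIN", 4: "COUNT", 5: "SUM"}
TABLEQA_OP = {0: ">", 1: "<", 2: "==", 3: "!="}

# Assumed WikiSQL ordering; list position is the WikiSQL index.
WIKISQL_AGG = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
WIKISQL_OP = ["=", ">", "<"]


def to_wikisql(sql):
    """Convert a TableQA 'sql' dict to WikiSQL-style indices, or None."""
    # WikiSQL has exactly one selected column and one aggregation.
    if len(sql["sel"]) != 1 or len(sql["agg"]) != 1:
        return None
    # WikiSQL conditions are implicitly joined by AND; 2 means "or".
    if sql.get("cond_conn_op", 0) == 2:
        return None
    conds = []
    for col, op, value in sql["conds"]:
        symbol = "=" if TABLEQA_OP[op] == "==" else TABLEQA_OP[op]
        if symbol not in WIKISQL_OP:  # "!=" has no WikiSQL equivalent
            return None
        conds.append([col, WIKISQL_OP.index(symbol), value])
    agg_name = TABLEQA_AGG[sql["agg"][0]]
    return {
        "sel": sql["sel"][0],
        "agg": WIKISQL_AGG.index(agg_name),
        "conds": conds,
    }


if __name__ == "__main__":
    single = {"sel": [7], "agg": [1], "cond_conn_op": 0,
              "conds": [[1, 2, "世茂茂悦府"]]}
    print(to_wikisql(single))
```

With this mapping, TableQA's AVG (code 1) becomes WikiSQL index 5 and `==` (code 2) becomes index 0. The example query shown earlier, with two selected columns, would be returned as `None`.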

After I run the code, example.matched_values is an empty OrderedDict(). How should I deal with this?