GRAPPA: Grammar-augmented pre-training for table semantic parsing

This paper proposes a pre-training method for the text2sql task.

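A model trained with this method can be used like BERT: it converts natural language input into vector representations, which then serve as the input to a text2sql model.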

Compared with other pre-training methods, GRAPPA uses only existing table-related datasets, so pre-training is relatively fast.


Information

Main author: Tao Yu
Affiliations: Salesforce Research & Yale University & University of Edinburgh
Paper link

1 Datasets for the text2sql task

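A dataset usually has two parts. The first part contains all the databases: each database has several tables, each table has several columns, and the database also records relations between tables such as primary keys and foreign keys. The second part is a set of question-SQL pairs, each tied to one database. The goal is to translate the natural language question into an SQL statement that can be executed on that database and returns the correct result.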
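The two parts can be pictured with a minimal sketch. The toy table, its rows, and the `run_query` helper are invented for illustration; the field names `db_id`, `question`, and `query` are loosely modeled on Spider's JSON layout and should be treated as assumptions rather than a specification.

```python
import sqlite3

# Part 1: a database. This toy schema and its rows are hypothetical.
SCHEMA = """
CREATE TABLE performance (
    performance_id INTEGER PRIMARY KEY,
    location TEXT,
    attendance INTEGER
);
"""
ROWS = [
    (1, "TD Garden", 1820),
    (2, "TD Garden", 2010),
    (3, "Madison Square Garden", 1878),
]

# Part 2: one question-SQL pair, tied to the database it should run against.
example = {
    "db_id": "performance_db",
    "question": "Show the locations that have at least two performances.",
    "query": "SELECT location FROM performance "
             "GROUP BY location HAVING COUNT(*) >= 2",
}


def run_query(sql):
    """Execute sql against a fresh in-memory copy of the toy database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SCHEMA)
        conn.executemany("INSERT INTO performance VALUES (?, ?, ?)", ROWS)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


result = run_query(example["query"])
print(result)
```

A predicted SQL statement counts as correct here only if it executes on the paired database and returns the expected rows, which is why every pair carries a database identifier.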

2 Building the pre-trained model

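For pre-training, data augmentation is used to generate synthetic data, and existing data related to table tasks is collected as well. The model is then pre-trained on this combined corpus.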

data augmentation

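The SQL-question pairs in the Spider dataset are analyzed: the parts of each question that correspond to SQL components are replaced with the matching grammar symbols, which yields a set of grammar rules. The few dozen most frequent rules are kept. These rules, together with the replaceable components, are then instantiated over the tables in the WikiSQL and Spider training sets to generate new question-SQL pairs.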

This lets the model become familiar with the links between natural language expressions and SQL components.

Example:

question: Show the locations that have at least two performances.

SQL: SELECT location FROM performance GROUP BY location HAVING COUNT(*) >= 2

Grammar rule: select column1 from table group by column1 having count(*) >= number

Parts of the SQL and the question that correspond to each other and can be replaced:

(at least) corresponds to (>=), (two) corresponds to (2), location can be replaced with another column, and performance can be replaced with another table.
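The substitution idea can be sketched as a pair of templates filled from the same choices. This is only an illustration of the principle, not the paper's grammar: the templates, the small phrase lexicons, and the `TABLES` schema are all hypothetical.

```python
import random

# One question/SQL template pair, induced from the example above.
QUESTION_TEMPLATE = "Show the {column} that have {op_text} {number_text} {table}."
SQL_TEMPLATE = ("SELECT {column} FROM {table} "
                "GROUP BY {column} HAVING COUNT(*) {op} {number}")

# Hypothetical lexicons linking natural language phrases to SQL pieces.
OPERATORS = {"at least": ">=", "more than": ">", "fewer than": "<"}
NUMBERS = {"two": 2, "three": 3, "five": 5}

# Hypothetical schema: table name -> column names.
TABLES = {"concert": ["venue", "year"], "employee": ["city", "department"]}


def sample_pair(rng):
    """Fill both templates with the same sampled table, column, and values."""
    table = rng.choice(sorted(TABLES))
    column = rng.choice(TABLES[table])
    op_text = rng.choice(sorted(OPERATORS))
    number_text = rng.choice(sorted(NUMBERS))
    question = QUESTION_TEMPLATE.format(
        column=column, op_text=op_text, number_text=number_text, table=table)
    sql = SQL_TEMPLATE.format(
        column=column, table=table,
        op=OPERATORS[op_text], number=NUMBERS[number_text])
    return question, sql, table, column


rng = random.Random(0)
for _ in range(3):
    question, sql, _, _ = sample_pair(rng)
    print(question)
    print(sql)
```

The generated questions are not always fluent English. What matters for this sketch is that the phrase in the question and the operator in the SQL always change together.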

The pre-trained model

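The model is initialized from $RoBERTa_{LARGE}$. The training data consists of the synthetic data produced by the data augmentation described above, plus tables and contexts taken from several related datasets. Each original or synthetic natural language question is concatenated with the column names of the corresponding table, separated by the special token \</s>, to form the pre-training input.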
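A minimal sketch of this input layout follows. The `build_input` helper is hypothetical, and it works on plain strings; in practice the tokenizer adds the special tokens and splits words into subwords, so the exact sequence will differ.

```python
def build_input(question, columns, bos="<s>", sep="</s>"):
    """Flatten a question and its column names into one sequence.

    Assumed layout: <s> question </s> col_1 </s> col_2 ... </s>
    """
    parts = [bos, question]
    for column in columns:
        parts.extend([sep, column])
    parts.append(sep)  # closing separator
    return " ".join(parts)


sequence = build_input(
    "Show the locations that have at least two performances.",
    ["location", "attendance"],
)
print(sequence)
```

Placing a separator in front of every column gives each column its own position in the sequence, which is what the per-column objective later reads from.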

Training uses the two objectives below.

Pre-training objectives

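MLM objective: trained only on the tables and questions collected from other datasets. 15% of the tokens in the column names and the natural language question are masked, the model predicts the masked tokens, and training uses a cross-entropy loss.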
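The masking step can be sketched as follows. This is a simplified illustration: it works on whitespace tokens, always substitutes a `<mask>` placeholder, and never masks the special tokens. The `mask_tokens` helper and these choices are assumptions, not details taken from the paper.

```python
import random


def mask_tokens(tokens, rng, mask_rate=0.15, mask_token="<mask>",
                special=("<s>", "</s>")):
    """Mask roughly mask_rate of the non-special tokens.

    Returns the masked sequence and a label list that holds the original
    token at masked positions and None everywhere else.
    """
    candidates = [i for i, tok in enumerate(tokens) if tok not in special]
    if not candidates:
        return list(tokens), [None] * len(tokens)
    n_mask = min(len(candidates), max(1, round(len(candidates) * mask_rate)))
    chosen = set(rng.sample(candidates, n_mask))
    masked = [mask_token if i in chosen else tok
              for i, tok in enumerate(tokens)]
    labels = [tok if i in chosen else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = ("<s> show the locations that have at least two performances "
          "</s> location </s> attendance </s>").split()
masked, labels = mask_tokens(tokens, random.Random(0))
print(masked)
print(labels)
```

Only the positions with a non-None label would contribute to the cross-entropy loss; the model sees the masked sequence and has to recover the original tokens there.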

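SSP objective: trained only on the synthetic data. Given the natural language question, the model predicts for each column whether it is mentioned in the question and which SQL operations (such as MAX or GROUP BY) apply to it in the corresponding SQL. As in the input format above, columns are separated by \</s>. The vector representation at each \</s> position is passed through a two-layer linear network with GELU and layer normalization, and training uses a cross-entropy loss.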
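How the per-column targets line up with the \</s> positions can be sketched like this. The label strings, the `NONE` class for unmentioned columns, and the choice of the separator in front of each column are assumptions made for illustration; both helper functions are hypothetical.

```python
def ssp_labels(columns, column_ops):
    """Build one label per column from the operations applied to it.

    column_ops maps a column name to the SQL operations that touch it;
    columns absent from the SQL get the assumed "NONE" class.
    """
    labels = []
    for column in columns:
        ops = column_ops.get(column, [])
        labels.append(" ".join(ops) if ops else "NONE")
    return labels


def column_positions(tokens, sep="</s>"):
    """Indices of the separator that precedes each column name."""
    sep_indices = [i for i, tok in enumerate(tokens) if tok == sep]
    return sep_indices[:-1]  # the final separator only closes the sequence


tokens = ("<s> show the locations that have at least two performances "
          "</s> location </s> attendance </s>").split()
columns = ["location", "attendance"]
# For: SELECT location FROM performance GROUP BY location HAVING COUNT(*) >= 2
column_ops = {"location": ["SELECT", "GROUP BY"]}

labels = ssp_labels(columns, column_ops)
positions = column_positions(tokens)
for pos, column, label in zip(positions, columns, labels):
    print(pos, column, label)
```

Each printed position marks where an encoder output vector would be read, and the paired label is the class the classification head is trained to predict for that column.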

3 Results of using the pre-trained model

The results are good. In the experiments, GRAPPA gains 3% on WikiSQL over $RoBERTa_{LARGE}$ and 4% on Spider over $BERT_{large}$.

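Because it is used like BERT, it is also quite general: it can be combined with different encoders, and it can be used together with methods that pre-train the decoder, such as GP: Context-free Grammar Pre-training for Text-to-SQL Parsers.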

4 Good sentences

We present GRAPPA, an effective pre-training approach for table semantic parsing that learns a compositional inductive bias in the joint representations of textual and tabular data. Recent pre-training language models (LMs) such as BERT and RoBERTa achieve tremendous success in a spectrum of natural language processing tasks, including semantic parsing. These advances have shifted the focus from building domain-specific semantic parsers to cross-domain semantic parsing. Recently the field has witnessed a surge of interest in joint textual-tabular data understanding problems, such as table semantic parsing, question answering, retrieval, fact-checking, and summarization. Unlike all the previous work where augmented data is used in the end task training, we apply the framework to language model pre-training. Training semantic parsers is usually slow, and augmenting a large amount of syntactic pairs directly to the end task training data can be prohibitively slow or expensive.
