
What they do when in doubt: a study of inductive biases in seq2seq learners #21


Akasuyi commented 2 years ago

When there are only a few training samples, the model sometimes does not have enough information to predict the result we want, and it has to "fill in" the missing information on its own in order to make a choice; this preference can be called the model's inductive bias. The paper designs four tasks and proposes a way of measuring inductive bias, description length, to observe what kind of result various basic seq2seq models tend to produce when the information is insufficient.

Example: 2 ? 2 → 4. Both + and * make this equation hold. If this is the only training example, will the model lean toward + or toward *?
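
A tiny sketch (my own, not from the paper) of why this single training example is ambiguous: both candidate rules fit it, but they diverge on the held-out input, so whichever continuation the learner produces reveals its inductive bias.

```python
# Both candidate rules fit the lone training example 2 ? 2 -> 4,
# but they disagree on the held-out input 3 ? 3.
rules = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

train_in, train_out = (2, 2), 4
holdout_in = (3, 3)

for name, f in rules.items():
    assert f(*train_in) == train_out          # both rules are consistent with training
    print(name, "predicts", f(*holdout_in))   # '+' -> 6, '*' -> 9
```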

Copy this into a markdown editor for a better reading experience.

Notes

1 A method for measuring inductive bias: description length

I did not fully follow the proof, but roughly the idea is this: once part of the information is already available, the cost of transmitting the specific remaining information depends on the model's own preference. If, after training on the available data, the model is more inclined to produce one type of data, then transmitting data of that type is cheap. The cost can be computed with cross-entropy. For example:

Suppose we have the single training example 2 ? 2 → 4, and we hold out two examples as the test set, corresponding to the + and * answers: 3 + 3 → 6 and 3 * 3 → 9. For each model, we first train on 2 ? 2 → 4 and then ask whether it maps 3 ? 3 to 6 or to 9. Computing the cross-entropy loss of the two held-out examples with the formula below tells us whether the model prefers + or *: if the 3 + 3 example has the smaller loss, the model leans toward +, and likewise for 3 * 3.

$$ L_{M}(D) = -\sum_{t=2}^{k} \log p_{M_{t-1}}(y_{t} \mid x_{t}) + c $$

where $c$ is a constant and $M_{t-1}$ denotes the model $M$ after training on the first $t-1$ examples.
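
Below is a minimal Python sketch of the online (prequential) coding idea behind this formula. It is not the paper's seq2seq setup: I substitute a Laplace-smoothed categorical model for the learner so the example stays self-contained and runnable, and the sum here starts from the first symbol rather than from $t=2$.

```python
import math
from collections import Counter

def prequential_description_length(symbols, alphabet, alpha=1.0):
    """Sum of -log2 p_{M_{t-1}}(y_t): each symbol is encoded with the model
    fit only on the symbols transmitted before it (a stand-in for retraining
    the seq2seq learner after each transferred example)."""
    counts = Counter()
    total_bits = 0.0
    for s in symbols:
        # probability of s under the model "trained" on the prefix seen so far
        p = (counts[s] + alpha) / (sum(counts.values()) + alpha * len(alphabet))
        total_bits += -math.log2(p)
        counts[s] += 1  # update the model with the symbol just transmitted
    return total_bits

# Regular data is cheap to transmit; data the model does not expect costs more.
print(prequential_description_length(list("aaaaaaaa"), ["a", "b"]))
print(prequential_description_length(list("abbabaab"), ["a", "b"]))
```

In the paper's setting, the candidate hold-out set (e.g. the + answers vs. the * answers) with the smaller description length is the one the learner is biased toward.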

2 Experimental design

Four types of tasks are designed:

Count-or-Memorization, Add-or-Multiply, Hierarchical-or-Linear, Composition-or-Memorization

Example: [figure from the paper illustrating the four tasks; image not preserved]
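
To make the first two tasks concrete, here is a small Python sketch of what the training/hold-out pairs could look like under my reading of the paper; the symbols a, b, c and the default lengths are illustrative assumptions, not the paper's exact configuration.

```python
def count_or_memorization(l=10, m=5):
    # Train on the single pair a^l -> b^l. On the held-out input a^m,
    # "count" predicts b^m while "memorize" repeats the training output b^l.
    train = ("a" * l, "b" * l)
    holdout = {"input": "a" * m, "count": "b" * m, "memorize": "b" * l}
    return train, holdout

def add_or_multiply(p=3, q=3):
    # Train on a^2 b^2 -> c^4, which is consistent with both addition and
    # multiplication (2 + 2 = 2 * 2 = 4), mirroring the 2 ? 2 -> 4 example above.
    train = ("a" * 2 + "b" * 2, "c" * 4)
    holdout = {"input": "a" * p + "b" * q,
               "add": "c" * (p + q),
               "multiply": "c" * (p * q),
               "memorize": "c" * 4}
    return train, holdout

print(count_or_memorization())
print(add_or_multiply())
```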

3 Experimental results

Four learners were tested in total: LSTM without attention, LSTM with attention, CNN, and Transformer.

  1. When memorization is an option, both CNN and Transformer prefer to memorize directly (memorization); the exception is the last task, where the CNN shifts toward composition as the number of training examples grows.
  2. As the training sequences get longer, both LSTM variants increasingly prefer to count (count), to produce the multiplied value (multiply), and to respect the hierarchical structure of the data (hierarchical).
  3. On the last task, the CNN behaves opposite to the other learners and prefers composition.

4 Other thoughts

What surprised me is the Transformer. Although it outperforms LSTMs in many settings, in these experiments it consistently leans toward memorization and seems to do worse than the LSTMs. This may be because the Transformer has more parameters and simply needs more training data (even if those data would not by themselves reveal the test-set answer). My guess is that with more data, the Transformer might also come to prefer the more "advanced" operations.

Compared with the other models, CNNs are used less in NLP and shine mostly in CV. Yet on the last task the CNN showed the preference that, from a "human" point of view, is the more sophisticated one. Does the CNN have an irreplaceable role in NLP after all?

5 Good sentences

Yet, these models have been criticized for requiring a tremendous amount of data and being unable to generalize systematically

To illustrate the setup we work in, consider a quiz-like question

we take this principle to the extreme and study biases of seq2seq learners in the regime of very few training examples, often as little as one

W.l.o.g, we assume that there are two candidate “rules” that explain the training data, but do not coincide on the hold-out data

The solution proposed by Solomonoff (1964) is to select the continuation that admits “the simplest explanation” of the entire string, i.e. that is produced by programs of the shortest length (description length)

The problem of calculating $L_M(D)$, $D = \{x_i, y_i\}_{i=1}^{k}$ is considered as a problem of transferring outputs $y_i$ one-by-one, in a compressed form, between two parties, Alice (sender) and Bob (receiver). Alice has the entire dataset $\{x_i, y_i\}$, while Bob only has inputs $\{x_i\}$. Before the transmission starts, both parties agreed on the initialization of the model M, order of the inputs {x}, random seeds, and the details of the learning procedure. — I really like the way this passage introduces the idea.

In this task, we contrast learners’ preferences for counting vs. memorization.