osuossu8 commented 4 years ago

Paper link

https://arxiv.org/abs/1901.08634

Publication date (yyyy/mm/dd)

2019/01/24

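Summary

Proposes a new BERT-based baseline for the Natural Questions task.
Reduces the gap between the model F1 score and the human upper bound by 30% on the long-answer task and 50% on the short-answer task, in relative terms.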

Author implementation

osuossu8 commented 4 years ago

The key insights in our approach

Predicting long and short answers at once with a single model worked better than solving the two tasks in stages with two models
- Original: to jointly predict short and long answers in a single model rather than using a pipeline approach,
Split each document into multiple training instances by sliding a window over it, allowing tokens to overlap (the original BERT for SQuAD does this too)
- Original: to split each document into multiple training instances by using overlapping windows of tokens, like in the original BERT model for the SQuAD task,
Aggressively drop instances without an answer at training time to build a balanced training set
- Original: to aggressively downsample null instances (i.e. instances without an answer) at training time to create a balanced training set,
Use the “[CLS]” token to predict null instances at training time, and at inference time rank spans by the span score minus the “[CLS]” score
- Original: to use the “[CLS]” token at training time to predict null instances and rank spans at inference time by the difference between the span score and the “[CLS]” score.
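
The last insight (ranking spans against the “[CLS]” score) can be sketched roughly as follows. This is an illustration, not the authors' code: it assumes index 0 is the “[CLS]” token, that the model emits one start score and one end score per token, and `max_span_len` is a hypothetical cap that does not come from the notes.

```python
from typing import List, Tuple


def best_span(start_scores: List[float], end_scores: List[float],
              max_span_len: int = 30) -> Tuple[float, int, int]:
    """Return (score, start, end) of the highest-ranked non-null span.

    Assumption: index 0 is [CLS]. A span's ranking score is its own
    start + end score minus the [CLS] start + end score.
    """
    null_score = start_scores[0] + end_scores[0]
    best = (float("-inf"), 0, 0)
    for s in range(1, len(start_scores)):
        # max_span_len is a hypothetical limit to keep the search small.
        for e in range(s, min(s + max_span_len, len(end_scores))):
            score = start_scores[s] + end_scores[e] - null_score
            if score > best[0]:
                best = (score, s, e)
    return best


score, s, e = best_span([1.0, 0.5, 3.0, 0.0], [1.0, 0.0, 0.5, 2.5])
print(score, s, e)
```

A score near or below zero means the span barely beats the null prediction, which is what makes the difference usable for thresholding.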

osuossu8 commented 4 years ago

Data Preprocessing

Tokenize with a 30,522-wordpiece vocabulary
Generate multiple training instances per example
- [CLS] + the tokenized question + [SEP] + tokens from the content of the document + [SEP]
- 512 tokens in total
- Start at each token position that is a multiple of 128 and take a 512-token window as one instance, repeating as many times as possible to generate instances
- On average, one example yields 30 instances
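
A minimal sketch of the window generation described above, assuming the question and document are already tokenized into wordpiece strings. The function name `make_instances`, the early-stop rule at the end of the document, and the absence of padding are assumptions made for illustration.

```python
from typing import List

MAX_LEN = 512  # total tokens per instance (from the notes)
STRIDE = 128   # window starts are multiples of 128 (from the notes)


def make_instances(question: List[str], doc: List[str],
                   max_len: int = MAX_LEN, stride: int = STRIDE) -> List[List[str]]:
    # Room left for document tokens after [CLS], the question, and two [SEP]s.
    budget = max_len - len(question) - 3
    if budget <= 0:
        raise ValueError("question too long for the window")
    instances = []
    for start in range(0, len(doc), stride):
        chunk = doc[start:start + budget]
        instances.append(["[CLS]"] + question + ["[SEP]"] + chunk + ["[SEP]"])
        if start + budget >= len(doc):
            break  # assumption: stop once a window reaches the document end
    return instances


doc = [str(i) for i in range(1000)]
windows = make_instances(["q"] * 9, doc)
print(len(windows), len(windows[0]))
```

Because the stride is much smaller than the window, most document tokens appear in several instances, so an answer near one window's edge usually sits comfortably inside a neighbouring window.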
Compute the start and end tokens of the target answer span for each instance
- If all annotated short spans are contained in the instance, we set the start and end target indices to point to the smallest span containing all the annotated short answer spans.
- If there are no annotated short spans but there is an annotated long answer span completely contained in the instance, we set the start and end target indices to point to the entire long answer span.
- If no short or long span can be found in the current instance, we set the target start and end indices to point to the “[CLS]” token. We dub the instances in the last category “null instances”.
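
The three rules above can be sketched as a small function. It assumes annotated spans have already been mapped to inclusive token indices relative to the instance (so a span outside the window has out-of-range indices) and that “[CLS]” sits at index 0; both are assumptions of this sketch.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # inclusive (start, end) indices within the instance


def target_span(short_spans: List[Span], long_span: Optional[Span],
                instance_len: int, cls_index: int = 0) -> Span:
    def inside(span: Span) -> bool:
        return 0 <= span[0] <= span[1] < instance_len

    # Rule 1: all short spans fit, so point at the smallest span covering them.
    if short_spans and all(inside(s) for s in short_spans):
        return (min(s[0] for s in short_spans), max(s[1] for s in short_spans))
    # Rule 2: no short spans, but the long answer fits entirely.
    if not short_spans and long_span is not None and inside(long_span):
        return long_span
    # Rule 3: null instance, point both indices at [CLS].
    return (cls_index, cls_index)


print(target_span([(10, 12), (20, 25)], (5, 40), 512))
```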
Because the data is huge and annotations are sparse, 98% of the generated instances are null, so they are downsampled
- This leaves 500,000 instances of 512 tokens each as the training set
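
One simple way to downsample null instances is to keep each with a fixed probability. The sketch below assumes an instance is a `(tokens, start, end)` tuple and is null when both indices point at position 0; `keep_prob` is a hypothetical parameter, since the notes do not record the rate that was used.

```python
import random
from typing import List, Tuple

Instance = Tuple[List[str], int, int]  # (tokens, start index, end index)


def downsample_nulls(instances: List[Instance], keep_prob: float,
                     seed: int = 0) -> List[Instance]:
    rng = random.Random(seed)  # seeded so the sample is reproducible
    kept = []
    for inst in instances:
        _, start, end = inst
        is_null = (start == 0 and end == 0)  # assumption: [CLS] is index 0
        # Always keep answer-bearing instances; keep nulls with keep_prob.
        if not is_null or rng.random() < keep_prob:
            kept.append(inst)
    return kept


data = [(["[CLS]"], 0, 0)] * 1000 + [(["[CLS]", "a"], 1, 1)] * 20
print(len(downsample_nulls(data, keep_prob=0.02)))
```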
Introduce special markup tokens
- the form “[Paragraph=N]”, “[Table=N]”, and “[List=N]” at the beginning of the N-th paragraph
- Annotated answers are more likely to appear in the first few paragraphs and tables, so telling the model where it is in the document helps
- special tokens are not split further by the wordpiece model.
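
As a rough illustration of the markup tokens, the sketch below prefixes each block of tokens with a marker. It assumes the document has already been split into `(kind, tokens)` blocks, that N is 1-based, and that each kind keeps its own counter; none of these details are confirmed by the notes.

```python
from collections import Counter
from typing import List, Tuple


def add_markup(blocks: List[Tuple[str, List[str]]]) -> List[str]:
    """Prefix each block with a marker such as [Paragraph=1].

    `kind` is expected to be "Paragraph", "Table", or "List".
    """
    counts: Counter = Counter()
    out: List[str] = []
    for kind, tokens in blocks:
        counts[kind] += 1  # assumption: separate 1-based counter per kind
        out.append(f"[{kind}={counts[kind]}]")
        out.extend(tokens)
    return out


print(add_markup([("Paragraph", ["a", "b"]), ("Table", ["c"]), ("Paragraph", ["d"])]))
```

For the markers to survive as single tokens, they would also have to be registered in the tokenizer vocabulary so that the wordpiece model does not split them.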
Classify each instance into one of five answer types
- short, yes, no, long, no-answer

osuossu8 commented 4 years ago

Model

Represent each training instance as a 4-tuple
- (c, s, e, t)
- c : 512 wordpiece ids
- s, e ∈ {0, 1, ... , 511} : start and end indices of the target answer span
- t : target type label ∈ {short, long, yes, no, no-answer}

Loss function

$$
\begin{aligned}
L &= -\log p(s, e, t \mid c) \\
  &= -\log p_{\text{start}}(s \mid c) - \log p_{\text{end}}(e \mid c) - \log p_{\text{type}}(t \mid c)
\end{aligned}
$$
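
The loss is a sum of three negative log-likelihood terms. A minimal pure-Python sketch follows, assuming each probability is a softmax over the corresponding logits (over token positions for start and end, over the five answer types for the type); the function names are made up for this example.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def nq_loss(start_logits: List[float], end_logits: List[float],
            type_logits: List[float], s: int, e: int, t: int) -> float:
    """-log p_start(s|c) - log p_end(e|c) - log p_type(t|c)."""
    return -(log_softmax(start_logits)[s]
             + log_softmax(end_logits)[e]
             + log_softmax(type_logits)[t])


# With uniform logits over 4 positions and 5 types, the loss is log(4 * 4 * 5).
print(nq_loss([0.0] * 4, [0.0] * 4, [0.0] * 5, s=1, e=2, t=0))
```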

Use the validation set to choose thresholds
- for predicting a long answer only, or no answer
Initialize the model from a BERT fine-tuned on SQuAD 1.
- Adam, batch size 8, learning rate 3·10^-5
- 5 h on the NQ dev set with a single Tesla P100


osuossu8 / paper-reading

[2019] A BERT Baseline for the Natural Questions #3
