osuossu8 / kaggle-solution


[2019] TensorFlow 2.0 Question Answering #1

osuossu8 opened 4 years ago

osuossu8 commented 4 years ago

Competition link

https://www.kaggle.com/c/tensorflow2-question-answering

Description

In this competition, your goal is to predict short and long answer responses to real questions about Wikipedia articles. The dataset is provided by Google's Natural Questions, but contains its own unique private test set. A visualization of examples shows long and—where available—short answers.

Evaluation

micro F1 score

Gold and Silver Medal Solutions

1st 2nd 3rd 4th 6th 7th 8th 9th 17th 21st 23rd 27th 30th 31st 45th 47th

Other (if any)

submission format

-7853356005143141653_long,6:18
-7853356005143141653_short,YES
-545833482873225036_long,105:200
-545833482873225036_short,
-6998273848279890840_long,
-6998273848279890840_short,NO
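
For reference, a minimal sketch of writing rows in this format (assuming pandas and the example_id / PredictionString columns of the sample submission; the predictions dict is made up):

import pandas as pd

# hypothetical predictions: example_id -> (long "start:end" or "", short "start:end"/"YES"/"NO" or "")
preds = {
    "-7853356005143141653": ("6:18", "YES"),
    "-545833482873225036": ("105:200", ""),
    "-6998273848279890840": ("", "NO"),
}

rows = []
for example_id, (long_pred, short_pred) in preds.items():
    rows.append({"example_id": f"{example_id}_long", "PredictionString": long_pred})
    rows.append({"example_id": f"{example_id}_short", "PredictionString": short_pred})

pd.DataFrame(rows).to_csv("submission.csv", index=False)
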
osuossu8 commented 4 years ago

1st place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127551

Contents

[Framework] PyTorch, referring to a public kernel

[Overview]
・Candidates were generated with a method different from the sampling strategy described in the baseline paper and used as training data.
・Since there were as many as 40 million negative candidates available as training data, only one negative was sampled per document.
・For efficiency, hard negative sampling was adopted instead of sampling from a uniform distribution.
・The final submission was an ensemble of 5 models.

[Sampling Strategy]
・Negatives were initially sampled from a uniform distribution, but the results were not great.
・The reason is that most negative candidates were too easy for the model.
・Hard negative sampling was adopted to make the model work on harder problems.

[Hard negative sampling]

  1. I firstly trained a model with uniform sampling, and predicted on the whole training data.
  2. Stored the answer probability for each negative candidate.
  3. The last step was to normalize the probabilities of negative candidates within documents to form a distribution.
  4. For the following model training, the negative candidates could be sampled from the probability distribution.
osuossu8 commented 4 years ago

In other words:

"long_answer_candidates": [
  { "start_token": 5, "end_token": 22, "top_level": true },          --> 正解
  { "start_token": 13, "end_token": 21, "top_level": false },        --> 負例, 簡単すぎる
  { "start_token": hoge0, "end_token": fuga0, "top_level": false },  --> 負例, 簡単すぎる
  { "start_token": hoge1, "end_token": fuga1, "top_level": false },  --> 負例, ちょっとマシ 
  { "start_token": hoge2, "end_token": fuga2, "top_level": false },  --> 負例, 簡単すぎる
]

Train a model on the above once, then predict on the same candidates:

"long_answer_candidates": [
  { "start_token": 5, "end_token": 22, "top_level": true },           --> 正解 prob 0.7X
  { "start_token": 13, "end_token": 21, "top_level": false },         --> 負例, 簡単すぎる prob 0.03X
  { "start_token": hoge0, "end_token": fuga0, "top_level": false },   --> 負例, 簡単すぎる prob 0.0X
  { "start_token": hoge1, "end_token": fuga1, "top_level": false },   --> 負例, ちょっとマシ prob 0.18X
  { "start_token": hoge2, "end_token": fuga2, "top_level": false },   --> 負例, 簡単すぎる prob 0.09X
]

[0.03X, 0.0X, 0.18X, 0.09X] --> normalize these

Sample negatives from this probability distribution and train, as in the sketch below.
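
A minimal numpy sketch of this normalize-and-sample step (the candidate probabilities are made up to mirror the example above):

import numpy as np

rng = np.random.default_rng(0)

# per-document negative candidates with the probability assigned by the
# first-round (uniform-sampling) model
neg_candidates = [
    {"start_token": 13, "end_token": 21, "prob": 0.03},
    {"start_token": 23, "end_token": 40, "prob": 0.01},
    {"start_token": 45, "end_token": 80, "prob": 0.18},
    {"start_token": 81, "end_token": 99, "prob": 0.09},
]

# normalize within the document to form a sampling distribution
probs = np.array([c["prob"] for c in neg_candidates])
probs = probs / probs.sum()

# sample one hard negative per document for the next round of training
hard_negative = neg_candidates[rng.choice(len(neg_candidates), p=probs)]
print(hard_negative)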

osuossu8 commented 4 years ago

[New Tokens]

・The 9 frequent HTML tags (<P>, <Table>, <Tr>, <Ul>, <Ol>, <Dl>, <Li>, <Dd>, <Dt>) --> added as new tokens

・Other HTML tags (e.g. <strong>, <code>) --> I replaced them with a unique token in the tokenization dictionary or simply added another new token to represent them.
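
With the huggingface transformers API, adding such tokens roughly looks like this (a sketch; the checkpoint name is just an example, not necessarily the one used here):

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")
model = BertModel.from_pretrained("bert-large-uncased-whole-word-masking")

new_tokens = ["<P>", "<Table>", "<Tr>", "<Ul>", "<Ol>", "<Dl>", "<Li>", "<Dd>", "<Dt>"]
num_added = tokenizer.add_tokens(new_tokens)

# the embedding matrix has to grow to cover the newly added tokens
model.resize_token_embeddings(len(tokenizer))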

[Model Architecture, Training and Evaluation]

・The model architecture was the same as in the baseline paper (a 5-class classification branch + 2 span classification branches).
・5 classes: ("noanswer", "longansweronly", "shortanswer", "yes", "no")
・2 spans (there was no span prediction for answers without a short answer span because I directly used candidates)

・The loss update of the span prediction branch was simply ignored if no short answer span exists during training.

・In the testing stage, for each document, I used 1.0 - prob(noanswer) as the long answer score (confidence) for each candidate, and the candidate with the highest confidence was chosen to represent the document (sketched below).

・Short answer spans were forced to be within the highest score long answer candidate (not sure if this is necessary).

・I used prob(shortanswer)+prob(yes)+prob(no) as the short answer score.

・The exact class of the short answer was determined by the maximum of the three prob values.

・For span prediction, the output token-level probabilities were mapped to the word-level (white space tokenized) probabilities for easier ensembling of models with different tokenizers.
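
A rough numpy sketch of the test-time candidate scoring described above (class order and the probability values are made up for illustration):

import numpy as np

# hypothetical per-candidate class probabilities for one document, in the order
# (noanswer, longansweronly, shortanswer, yes, no)
cand_probs = np.array([
    [0.90, 0.05, 0.03, 0.01, 0.01],
    [0.20, 0.30, 0.45, 0.03, 0.02],
    [0.60, 0.25, 0.10, 0.03, 0.02],
])

long_score = 1.0 - cand_probs[:, 0]       # 1 - prob(noanswer) per candidate
best = int(np.argmax(long_score))         # candidate that represents the document

short_score = cand_probs[best, 2:].sum()  # prob(shortanswer) + prob(yes) + prob(no)
short_class = ["shortanswer", "yes", "no"][int(np.argmax(cand_probs[best, 2:]))]
print(best, long_score[best], short_score, short_class)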

osuossu8 commented 4 years ago

[Models and Results]
・5-model ensemble: one Bert-base, two Bert-large (WWM), and two Albert-xxl (v2) models, all uncased.
・The Bert-large and Albert models were first fine-tuned on SQuAD data.
・CV was computed on the SQuAD eval data, not with the competition's evaluation metric.
・Cased models were not used because of advice (seen in a paper or somewhere) that cased models are better avoided unless there is a strong reason to use them.
・The ensemble was a weighted average after softmax over all models (Albert weighted slightly higher), but the weights themselves were not important.
・Roberta and XLNet were also considered:
  Roberta --> the score got worse, possibly due to an implementation mistake
  XLNet --> "I couldn't even run the provided huggingface squad tuning code."

[Finetuning Process]
・Tuning for 3-4 epochs.
・Early stopping was based on validation performance.
・The validation set (the dev set) is the standard dataset provided with the NQ dataset, along with the code for F1 score calculation.

[Speed-up trick]
・Among the 5 models, the Bert-base model was used to propose (pre-select) candidates, and only those were predicted by the remaining models (Bert-base predicts "has long answer or not").

The proposal (pre-selection) works on candidates, not examples.
Assume an example has 100 candidates: the bert-base model will predict on all 100 candidates, rank them by the long answer probabilities, and then choose the 5 most probable candidates for the bert-large models to predict on. This way the big models' workload is reduced by 95% (see the sketch below).
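
A rough sketch of this pre-selection step (function and variable names are made up):

import numpy as np

def preselect(candidates, bert_base_long_probs, top_k=5):
    """Keep only the top_k candidates (ranked by the bert-base long-answer
    probability) for the larger models to score."""
    order = np.argsort(bert_base_long_probs)[::-1][:top_k]
    return [candidates[i] for i in order]

# example: 100 candidates, the big models only see 5 of them (~95% less work)
candidates = list(range(100))
probs = np.random.rand(100)
shortlist = preselect(candidates, probs)
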
osuossu8 commented 4 years ago

2nd place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127333

・repo https://github.com/see--/natural-question-answering

・public notebook https://www.kaggle.com/seesee/submit-full

・2nd place weights https://www.kaggle.com/seesee/nq-bert-uncased-68

Contents

[Compute environment]
・TPU
・Training the final model took only a little over 2 hours (reportedly it would take several days on a local machine with 2 x 1080 Ti).

[Framework and Model] ・Single TF 2.0 model with a BERT-large backbone + custom heads

# assumes: import tensorflow as tf; from tensorflow.keras import layers as L;
# TFBertPreTrainedModel, TFBertMainLayer and get_initializer come from the
# huggingface transformers library (exact import path depends on the version)
class TFBertForNaturalQuestionAnswering(TFBertPreTrainedModel):
    def __init__(self, config, *inputs, **kwargs):
        super().__init__(config, *inputs, **kwargs)
        self.num_labels = config.num_labels

        self.bert = TFBertMainLayer(config, name='bert')
        self.initializer = get_initializer(config.initializer_range)
        self.qa_outputs = L.Dense(config.num_labels,
            kernel_initializer=self.initializer, name='qa_outputs')
        self.long_outputs = L.Dense(1, kernel_initializer=self.initializer,
            name='long_outputs')

    def call(self, inputs, **kwargs):
        outputs = self.bert(inputs, **kwargs)
        sequence_output = outputs[0]
        logits = self.qa_outputs(sequence_output)
        start_logits, end_logits = tf.split(logits, 2, axis=-1)
        start_logits = tf.squeeze(start_logits, -1)
        end_logits = tf.squeeze(end_logits, -1)
        long_logits = tf.squeeze(self.long_outputs(sequence_output), -1)
        return start_logits, end_logits, long_logits

↑ the long_logits head is the original part

[Overview]

・Devised the sampling: specifically, the empty-answer ratio was adjusted to be similar to that of the full dataset (roughly the same as the long-answer ratio).
・Starting from the low empty-answer ratio used in the baseline paper did not give a good score.
・New tokens for the HTML tags contributed a small improvement.
・Various starting weights were tried. Results below:

bert-large-uncased (~0.70 LB) < bert-large-uncased-whole-word-masking (~0.72 LB) < bert-large-uncased-whole-word-masking-finetuned-squad (~0.73 LB).

・Tweaked the loss (long and short parts weighted equally):

# fixed loss: the short-answer loss (mean of start and end) and the
# long-answer loss are weighted equally
loss = ((tf.reduce_mean(start_loss) + tf.reduce_mean(end_loss)) / 2.0 +
        tf.reduce_mean(long_loss)) / 2.0

・Tried [1, 2, 3, 4] epochs and used 2, which gave the best local LB (CV?).
・Hyperparameters such as the sample ratio mattered more than the number of epochs.

・The long_logits prediction is a softmax classification rather than a binary classification: the target is the opening HTML tag token of the long answer, or the [CLS] token when there is no long answer.

・Roberta was also tried but did not match bert-large-uncased.

・No. If we ignore YES/NO answers there are 4 (short, long) possibilities: (empty, empty), (empty, span), (span, empty), (span, span). They can all be predicted by the model.

osuossu8 commented 4 years ago

[New Tokens]

def get_add_tokens(do_enumerate):
    tags = ['Dd', 'Dl', 'Dt', 'H1', 'H2', 'H3', 'Li', 'Ol', 'P', 'Table', 'Td', 'Th', 'Tr', 'Ul']
    opening_tags = [f'<{tag}>' for tag in tags]
    closing_tags = [f'</{tag}>' for tag in tags]
    added_tags = opening_tags + closing_tags
    # See `nq_to_sqaud.py` for special-tokens
    special_tokens = ['<P>', '<Table>']
    if do_enumerate:
        for special_token in special_tokens:
            for j in range(11):
                added_tags.append(f'<{special_token[1: -1]}{j}>')

    add_tokens = ['Td_colspan', 'Th_colspan', '``', '\'\'', '--']
    add_tokens = add_tokens + added_tags
    return add_tokens

# excerpt from the training script; `config_class`, `tokenizer_class`, `args`
# and `do_lower_case` are defined earlier in that script
config = config_class.from_json_file(args.model_config)
tokenizer = tokenizer_class(args.vocab_txt, do_lower_case=do_lower_case)
tags = get_add_tokens(do_enumerate=args.do_enumerate)
num_added = tokenizer.add_tokens(tags, offset=1)
osuossu8 commented 4 years ago

3rd place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127339

Contents

[Software] ・PyTorch ・PyTorch Lightning

[Hardware]
・local machines (3x GTX 1080 Ti and 2x GTX 1080 Ti)
・Roberta1 training took 35h
・Finetuning roberta-large on SQuAD 2.0 took 30h
・Finetuning the resulting model on the data of this competition took about 24h when using a stride of 192

[Overview]
・3 roberta-large models which were ensembled by voting
・input: very close to the bert-joint baseline
・LR: 1e-5
・batch_size: 16
・optim: simple Adam
・scheduler: none
・epochs: all models 1 epoch
・threshold: optimized for each model
・predicted the test set with a stride of 224 to fit inference of 3 models into the kernel
・reused a lot of preprocessing scripts from the bert-joint baseline shared by the organisers

Roberta 1:

initialized with roberta-large weights
stride 128
prediction of span & 5 answer types (unknown, yes, no, short , long)
---
Roberta 2:

initialized with roberta-large weights, then pretrained on Squad2.0
stride 192
prediction of span & 2 answer types (short , long)
---
Roberta 3:

initialized with roberta-large weights, then pretrained on Squad2.0
additional linear layer (768→768 + relu) before predicting the start and end tokens, respectively
stride 192
prediction of span & 2 answer types (short , long)

[Validation Scheme]
・Used the dev set of the original NQ dataset as the validation set; it correlated well with the LB.

[Architectures and pretrained models]
・Isn't it just impossible to beat the bert-joint baseline???
・Preprocessing different from the baseline was tried, without success.
・How to aggregate the resulting predictions --> under a constraint such as "short span length should be less than 30 tokens", we map the start and end token predictions of each window back to the original answer and create an answer length x answer length heatmap --> the argmax of this matrix then gives the start and end tokens (here 957:973). See the sketch below.
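
A rough numpy sketch of this length-constrained heatmap/argmax step (names and the length-limit handling are illustrative, not the team's code):

import numpy as np

def best_span(start_logits, end_logits, max_len=30):
    """Build an answer-length x answer-length score matrix from start/end logits
    mapped back to the original document, mask out spans that end before they
    start or that exceed the length limit, and take the argmax."""
    n = len(start_logits)
    scores = start_logits[:, None] + end_logits[None, :]
    mask = np.triu(np.ones((n, n), dtype=bool))                # end >= start
    mask &= ~np.triu(np.ones((n, n), dtype=bool), k=max_len)   # limit span length
    scores = np.where(mask, scores, -np.inf)
    start, end = np.unravel_index(np.argmax(scores), scores.shape)
    return int(start), int(end)

start, end = best_span(np.random.randn(2000), np.random.randn(2000))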

[Thresholding]
・Optimizing F1 directly is challenging, so 4-way thresholding worked best.
・We build thresholds for the long and short answer types as well as for the logits of the start + end tokens.
・The thresholds were determined with a 4-D grid search (using scipy.optimize.minimize).
・We also experimented with using the corresponding quantiles.

[Ensembling]

  1. majority vote between the results
  2. blend model predictions and apply thresholding
  3. also blended the logits before thresholding

Both approaches were submitted.

[Speed-up]
・use a stride of 224 for the test data
・convert the model to fp16 for predictions

model = TFQARoberta()
model.half().cuda()

no need for apex

・use multiprocessing for preprocessing and postprocessing

[Other]
・When aggregating predictions over all windows, the average of the start and end logits was used instead of taking the maximum as in the baseline paper.
・The first short answer span was used instead of the convex hull of all short answer spans.
・We tried to use all short answer spans for training instead of the first one, either by creating one window for each span, or by using BCE on the start and end logits to accommodate the presence of several 1s. None of these improved things, quite the contrary.
↑ There is probably still room for improvement here.

osuossu8 commented 4 years ago

4th place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127371

Contents

[Preprocessing] ・No preprocessing of the text. ・Different negative sampling rates: tried 0.02, 0.04 and 0.06.

[Data Aug] ・TTA: did not work. ・Changing the answer by replacing it with similar questions' answers: did not work. ・Transferring examples from other question answering datasets such as SQuAD and HotpotQA: did not work.

[Models] ・Tried XLNet, Bert Large Uncased/Cased, SpanBert Cased, Bert Large WWM. ・Same loss function and prediction as Bert-joint script. ・Cased << Uncased ・WWM BERT Large Uncased performed best

[Knowledge Distillation] ・Knowledge distillation is the key part of the 4th place solution.

・Trained a combined Bert-large model, by adding bert-large weight and wwm-bert-large weight like 0.8 * wwm-bert-large + 0.2 * bert-large. 1 step, 3e-5 lr.
    --> I tried different approaches like averaged bert output, concat, multiply.
    --> 0.8 / 0.2 の比率は val score が最大になるものを採用

・Freeze the bert layers and only finetune the classifier weights. 2 step, 1e-5 lr.
・Treat the first model as a teacher model and do knowledge distillation to get a student model (see the sketch below).
・Finetune the student model (bert large model) with only the classifier weights. 3 step, 1e-5 lr.

・For knowledge distillation, the teacher's score is lower than the student model's.
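
The post does not give the exact distillation objective; as one plausible reading, a generic soft-target distillation loss in PyTorch might look like this (temperature and shapes are placeholders):

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard soft-target distillation: KL divergence between the teacher's
    and the student's softened distributions, scaled by T^2."""
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

# example with random logits for a batch of 4 and 5 answer-type classes
loss = distillation_loss(torch.randn(4, 5), torch.randn(4, 5))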

osuossu8 commented 4 years ago

6th place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127521

・public notebook https://www.kaggle.com/prokaj/fork-of-baseline-html-tokens-v5

・pre and post processing code https://www.kaggle.com/prokaj/bert-baseline-pre-and-post-process

・final model in saved model format https://www.kaggle.com/prokaj/tpu-2020-01-22

・model code (used on tpu) https://www.kaggle.com/prokaj/tpu-code

・BERT implementation from official tensorflow models (preinstalled on TPU) https://github.com/tensorflow/models/tree/master/official

Contents

[Overview] ・Single BERT based model

[Preprocessing]
・Removed the special tokens from the baseline script ([ContextId=...], [Paragraph=0], etc.).
・Kept the simplified HTML tags as-is (the colspan information that table tags etc. contained was dropped).
・Added <*> and </*> at the beginning and end of each segment.
・I kept 4% of the negative examples, and also kept the very long answers that were not contained within one segment.
・I also processed the entire document text (the max_contexts limit in the original script was ignored).

[Model output]
・Classification head (returns span_start_logits, span_end_logits).
・With masking, this is used to get both the long answer and short answer logits.
・I also added a "cross" head, which is a bilinear function of pairs of the sequence output of the BERT model. Short span logits are then obtained as the sum of the start and end logits and the corresponding output of the cross head (see the sketch below).
・Impossible spans were masked out and a softmax gave the span probabilities. For the long span, a cross-entropy criterion was used for both start and end logits. For the short spans, the error was the negative log of the total probability of positive short spans. These error terms were computed only for examples having long/short answers. So the aim here is to learn the position given that there is an answer; the probability of having an answer came from the answer_type output.
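
A rough PyTorch sketch of one possible reading of the bilinear "cross" head (all names and shapes are illustrative, not the author's code):

import torch
import torch.nn as nn

class CrossHead(nn.Module):
    """Bilinear score for every (start, end) token pair of the BERT sequence output."""
    def __init__(self, hidden_size):
        super().__init__()
        self.bilinear = nn.Parameter(0.02 * torch.randn(hidden_size, hidden_size))

    def forward(self, sequence_output):                 # (batch, seq_len, hidden)
        left = sequence_output @ self.bilinear          # (batch, seq_len, hidden)
        return left @ sequence_output.transpose(1, 2)   # (batch, seq_len, seq_len)

# short span logits = start logit + end logit + cross term, as described above
hidden, seq_len = 768, 384
cross = CrossHead(hidden)(torch.randn(2, seq_len, hidden))
start_logits, end_logits = torch.randn(2, seq_len), torch.randn(2, seq_len)
span_logits = start_logits[:, :, None] + end_logits[:, None, :] + cross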

[Postprocessing]
・For each segment, the long and short spans with maximal probability were computed.
・From the answer type head, the probabilities of having a short or long answer in the segment were computed, and these probabilities were assigned to the most likely spans within the segment.
・These votes were maximized over all segments containing the given span.
・Then the spans with the highest overall scores were considered for the answer.
・The threshold was chosen based on CV.

[Training]
・TPU
・2 epochs
・lr 2.5e-5
・bs 64
・Before training on the NQ data, finetuned on SQuAD 2.0 data preprocessed with the same settings.

osuossu8 commented 4 years ago

7th place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127259

Contents

[Framework and hardware] ・TF1.15 and Google Cloud TPUs

[Validation scheme and experiment setup]
・Based on the joint-bert implementation, which is strong.
・bs = 64
・lr = 4e-5
・1 epoch
・save checkpoints every 500 or 1000 steps

[Improvements over the official joint-bert] ・Reference: an IBM paper

1) pre-trained weights

・Where the official joint-bert uses "BERT-Large, Uncased", switching to "BERT-Large, Uncased (Whole Word Masking)" gave a big boost.

2) negative sampling

・Before finetuning on the NQ task, finetuned on SQuAD 2.0.
・The official joint-bert samples a flat 2% of negative examples for both answerable and unanswerable questions; changing this to 1% for answerable and 4% for unanswerable --> F1 improved by 1 point.

3) max_seq_length, doc_stride

・Defaults are 512 and 128 respectively --> seems like overkill.
・Changed doc_stride to 256 during inference --> the score barely changed and inference time was halved.
・For training, we used the default 128 as well as 192 and 256. The IBM paper claims 192 gives the best results, but we didn't see much difference.

4) max_contexts

・Default is 48. ・We tried different values like 100 and 200. Using a bigger value is a tradeoff between more answer coverage vs. more "empty" windows.

5) sentence order shuffling

・An augmentation method proposed in the IBM paper: shuffle all the sentences in the paragraph containing short answers.
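
A minimal sketch of this augmentation (sentence splitting and answer-offset remapping are left out):

import random

def shuffle_sentences(paragraph_sentences, seed=None):
    """Augmentation: shuffle all sentences of the paragraph that contains the
    short answer (the answer span offsets must be remapped afterwards)."""
    shuffled = paragraph_sentences[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled

aug = shuffle_sentences(["The cat sat.", "It was warm.", "Then it slept."], seed=0)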

6) cased

・do_lower_case=False ・We generated a new vocab file by adding all the NQ special tokens into the cased BERT vocab.

7) Attention-over-attention

・Said to be the most effective technique in the IBM paper. ・It did not work for us.

[Ensemble] Our 2 submissions consist of the following 5 single models:
a. wwm, stride=256, dev 62.4
b. wwm, neg sampling, pre-tuned on squad, dev 64.7 (long 69.5, short 57.8) - best single model
c. wwm, neg sampling, max_contexts=200, dev 64.5
d. wwm, neg sampling, stride=192, dev 63.8
e. wwm, neg sampling, cased, dev 63.3

sub1: ensemble of a, b, c, dev 66.8 (long 71.6, short 59.8), private LB 0.69
sub2: ensemble of c, d, e, dev 67.0 (long 71.6, short 59.9), private LB 0.69
(note: these scores are after post-processing)

[Post-process]

yes/no thresholds

・Tuned on the dev set.
・If the yes/no logits in the answer_type_logit are over the thresholds, predict "YES"/"NO" regardless of the short span predictions (sketch below).
・This boosted F1 by 0.5.
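
A minimal sketch of this override rule (threshold values and the dictionary layout are placeholders, not the team's actual values):

def apply_yes_no_thresholds(answer_type_logits, short_span_pred,
                            yes_threshold=4.0, no_threshold=4.0):
    """If the YES / NO logits exceed their (dev-tuned) thresholds, override the
    short span prediction with "YES" or "NO"."""
    yes_logit, no_logit = answer_type_logits["yes"], answer_type_logits["no"]
    if yes_logit > yes_threshold and yes_logit >= no_logit:
        return "YES"
    if no_logit > no_threshold:
        return "NO"
    return short_span_pred

print(apply_yes_no_thresholds({"yes": 5.1, "no": 0.2}, "105:200"))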

max_contexts

Increasing max_contexts from the default 48 to 100 or 80 can squeeze out another 0.3 F1 points, taking advantage of the leftover inference time within the 3-hour limit. For sub1 we did 100; for sub2 we only did 80 because generating features for the cased model e took a little more time.

[Code]
repo: https://github.com/boliu61/tf2qa
inference notebook and model weights: https://www.kaggle.com/boliu0/7th-place-submission

osuossu8 commented 4 years ago

8th place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/127545

Preprocessing

・Instead of sliding a window over the wiki article, each top-level long answer candidate was preprocessed separately.
・If the length allows, a candidate becomes one training example; if it is too long, it is split into several training examples; if it is short, surrounding candidates are added.
・This produced 152k positive examples and over 12 million negatives; the negatives were reduced to 160k.
・Negatives were sampled by picking candidates with high tf-idf similarity to the question (see the sketch below).
・A different, non-overlapping subset of negatives was used in each training epoch.
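
A minimal sketch of tf-idf-based hard negative selection with scikit-learn (toy data; not the author's code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

question = "who wrote the declaration of independence"
negative_candidates = [
    "Thomas Jefferson was the principal author of the document.",
    "The capital of France is Paris.",
    "The declaration was signed in 1776 in Philadelphia.",
]

# rank negative candidates by tf-idf similarity to the question and keep the
# most similar ones as hard negatives
vectorizer = TfidfVectorizer().fit([question] + negative_candidates)
q_vec = vectorizer.transform([question])
cand_vecs = vectorizer.transform(negative_candidates)
sims = cosine_similarity(q_vec, cand_vecs)[0]
hard_negatives = [c for _, c in sorted(zip(sims, negative_candidates), reverse=True)][:2]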


Model

・RoBERTa-large only, with a new output layer on top of it.
・A token-level span predictor for short answers, and a binary classifier to determine whether a candidate is a long answer or not.

For the span predictor, I use a trick from XLNet: instead of predicting the start and end tokens independently, I first predict the start token, then concatenate its representation from the final encoder layer to the representations of all the tokens, and pass these concatenated representations as input to the end token predictor. This means that the prediction of the end token is conditioned on the start token, which significantly improves the quality of span prediction.
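
A minimal PyTorch sketch of this conditioning trick, using a greedy start prediction (layer names and sizes are illustrative, not the author's code):

import torch
import torch.nn as nn

class ConditionedSpanHead(nn.Module):
    """Predict the start token first, then condition the end-token predictor on
    the start token's encoder representation (the XLNet-style trick above)."""
    def __init__(self, hidden_size):
        super().__init__()
        self.start_head = nn.Linear(hidden_size, 1)
        self.end_head = nn.Linear(hidden_size * 2, 1)

    def forward(self, sequence_output):                       # (batch, seq_len, hidden)
        start_logits = self.start_head(sequence_output).squeeze(-1)
        start_idx = start_logits.argmax(dim=-1)               # greedy start prediction
        start_repr = sequence_output[torch.arange(sequence_output.size(0)), start_idx]
        start_repr = start_repr.unsqueeze(1).expand_as(sequence_output)
        end_logits = self.end_head(
            torch.cat([sequence_output, start_repr], dim=-1)).squeeze(-1)
        return start_logits, end_logits

start_logits, end_logits = ConditionedSpanHead(768)(torch.randn(2, 384, 768))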

・Yes/No answers were not included because they lowered the score.
・At prediction time, find the highest-scoring long answer candidate, and if its score is above a threshold, also predict the short span within it. The same threshold is used for both.
・The official NQ dev set was used to choose the threshold.

Training hyperparameters

・AdamW optimizer with weight decay of 0.01 and a linearly decaying learning rate with warmup for all experiments ・5 epochs, bs 48, max_lr 2e-5 ・3 epochs, bs 24, max_lr 3e-5 ・2 epochs, bs 15, max_lr 3e-5 ・Training RoBERTa-large for 1 epoch (312k training examples) takes approximately 4 hours on a single V100 GPU using mixed precision.

Ensembling

・3 models, summing the output layer logits.
・Because of the time limit, only the first N candidates were scored as long answer candidates.

Some ideas that did not quite work

・SQuAD 2.0 pretraining was not used in the end.
・My code may have had a bug when pretraining with a modified output layer architecture.
・A binary classifier for whether a candidate contains a short answer or not was added, but it did not improve CV, so it was not used. In the end it gave a slight improvement on the private test set.

osuossu8 commented 4 years ago

9th place solution

Link

https://www.kaggle.com/c/tensorflow2-question-answering/discussion/128278

Contents

Codes