Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast—Choose Three

soyamash / read_paper

memo for NLP paper



Open soyamash opened 3 years ago

soyamash commented 3 years ago

For calibrating neural structured prediction models, the usual isotonic regression and temperature scaling cannot be applied as-is. The ensemble distillation proposed in this paper achieves calibration without lowering prediction accuracy. https://www.aclweb.org/anthology/2020.emnlp-main.450/

soyamash commented 3 years ago

Train several teacher models and distill their ensembled predictions into a student model. This raises the calibration score along with accuracy (and, unlike post-hoc methods, it needs no held-out data for calibration).
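The idea in the comment above can be sketched for a single prediction step as follows. This is a minimal illustration, not the paper's code: it assumes the ensemble is a uniform average of the teachers' probabilities and that the student is trained with cross entropy against that average. All names (`ensemble_distribution`, `distillation_loss`) and the toy logits are made up for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def ensemble_distribution(teacher_logits):
    """Uniformly average the teachers' probabilities (assumption: not their logits)."""
    probs = [softmax(logits) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[i] for p in probs) / n for i in range(len(probs[0]))]

def distillation_loss(student_logits, target_probs):
    """Cross entropy between the ensemble distribution and the student's softmax."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(q * (s - log_z) for q, s in zip(target_probs, student_logits))

# Toy example: three teachers, three labels (values are illustrative only).
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [2.5, 0.0, -1.5]]
target = ensemble_distribution(teachers)
loss = distillation_loss([1.0, 0.2, -0.3], target)
```

The loss is smallest when the student's softmax matches `target` exactly, so the student is pushed toward the ensemble's soft distribution rather than toward a one-hot label.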

soyamash commented 3 years ago

NER results. IID and CRF refer to the teacher models; in both cases the student is distilled in IID form.

soyamash commented 3 years ago

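Machine translation results. The vocabulary has 32k subwords, and distilling from the full probability distribution over it is computationally inefficient, so only the top-k probabilities at each time step t are stored and the student is trained to imitate them. The experiments compare training with label smoothing (LS) against plain cross entropy. (Ensemble distillation with LS is not reported because its calibration degraded markedly.) ECE-1 and ECE-5 are ECE scores over the top-k next-token predictions.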
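The top-k storage trick can be sketched for one time step as follows. This is a rough illustration under assumptions: the stored top-k probabilities are renormalized to sum to 1 (the memo does not say whether the paper does this), and the function names are hypothetical.

```python
import math

def top_k_targets(probs, k):
    """Keep the k most probable token ids with their probabilities.

    Assumption: the kept probabilities are renormalized to sum to 1.
    """
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}

def truncated_distillation_loss(student_logits, targets):
    """Cross entropy of the student against the sparse top-k target."""
    m = max(student_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in student_logits))
    return -sum(q * (student_logits[i] - log_z) for i, q in targets.items())

# Toy 5-token vocabulary standing in for the 32k subwords.
ensemble_probs = [0.5, 0.2, 0.15, 0.1, 0.05]
targets = top_k_targets(ensemble_probs, k=2)
step_loss = truncated_distillation_loss([1.2, 0.3, 0.1, -0.4, -1.0], targets)
```

Storing `k` (id, probability) pairs per time step instead of a full 32k-entry vector is what makes saving the teacher outputs practical.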

As shown in prior work, LS slightly improves accuracy and calibration, but it worsens the results when the models are ensembled.
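For reference, the standard binned ECE for the top-1 case can be sketched as below. This is the textbook definition with equal-width bins, not necessarily the exact variant used in the paper; the bin count of 10 is an assumption, and the ECE-5 extension is not covered because the memo does not spell it out.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: top-1 probability of each prediction, in [0, 1].
    correct: whether each top-1 prediction matched the reference.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# Over-confident toy model: 90% confidence but only 3 of 4 predictions correct.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
```

A lower value means the predicted confidences track the observed accuracy more closely.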

soyamash commented 3 years ago

How the scores change as the number of tokens used for distillation in translation is varied.

soyamash commented 3 years ago

Comparison against building the ensemble from snapshots taken during the training of a single model. Properly training multiple separate models looks better.

soyamash commented 3 years ago

Temperature scaling looks good when combined with Ensemble Distillation.
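As a reminder of what temperature scaling does, here is a minimal per-prediction sketch. It assumes held-out logits and labels are available and fits the temperature by a simple grid search over negative log-likelihood. The grid range, the function names, and the toy data are all illustrative, and the paper's setup for structured outputs may differ.

```python
import math

def nll(logits_batch, labels, temperature):
    """Mean negative log-likelihood after dividing logits by the temperature."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total -= scaled[y] - log_z
    return total / len(labels)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature from a grid (assumed 0.5 to 4.0) with the lowest NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]
    return min(grid, key=lambda t: nll(logits_batch, labels, t))

# Toy held-out set: the model is very sure of class 0 but is right 3 times out of 4.
heldout_logits = [[4.0, 0.0]] * 4
heldout_labels = [0, 0, 0, 1]
temperature = fit_temperature(heldout_logits, heldout_labels)
```

A fitted temperature above 1 softens over-confident predictions without changing the argmax, so accuracy stays the same while the confidences move closer to the observed accuracy.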