osuossu8 commented 4 years ago

Paper link

https://arxiv.org/abs/2001.04451

Publication date (yyyy/mm/dd)

2020/01/13

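Summary

・Proposes two techniques to improve the efficiency of the computationally expensive Transformer
    ・Dot-product attention → locality-sensitive hashing (LSH) attention, reducing the complexity from O(L^2) to O(L log L), where L is the sequence length
    ・Standard residuals → reversible residuals, so activations need to be stored only once during training instead of N times, where N is the number of layers
・The resulting model is called Reformer
    ・Performs on par with a standard Transformer
    ・Memory-efficient
    ・Very efficient on long sequences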

Official implementation

・https://github.com/google/trax/tree/master/trax/models/reformer

( cf : https://tksmml.hatenablog.com/entry/2020/01/26/181500 )

Third-party PyTorch implementations

・https://github.com/lucidrains/reformer-pytorch/tree/master/reformer_pytorch

・https://github.com/zbloss/reformer_lm/tree/master/reformer_lm

osuossu8 commented 4 years ago

Order of computational complexity (big-O notation) : https://qiita.com/drken/items/872ebc3a2b5caaa4a0d0

Locality-sensitive hashing (LSH) : https://ja.wikipedia.org/wiki/%E5%B1%80%E6%89%80%E6%80%A7%E9%8B%AD%E6%95%8F%E5%9E%8B%E3%83%8F%E3%83%83%E3%82%B7%E3%83%A5

Dot-product attention : http://deeplearning.hatenablog.com/entry/transformer
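
As a quick reminder of what dot-product attention computes, here is a minimal sketch in plain Python for a single query. The function name `attend` and the toy keys and values are made up for illustration; scaling by sqrt(d_k) follows the usual Transformer formulation.

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector (illustrative helper)."""
    d_k = len(query)
    # Similarity of the query to every key, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Numerically stable softmax over the scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the attention-weighted average of the value vectors.
    dim_v = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attend([5.0, 0.0], keys, values)  # query aligned with the first key
```

Because the query points along the first key, most of the weight goes to the first value vector.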

osuossu8 commented 4 years ago

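What is the advance over prior work?

・Conventional Transformers consume a very large amount of memory
    ・An N-layer model needs N times the memory of a single-layer model, because activations must be stored for backpropagation
    ・Since the depth d_ff of intermediate feed-forward layers is often much larger than the depth d_model of attention activations, it accounts for a large fraction of memory use.
    ・Attention over a sequence of length L is O(L^2) in both computation and memory, so even a single 64K-token sequence can use up all available memory

・What Reformer improves
    ・Reversible residuals (http://papers.nips.cc/paper/6816-the-reversible-residual-network-backpropagation-without-storing-activations) remove the factor of N (number of layers) from the activation memory
    ・Splitting the activations inside the ff layers and processing them in chunks saves memory in the ff layers
    ・Using locality-sensitive hashing in the attention layer reduces the O(L^2) complexity to O(L log L), so long sequences can be processed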

(p1)
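
A rough back-of-the-envelope check of the 64K-token claim. This sketch assumes a batch size of 1, a single attention head, and 32-bit floats for the L × L score matrix. The L log L comparison uses a base-2 log and ignores constant factors, so it is only indicative.

```python
import math

L = 64 * 1024            # sequence length: 64K tokens
BYTES_PER_FLOAT32 = 4    # assumption: scores are held as 32-bit floats

# Full attention materializes an L x L score matrix.
full_entries = L * L
full_gib = full_entries * BYTES_PER_FLOAT32 / 2**30

# Growth rate of L log L for comparison (constants ignored).
lsh_entries = L * math.log2(L)
ratio = full_entries / lsh_entries

print(f"full attention matrix: {full_gib:.0f} GiB")
print(f"L^2 / (L log2 L): {ratio:.0f}x")
```

Under these assumptions the score matrix alone takes 16 GiB, and L^2 is 4096 times larger than L log2 L at this length.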

osuossu8 commented 4 years ago

LOCALITY-SENSITIVE HASHING ATTENTION

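・Dot-product attention
    ・The standard attention used in the Transformer

・Multi-head attention
    ・Splits into h heads partway through, processes them in parallel, and concatenates the results at the end

・Memory-efficient attention
    ・Attention can be computed separately for each query q_i
    ・Less efficient, but memory use grows only in proportion to the sequence length

・Locality sensitive hashing (https://arxiv.org/abs/1509.02897)
    ・The problem of quickly finding nearest neighbors in a high-dimensional space can be solved with LSH
    ・Each vector x is assigned a hash h(x)
    ・Nearby vectors should get the same hash with high probability
    ・Hash buckets should be of similar size with high probability
    ・define h(x) = arg max([xR; −xR]) where [u; v] denotes the concatenation of two vectors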

・LSH attention
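
The hash h(x) = arg max([xR; −xR]) from the list above can be sketched in plain Python. This assumes R is a random Gaussian matrix of shape [d_k, b/2] for b buckets, which is my reading of the paper. The helper names `make_projection` and `lsh_hash` are made up for this note and are not taken from any of the implementations linked above.

```python
import random

def make_projection(d_k, n_buckets, seed=0):
    """Random Gaussian matrix R of shape [d_k, n_buckets // 2] (assumed shape)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_buckets // 2)] for _ in range(d_k)]

def lsh_hash(x, R):
    """h(x) = argmax([xR; -xR]): index of the largest entry of the concatenation."""
    n_cols = len(R[0])
    xR = [sum(x[i] * R[i][j] for i in range(len(x))) for j in range(n_cols)]
    scores = xR + [-s for s in xR]  # concatenation [xR; -xR]
    return max(range(len(scores)), key=scores.__getitem__)

R = make_projection(d_k=4, n_buckets=8)
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]      # same direction as a
c = [-1.0, -2.0, -3.0, -4.0]  # opposite direction to a

bucket_a, bucket_b, bucket_c = lsh_hash(a, R), lsh_hash(b, R), lsh_hash(c, R)
```

Vectors pointing in the same direction (a and b) land in the same bucket, while the opposite vector c lands in the mirrored bucket. That is the sense in which the hash depends on direction rather than magnitude.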

osuossu8 commented 4 years ago

How was effectiveness verified? @ LOCALITY-SENSITIVE HASHING ATTENTION

・Task : duplicate a sequence of symbols
・Compare accuracy at the same loss
・Accuracy drops when evaluating with 1 or 2 hashes, and reaches 100% with 8 hashes

(p5)
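
For intuition about the duplication task, here is a small sketch of how one training example might be generated. It assumes sequences of the form 0 w 0 w with 0 as a separator, which is how I recall the paper describing the task. The helper `make_example` and the tiny sizes are illustrative only.

```python
import random

def make_example(half_len, vocab_size, rng):
    """Build one sequence of the form 0 w 0 w (assumed task format)."""
    # w uses symbols 1..vocab_size-1; 0 is reserved as the separator.
    w = [rng.randrange(1, vocab_size) for _ in range(half_len - 1)]
    return [0] + w + [0] + w

rng = random.Random(0)
seq = make_example(half_len=8, vocab_size=16, rng=rng)
half = len(seq) // 2
```

To predict the second half, a model has to attend back to the matching position in the first half, which makes the task a direct probe of whether attention can find the right distant token.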

osuossu8 commented 4 years ago

REVERSIBLE TRANSFORMER

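・RevNets (Gomez et al. (2017))
    ・The activations of a layer can be recovered from the activations of the following layer
    ・y1 = x1 + F(x2), y2 = x2 + G(y1)
    ・Inverting: x2 = y2 − G(y1), x1 = y1 − F(x2)

・Reversible Transformer
    ・Y1 = X1 + Attention(X2), Y2 = X2 + FeedForward(Y1)
    ・Each layer's input can be recomputed from its output, so activations do not need to be stored for every layer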

・Chunking, large batches and parameter reuse
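
The reversible update y1 = x1 + F(x2), y2 = x2 + G(y1) can be checked numerically. This is a minimal sketch with simple element-wise stand-ins for F and G (hypothetical; in the Reversible Transformer they would be Attention and FeedForward).

```python
def F(x):
    return [0.5 * v + 1.0 for v in x]  # stand-in for Attention

def G(x):
    return [v * v for v in x]          # stand-in for FeedForward

def forward(x1, x2):
    y1 = [a + b for a, b in zip(x1, F(x2))]
    y2 = [a + b for a, b in zip(x2, G(y1))]
    return y1, y2

def inverse(y1, y2):
    # Undo the two residual additions in reverse order.
    x2 = [a - b for a, b in zip(y2, G(y1))]
    x1 = [a - b for a, b in zip(y1, F(x2))]
    return x1, x2

x1, x2 = [1.0, 2.0], [3.0, 4.0]
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)  # inputs recovered from outputs alone
```

F and G do not need to be invertible themselves. The inverse only re-evaluates them, which is why the inputs can be recomputed during backpropagation instead of being stored.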

(p6)
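
Chunking works because the feed-forward layer acts on each position independently, so the wide intermediate activation only has to exist for one chunk at a time. This is a minimal sketch with a toy position-wise layer; the weights, the `expand` factor, and the function names are all made up for illustration.

```python
def feed_forward(position, expand=4):
    """Toy position-wise layer: expand to a wider hidden vector, then project back."""
    # Hidden width is expand * len(position), mimicking d_ff > d_model.
    hidden = [max(0.0, v * (k + 1)) for v in position for k in range(expand)]
    return [sum(hidden[i * expand:(i + 1) * expand]) for i in range(len(position))]

def ff_full(sequence):
    return [feed_forward(p) for p in sequence]

def ff_chunked(sequence, chunk_size):
    out = []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        # Only this chunk's hidden activations are alive at any one time.
        out.extend(feed_forward(p) for p in chunk)
    return out

sequence = [[float(i), float(-i)] for i in range(10)]
same = ff_chunked(sequence, chunk_size=3) == ff_full(sequence)
```

The chunked and full results are identical, so the chunk size trades peak memory against the number of sequential steps without changing the output.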

osuossu8 / paper-reading

[2020] Reformer: The Efficient Transformer #5
