[48] SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive Connection

TL;DR

task : efficient Transformer -> Machine Translation, Language Modeling, Representation Learning on Graphs, Image Classification
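problem : self-attention costs $O(n^2)$ in the sequence length $n$, which is inefficient
idea : treat the input sequence as a graph and compute attention only between connected nodes
architecture : an LSTM predicts target edges given a source node; self-attention is then performed only over the predicted edges
objective : ground-truth edges are unknown, so the edge predictor is trained with policy gradient, using the performance obtained after self-attention as the reward. The self-attention part is trained with the loss appropriate to each task.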
baseline : Transformer, Sparse Graph Attention Networks, Reformer
data : newstest2013(WMT), Enwiki8/Text8(LM), CIFAR100/ImageNet(Image Classification)
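result : performance comparable to SOTA, with a large reduction in memory cost.
contribution : replaces the Transformer's quadratic all-pairs attention with attention over a graph
limitation or unclear points : training looks very tricky. Wouldn't edge prediction with the LSTM add a lot of latency?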
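
The idea of attending only along edges can be illustrated with a minimal sketch. This is not the authors' implementation: the edge list is supplied by hand rather than predicted by an LSTM, and the function name `edge_attention` and its arguments are hypothetical. It only shows that the work grows with the number of edges instead of with $n^2$.

```python
import math

def edge_attention(q, k, v, edges):
    """Scaled dot-product attention restricted to a directed edge list.

    q, k, v: lists of n vectors (lists of floats).
    edges:   list of (source, target) pairs; node `source` attends to `target`.
    Assumption for this sketch: a node with no outgoing edge gets a zero vector.
    """
    n, d = len(q), len(q[0])
    dv = len(v[0])

    # Group targets by source node.
    neighbors = [[] for _ in range(n)]
    for src, dst in edges:
        neighbors[src].append(dst)

    out = []
    for i in range(n):
        js = neighbors[i]
        if not js:
            out.append([0.0] * dv)
            continue
        # Scores are computed only for connected pairs (i, j).
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in js]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * v[j][c] for e, j in zip(exps, js)) for c in range(dv)])
    return out

# Usage: node 0 and node 1 attend only to themselves, node 2 attends to nodes 0 and 1.
q = k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0], [2.0], [3.0]]
edges = [(0, 0), (1, 1), (2, 0), (2, 1)]
result = edge_attention(q, k, v, edges)
```

In this example node 2 scores nodes 0 and 1 equally, so its output is the average of their values. With every pair `(i, j)` present in `edges`, the function reduces to ordinary dense self-attention.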