Closed — goncalomcorreia closed this issue 2 years ago
Even though our method is sparse, in this work we did not focus on the potential benefits of such sparsity for the computational complexity of Transformers. Treviso et al. (2021) take our adaptively sparse transformers and address the quadratic complexity with respect to the input sequence length. To do this, they propose Sparsefinder, a method to predict the sparsity of $\alpha$-entmax in advance. They avoid computing the full score matrix by learning a projection of the queries and keys into a lower-dimensional space, making our proposed entmax attention an efficient alternative to the vanilla transformer architecture.
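As a concrete reference for the kind of sparsity involved, here is a minimal NumPy sketch of sparsemax, the closed-form $\alpha = 2$ member of the $\alpha$-entmax family (general $\alpha$ requires a sorting- or bisection-based algorithm, so this special case is only illustrative):

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (the alpha=2 case of alpha-entmax): a softmax
    alternative that can assign exactly zero probability to low scores."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]          # scores in descending order
    k = np.arange(1, len(z) + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum  # positions kept in the support
    k_z = k[support][-1]                 # support size
    tau = (cumsum[k_z - 1] - 1) / k_z    # threshold
    return np.maximum(z - tau, 0.0)      # sparse probability vector
```

For instance, `sparsemax([2.0, 1.0, -1.0])` puts all mass on the first entry, whereas softmax would spread some probability over every entry.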
They focus on inducing sparsity in the attention heads via a pruning and quantization technique. Their method operates at inference time, avoiding the need to train new models from scratch: one can take a checkpoint of a trained dense transformer and sparsify it only at inference. This inference-time sparsity turns out not to decrease the accuracy of the trained model significantly. Furthermore, when analysing the attention distributions, they find that their sparsity correlates with our analysis in Figure X, showing that our analysis also applies to the pre-trained language models they used: BERT and RoBERTa.
In this work, the authors push sparsity to the limit by proposing a hard-retrieval attention mechanism that attends to only a single token in each attention head. They apply this method to the decoder attention layers and empirically show that it is 1.43 times faster at decoding while maintaining good performance on translation tasks.
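A toy sketch of the single-token idea (the names and shapes here are my own, not the paper's): each query simply gathers the value of its best-scoring key, instead of taking a softmax-weighted mixture.

```python
import numpy as np

def hard_retrieval_attention(Q, K, V):
    """Each query attends to exactly one key: the argmax of its
    score row. This replaces the softmax mixture with a gather."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # (n_queries, n_keys)
    idx = scores.argmax(axis=-1)             # one key index per query
    return V[idx]                            # one value row per query
```

Because each output row is a single value row rather than a weighted sum, decoding avoids the full weighted aggregation over all keys.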
In this work, the authors unify several works that sparsify Transformers, including our approach, in a single framework. The framework is built from conditions on the sparsity pattern and the probability map. When these conditions are satisfied, the authors prove that the corresponding sparse transformers (our approach included) are universal approximators of any continuous sequence-to-sequence function. Furthermore, they show that sparse transformers with only O(n) connections, in contrast to the O(n^2) connections of dense transformers, retain the same universal approximation properties.
"They detect uninformative source encodings and drop them to shorten the encoding sequence before generation". Their sparsity is achieved by pruning the encoder outputs with their proposed method, $\mathcal{L}_0\text{Drop}$, which drops whole sections of the source encodings and thus leads to a significant speed-up during decoding. The authors find that the dropped tokens are often uninformative, while the retained ones tend to be relatively rare tokens.
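The pruning step can be pictured with a simple stand-in (all names here are hypothetical; the actual method learns its gates with a differentiable L0 relaxation during training, rather than taking them as an input array):

```python
import numpy as np

def prune_encoder_outputs(enc_out, gates, threshold=0.5):
    """Keep only the source positions whose gate value exceeds a
    threshold, shortening the sequence the decoder cross-attends over.
    `gates` stands in for scores a trained gating module would produce."""
    keep = gates > threshold
    return enc_out[keep], np.nonzero(keep)[0]  # pruned states + kept indices
```

The speed-up comes from the decoder's cross-attention now running over the shortened sequence instead of the full source length.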
In this work, instead of replacing the softmax with entmax, the authors replace it with the ReLU activation. Compared to our approach, ReLA (Rectified Linear Attention) achieves faster training and decoding while reaching comparable performance on translation tasks. In their analysis, they follow our methodology and find that their method yields higher JS divergence than ours, suggesting that attention heads using ReLA tend to be more diverse in where they attend at each layer. Furthermore, ReLA can also assign null attention to some queries, that is, effectively deactivate an attention head for a specific query by assigning zero weight to all tokens. Their analysis shows that the rate of null attention increases in deeper layers, which is consistent with our own finding that deeper layers contain less information.
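The core substitution can be sketched as follows (a minimal version: the full ReLA also stabilizes the unnormalized outputs, e.g. with a variant of layer normalization, which is omitted here):

```python
import numpy as np

def relu_attention(Q, K, V):
    """ReLU in place of softmax: weights are unnormalized and exactly
    zero wherever the scores are negative. A query whose scores are all
    negative gets an all-zero row, i.e. "null attention"."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.maximum(scores, 0.0)  # sparse, need not sum to 1
    return weights @ V, weights
```

Note that, unlike softmax (or entmax), nothing forces a weight row to sum to one, which is exactly what makes null attention possible.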
Inspired by our analysis, along with other works (cite here) showing that encoder self-attention heads learn fixed patterns (ref section), this work fixes the attention patterns of all but one head in each layer, letting a single head per layer learn its attention. The fixed patterns include the positional heads that focus on the current, previous, and next token, and the BPE-merging head that we found. These simplifications drastically decrease the model's parameter footprint, and the authors show empirically that translation quality does not drop significantly; in low-resource scenarios, this approach even improves translation performance.
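One such fixed positional head can be written down directly as a parameter-free attention matrix (the clamping at sequence boundaries is my own choice for the sketch):

```python
import numpy as np

def fixed_positional_pattern(n, offset):
    """Attention matrix where position i attends fully to position
    i + offset (-1 = previous token, 0 = current, +1 = next).
    Out-of-range targets are clamped to the sequence edge."""
    A = np.zeros((n, n))
    for i in range(n):
        j = min(max(i + offset, 0), n - 1)
        A[i, j] = 1.0
    return A
```

Since such a head has no learned parameters at all, replacing most heads with patterns like this is what shrinks the parameter footprint.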
📝 Description
A section of this chapter will be dedicated to research papers that cited the current chapter's paper. We'll need to go over all the papers that cite this chapter's paper, decide which ones to mention, and write a section about them where we highlight the connections.
Inspiration: Schlichtkrull thesis.
☑️ Tasks