goncalomcorreia / phd-thesis-goncalomcorreia

The LaTeX of Gonçalo M. Correia's PhD thesis. The "Making Of" of this thesis can be checked in Issues/Projects.

Chapter 4: Research and Write a draft about Subsequent Work #6

Closed goncalomcorreia closed 2 years ago

goncalomcorreia commented 2 years ago

📝 Description

A section of this chapter will be dedicated to research papers that cite this chapter's paper. We'll need to go over all of those papers, decide which ones to mention, and write a section about them that highlights the connections.

Inspiration: Schlichtkrull thesis.

☑️ Tasks

goncalomcorreia commented 2 years ago

- medical
- Efficient
- Long-range
- Interpretability
- particularly connected

goncalomcorreia commented 2 years ago

Predicting Attention Sparsity in Transformers, arxiv

Even though our method is sparse, in this work we did not focus on the potential benefits of that sparsity for the computational complexity of Transformers. Treviso et al. (2021) take our adaptively sparse transformers and address their quadratic complexity with respect to the input sequence length. To do this, they propose Sparsefinder, a method that predicts the sparsity pattern of $\alpha$-entmax in advance. They avoid computing the full score matrix by learning a projection of the queries and keys into a lower-dimensional space, making our proposed entmax attention an efficient alternative to the vanilla transformer architecture.
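As a minimal sketch of the property Sparsefinder exploits (assuming the `entmax` package from PyPI and PyTorch as dependencies), $\alpha$-entmax attention assigns exactly zero weight to many keys, unlike softmax:

```python
# Minimal sketch: alpha-entmax attention yields exact zeros, which is the
# property Sparsefinder-style methods exploit to skip score computations.
# Assumes the `entmax` package (pip install entmax) and PyTorch.
import torch
from entmax import entmax15  # alpha = 1.5

torch.manual_seed(0)
scores = torch.randn(1, 8)          # attention scores of one query over 8 keys

p_softmax = torch.softmax(scores, dim=-1)
p_entmax = entmax15(scores, dim=-1)

print("softmax:", p_softmax)        # strictly positive everywhere
print("entmax15:", p_entmax)        # several entries are exactly 0.0
print("nonzero keys:", (p_entmax > 0).sum().item(), "of", scores.shape[-1])
```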

On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers, ACL-IJCNLP 2021

This work induces sparsity in the attention heads via a pruning and quantization technique applied at inference time, which avoids having to train new models from scratch: one can take a checkpoint of a trained dense transformer and make it sparse only at inference time. Applying this sparsity turns out not to decrease the accuracy of the trained model significantly. Furthermore, when analysing the attention distributions, they find that the sparsity they observe is consistent with our analysis in Figure X, showing that our analysis also applies to the pre-trained language models they used: BERT and RoBERTa.
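A hypothetical sketch of sparsifying an already-trained model at application time is shown below; it only illustrates the general idea of thresholding small attention weights and is not the authors' exact pruning/quantization procedure:

```python
# Hypothetical sketch of inference-time attention sparsification: zero out
# attention weights below a threshold and renormalize. Not the paper's exact
# pruning/quantization method; it only illustrates sparsifying a dense,
# already-trained model at application time.
import torch

def prune_attention(probs: torch.Tensor, threshold: float = 0.05) -> torch.Tensor:
    """Zero out small attention weights and renormalize the remaining ones."""
    pruned = torch.where(probs >= threshold, probs, torch.zeros_like(probs))
    return pruned / pruned.sum(dim=-1, keepdim=True).clamp_min(1e-9)

torch.manual_seed(0)
scores = torch.randn(2, 4, 6, 6)               # (batch, heads, queries, keys)
dense = torch.softmax(scores, dim=-1)          # attention of a trained dense model
sparse = prune_attention(dense)
print("fraction of zeroed weights:", (sparse == 0).float().mean().item())
```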

Learning Hard Retrieval Decoder Attention for Transformers, EMNLP 2021

In this work, the authors push sparsity to the limit by proposing a hard-retrieval attention mechanism that attends to only a single token in each attention head. They apply this method to the decoder attention layers and empirically show that it is 1.43 times faster at decoding while maintaining good performance on translation tasks.
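An illustrative sketch of the single-token idea (simplified; the training-time relaxation used in the paper is not shown, and the function name is ours):

```python
# Illustrative sketch of hard-retrieval attention: each query selects the key
# with the highest score and copies that single token's value, instead of
# taking a softmax-weighted average over all tokens.
import torch

def hard_retrieval_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5    # (b, h, n, n)
    idx = scores.argmax(dim=-1)                               # best key per query
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, v.shape[-1])   # (b, h, n, d)
    return torch.gather(v, dim=2, index=idx)                  # one token per query

torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 5, 8) for _ in range(3))
out = hard_retrieval_attention(q, k, v)
print(out.shape)   # torch.Size([1, 2, 5, 8])
```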

O(n) Connections are Expressive Enough, NeurIPS 2020

In this work, the authors unify several works that sparsify Transformers, including our approach, into a single framework. This framework is built on conditions over the sparsity pattern and the probability map. When these conditions are satisfied, the authors prove that such sparse transformers (ours among them) are universal approximators of any continuous sequence-to-sequence function. Furthermore, they show that these sparse transformers retain the same universal approximation properties even with only $O(n)$ connections, in contrast to the $O(n^2)$ connections of dense transformers.
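To make "$O(n)$ connections" concrete, here is one example of such a sparsity pattern, a fixed-width sliding window; this is only an illustration and does not reproduce the paper's formal conditions on sparsity patterns and probability maps:

```python
# Sketch of one O(n)-connection sparsity pattern: a fixed-width sliding window,
# so each of the n queries attends to at most (2w + 1) keys instead of all n.
import torch

def sliding_window_mask(n: int, w: int) -> torch.Tensor:
    """Boolean (n, n) mask where entry (i, j) is True iff |i - j| <= w."""
    idx = torch.arange(n)
    return (idx[:, None] - idx[None, :]).abs() <= w

n, w = 10, 2
mask = sliding_window_mask(n, w)
print("connections:", mask.sum().item(), "vs dense:", n * n)  # O(n*w) vs O(n^2)
```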

On Sparsifying Encoder Outputs, ACL-IJCNLP 2021

"They detect uninformative source encodings and drop them to shorten the encoding sequence before generation". Their sparsity is achieved by pruning the encoder outputs using their proposed method $\mathcal{L}_{0}DROP$. This leads to a significant speed-up during decoding. It drops whole sections from the source encodings to speed up decoding, and the authors find that the tokens that end up being dropped are often uninformative tokens, while the retained ones are relatively rare.

Sparse Attention with Linear Units, EMNLP 2021

In this work, instead of replacing the softmax with entmax, the authors replace it with the ReLU activation. Compared to our approach, ReLA (Rectified Linear Attention) obtains faster training and decoding, while achieving comparable performance on translation tasks. In their analysis, they follow our methodology and find that their method has a higher JS divergence than ours, suggesting that attention heads using ReLA tend to be more diverse in where they attend at each layer. Furthermore, ReLA is also able to assign null attention to some queries, that is, to effectively deactivate an attention head for a specific query by assigning zero weight to all tokens. Their analysis shows that the rate of null attention increases in deeper layers, which is consistent with our own finding that deeper layers contain less information.
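A simplified sketch of the core substitution (the published ReLA also adds a normalization step on the outputs, omitted here):

```python
# Simplified sketch of rectified linear attention: the softmax is replaced by a
# ReLU over the scaled scores, so many weights become exactly zero and a query
# can even receive all-zero ("null") attention.
import torch

def relu_attention(q, k, v):
    # q, k, v: (batch, heads, seq, dim)
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    weights = torch.relu(scores)            # unnormalized, possibly all-zero rows
    return weights @ v, weights

torch.manual_seed(0)
q, k, v = (torch.randn(1, 2, 5, 8) for _ in range(3))
out, w = relu_attention(q, k, v)
null_rate = (w.sum(dim=-1) == 0).float().mean().item()
print("fraction of queries with null attention:", null_rate)
```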

Fixed Encoder Self-Attention Patterns, EMNLP 2020

Inspired by our analysis, along with other works (cite here) showing that encoder self-attention heads learn fixed patterns (ref section), this work fixes the attention pattern of all but one head in each layer, letting only a single head per layer learn its attention. The fixed patterns include the positional heads that focus on the current, previous, and next token, as well as the BPE-merging head that we found. These simplifications drastically reduce the parameter footprint of the model, and the authors show empirically that translation quality does not drop significantly; in low-resource scenarios, this approach even improves translation performance.
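A sketch of the positional patterns mentioned above, which need no learned parameters (simplified; the paper also uses other fixed patterns and keeps one learnable head per layer, and the helper name is ours):

```python
# Sketch of fixed positional attention patterns: heads that deterministically
# attend to the previous, current, or next token.
import torch

def positional_pattern(n: int, offset: int) -> torch.Tensor:
    """(n, n) attention matrix where token i attends to token i + offset,
    clamped to the sequence boundaries."""
    attn = torch.zeros(n, n)
    targets = (torch.arange(n) + offset).clamp(0, n - 1)
    attn[torch.arange(n), targets] = 1.0
    return attn

n = 5
prev_head = positional_pattern(n, -1)   # attend to previous token
curr_head = positional_pattern(n, 0)    # attend to current token
next_head = positional_pattern(n, +1)   # attend to next token
print(prev_head)
```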