Paper link pdf
Nitish Shirish Keskar∗, Bryan McCann∗, Lav R. Varshney, Caiming Xiong, Richard Socher
Introduction
We release CTRL, a 1.6 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior.
For example, large resources like Wikipedia, Project Gutenberg, and Amazon Reviews can each be assigned a domain-related control code. Smaller resources, like the content extracted from individual subreddits, often occur with both a broader domain name, reddit, as well as subdomain information, r/subdomain. In the vast majority of cases, text collected for training is associated with a URL, which often contains information pertinent to the text it represents.
Language Modeling with CTRL
Architecturally, CTRL is a standard (very large) transformer language model built on multi-head attention; what distinguishes it is that it is trained to condition on control codes prepended to its input sequences.
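As a refresher on the building block mentioned here, below is a minimal numpy sketch of causal multi-head self-attention. The shapes, weight names, and single-layer setup are illustrative assumptions, not CTRL's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """X: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project and split into heads: (n_heads, seq_len, d_head)
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: a language model may only attend to earlier positions.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    out = softmax(scores) @ V                       # (n_heads, seq_len, d_head)
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo
```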
3.2 Experimental Settings
We learn BPE (Sennrich et al., 2015) codes and tokenize the data using fastBPE, but we use a large vocabulary of roughly 250K tokens. This includes the sub-word tokens necessary to mitigate problems with rare words, but it also reduces the average number of tokens required to generate long text by including most common words.
BPE: Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations).
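To make the BPE idea concrete, here is a toy version of the merge-learning loop from Sennrich et al. (2015); the tiny corpus and the number of merges are made up for illustration, whereas CTRL learns its roughly 250K-token vocabulary with fastBPE on the full training data.

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {'space-separated word': frequency} dict."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def apply_merge(pair, vocab):
    """Merge the chosen pair into a single symbol everywhere it occurs."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# Words are sequences of characters plus an end-of-word marker.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2, 'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
for _ in range(10):                      # 10 merges for illustration
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = apply_merge(best, vocab)
    print(best)                          # the learned merge, e.g. ('e', 's')
```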
Each sequence originated from a domain, and it has the corresponding domain control code prepended as the first token in the sequence. In this way, domain control codes receive special treatment. They are propagated to all text in the domain as the first token. This is similar to how codes and natural language sequences have been used in multi-task settings (Wu et al., 2016; Johnson et al., 2017; McCann et al., 2018) to control conditional language models. All other control codes are injected into the data without such special treatment.
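A hypothetical preprocessing step illustrating this; the function and the code names below are placeholders assumed for the sketch, not CTRL's actual data pipeline.

```python
def build_training_sequence(text, domain_code, tokenize):
    # The domain control code becomes the first token of every sequence from
    # that domain; finer-grained codes (e.g. r/subdomain, URLs) stay inline.
    return [domain_code] + tokenize(text)

toy_tokenize = lambda s: s.split()
print(build_training_sequence("r/askscience Why is the sky blue?", "reddit", toy_tokenize))
# ['reddit', 'r/askscience', 'Why', 'is', 'the', 'sky', 'blue?']
```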
Sampling
Temperature-controlled sampling. Given token scores x_i and temperature T, the next-token probabilities are

p_i = exp(x_i / T) / Σ_j exp(x_j / T)    (1)
The next token is then chosen by sampling through a multinomial distribution with probabilities p_i clipped at the top-k tokens. In the equation above, T → 0 approximates a greedy distribution which magnifies the peaks in the probability distribution, while T → ∞ flattens the distribution to make it more uniform. Rather than choosing a fixed value of k, as is common practice, Holtzman et al. (2019) suggested adapting k heuristically. The nucleus sampling approach chooses a probability threshold p_t and sets k to be the lowest value such that ∑_i sort(p_i) > p_t. If the model is confident in its next-word prediction, then k will be lower, and vice versa. Despite the improved generative capabilities of models with such heuristics, there still exists a trade-off between these parameters depending on the intended generation.
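A sketch of these decoding knobs, assuming numpy and a vector of next-token scores x; the function name and defaults are illustrative choices, not the paper's code.

```python
import numpy as np

def sample_next_token(x, temperature=1.0, top_k=0, top_p=0.0, rng=np.random):
    """x: array of scores over the vocabulary for the next position."""
    z = (x - x.max()) / temperature          # equation 1, numerically stable
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(probs)[::-1]          # tokens sorted by probability
    if top_k > 0:                            # fixed-k clipping
        keep = order[:top_k]
    elif top_p > 0.0:                        # nucleus sampling: adapt k to p_t
        cumulative = np.cumsum(probs[order])
        k = int(np.searchsorted(cumulative, top_p)) + 1
        keep = order[:k]
    else:
        keep = order
    clipped = np.zeros_like(probs)
    clipped[keep] = probs[keep]
    clipped /= clipped.sum()                 # renormalize over the kept tokens
    return int(rng.choice(len(probs), p=clipped))
```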
Given a prompt, Q: What is the capital of Australia?, a well-trained model assigns higher probability mass to the correct answer, Canberra, but a non-zero probability mass to other cities such as Melbourne, Sydney, Brisbane, Darwin, and Perth (see Figure 1). By choosing to sample, we mistrust the model, despite it being correct. A natural solution to this is to choose the next token greedily. However, this is known to create repetitions of phrases or sentences even for large, well-trained models.
Penalized sampling addresses this by discounting the scores of previously generated tokens. The motivation is similar to coverage mechanisms (See et al., 2017) and other losses designed to discourage repetition (Welleck et al., 2019), but penalized sampling is not used during training. Given a list of generated tokens g, using the notation from equation 1, the probability distribution p_i for the next token is defined as:

p_i = exp(x_i / (T · I(i ∈ g))) / Σ_j exp(x_j / (T · I(j ∈ g))),    where I(c) = θ if c ∈ g else 1    (2)
We find that greedy sampling with θ ≈ 1.2 yields a good balance between truthful generation and lack of repetition. Note that θ = 1 is equivalent to equation 1. We note in passing that this approach succeeds only if the model has learned a sufficiently reliable distribution.
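A minimal numpy sketch of equation 2 with greedy decoding as recommended above; the function name is made up, and dividing the scores of already-generated tokens by θ follows the equation as written, as an illustrative reading rather than an official implementation.

```python
import numpy as np

def penalized_next_token(x, generated, temperature=1.0, theta=1.2):
    """x: next-token scores; generated: token ids produced so far (the set g)."""
    I = np.ones_like(x, dtype=float)
    if generated:
        I[list(set(generated))] = theta      # I(c) = theta for c in g, else 1
    z = x / (temperature * I)                # equation 2; theta = 1 recovers equation 1
    z = z - z.max()                          # numerical stability before the softmax
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.argmax(probs))             # greedy pick, as used with theta ≈ 1.2
```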