1) Train general-domain LM (in the paper on Wikitext-103, ~103M words)
3-layer LSTM (AWD-LSTM)
two models are trained - forward & backward - the classifier prediction is the average of both (averaging sketch below)
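A minimal sketch of the forward/backward averaging, assuming PyTorch; both classifier models and their call interface are hypothetical:

```python
import torch

def bidir_predict(fwd_model, bwd_model, tokens):
    # tokens: (batch, seq_len) token ids; the backward model reads the
    # sequence reversed; both models are assumed to return class logits
    with torch.no_grad():
        p_fwd = torch.softmax(fwd_model(tokens), dim=-1)
        p_bwd = torch.softmax(bwd_model(tokens.flip(1)), dim=-1)
    return (p_fwd + p_bwd) / 2  # averaged class probabilities
```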
2) Target task LM fine-tuning (on the task dataset)
discriminative fine-tuning: use different learning rates on different layers, smaller LR on lower layers (closer to the input); empirically determined LR(l-1) = LR(l) / 2.6
slanted triangular learning rates (STLR): linearly increase the LR at the beginning, then linearly decrease it (sketch of both after this step)
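A minimal sketch of both tricks, assuming PyTorch-style parameter groups; the function names are mine, the constants (2.6, cut_frac=0.1, ratio=32) follow the paper:

```python
def discriminative_lrs(layers, base_lr=0.01, decay=2.6):
    # one parameter group per layer; each lower layer gets the LR of the
    # layer above divided by `decay` (2.6 in the paper)
    # `layers` is ordered from input side to output side
    n = len(layers)
    return [{"params": layer.parameters(),
             "lr": base_lr / decay ** (n - 1 - l)}
            for l, layer in enumerate(layers)]

def stlr(step, total_steps, max_lr=0.01, cut_frac=0.1, ratio=32):
    # slanted triangular LR: short linear warm-up (first cut_frac of
    # training), then a long linear decay down to max_lr / ratio
    cut = int(total_steps * cut_frac)
    if step < cut:
        p = step / cut
    else:
        p = 1 - (step - cut) / (cut * (1 / cut_frac - 1))
    return max_lr * (1 + p * (ratio - 1)) / ratio
```

In the paper the two are combined: each layer keeps its own maximum LR, and the slanted-triangular shape scales all of them over the course of training.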
3) Target task classifier fine-tuning
as input for the classifier take the concatenation [h(T), max(h(1) ... h(T)), mean(h(1) ... h(T))] where h(i) is the hidden state from processing the i-th word (pooling sketch after this step)
classifier architecture: two linear blocks (each batch norm, dropout, linear), ReLU activation after the first, softmax output after the second
gradual unfreezing: in the first epoch only the last layer is trained, then each subsequent epoch unfreezes the next lower layer (unfreezing sketch below)
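A minimal PyTorch sketch of the concat-pooled input plus the two-block head (module name, dims, and dropout value are mine):

```python
import torch
import torch.nn as nn

class ConcatPoolHead(nn.Module):
    # takes the LM's hidden states over time, pools them, classifies
    def __init__(self, hidden_dim, inner_dim, n_classes, p=0.1):
        super().__init__()
        d = 3 * hidden_dim  # [h_T ; maxpool ; meanpool]
        self.block1 = nn.Sequential(nn.BatchNorm1d(d), nn.Dropout(p),
                                    nn.Linear(d, inner_dim), nn.ReLU())
        self.block2 = nn.Sequential(nn.BatchNorm1d(inner_dim), nn.Dropout(p),
                                    nn.Linear(inner_dim, n_classes))

    def forward(self, hidden):            # hidden: (batch, T, hidden_dim)
        h_last = hidden[:, -1]            # h(T)
        h_max = hidden.max(dim=1).values  # element-wise max over time
        h_mean = hidden.mean(dim=1)       # mean over time
        x = torch.cat([h_last, h_max, h_mean], dim=1)
        return self.block2(self.block1(x))  # logits; softmax/NLL on top
```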
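And a small sketch of gradual unfreezing, assuming the full model is available as an ordered list of layers (input side first):

```python
def gradual_unfreeze(model_layers, epoch):
    # epoch 0: only the last layer trains; every later epoch
    # unfreezes one more layer from the top down
    for layer in model_layers:
        for param in layer.parameters():
            param.requires_grad = False
    for layer in model_layers[-(epoch + 1):]:
        for param in layer.parameters():
            param.requires_grad = True
```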
Experiments
all on 3 types of classification tasks across 6 datasets:
sentiment analysis - binary on IMDb, binary and 5-class on Yelp
question classification into 6 categories (TREC-6)
topic classification - AG News, DBpedia
for all they report "error rates", most probably misclassified/all (= 1 - accuracy)
they don't evaluate the perplexity of the vanilla LM (probably a plain LSTM vs. AWD-LSTM), only the error rate when the vanilla LM is used instead of their model
Read the paper https://aclweb.org/anthology/P18-1031, write notes, present to collaborators.