howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens #17

Open howardyclo opened 6 years ago

howardyclo commented 6 years ago

Metadata

Authors: Marek Rei and Anders Søgaard
Organization: University of Cambridge & University of Copenhagen
Conference: NAACL 2018
Paper: https://arxiv.org/pdf/1805.02214.pdf
Code: https://github.com/marekrei/mltagger

howardyclo commented 6 years ago

Summary

This paper shows that it is possible to infer token-level labels from an attention mechanism, even though the model is trained only on sentence-level binary classification. Evaluation on uncertainty detection, grammatical error detection and sentiment classification, alongside several alternative methods, shows that their attention-based method achieves the best performance and is competitive with a fully-supervised method. The paper also provides interesting attention visualizations that help interpret the model's predictions.

Motivation

Network architecture

  1. Word- and character-level embeddings form the word representation, which is fed into a bi-LSTM; a tanh-activated layer maps the forward and backward hidden states into the same space, and they are concatenated into a single hidden state h_{i} for each word w_{i}.
  2. An attention score e_{i} is computed by a feedforward layer on h_{i}. The scores are then normalized over all words into a probability distribution, but not with a softmax: the authors avoid it because the exponential would push the attention to concentrate on a single token (the paper does not explain this well; I assume softmax encourages a sharp, peaked distribution). Instead, each score is passed through a logistic (sigmoid) activation to obtain a_{i}^{hat}, and these are then normalized over all words into the attention weights a_{i}.
  3. The sentence representation c is computed as the sum of the hidden states h_{i} weighted by a_{i} over all words.
  4. The pre-normalization attention score a_{i}^{hat} is used directly as the score for sequence labeling, with a decision boundary of 0.5.
  5. Finally, c is fed into a feedforward layer to predict a binary label for the overall sentence (see the sketch after this list).
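A minimal PyTorch sketch of steps 2–5, assuming the bi-LSTM and character-level components of step 1 already produce the hidden states h. The module and parameter names (SoftAttentionTagger, attn_dim, etc.) are illustrative assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

class SoftAttentionTagger(nn.Module):
    """Sketch of the attention-based zero-shot tagger described above."""

    def __init__(self, hidden_dim, attn_dim=100):
        super().__init__()
        # Feedforward layer producing an unnormalized attention score e_i per token.
        self.attn_ff = nn.Sequential(
            nn.Linear(2 * hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        # Feedforward layer mapping the sentence vector c to a sentence-level probability.
        self.sent_ff = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, h):
        # h: (batch, seq_len, 2 * hidden_dim) bi-LSTM hidden states h_i.
        e = self.attn_ff(h).squeeze(-1)                    # raw scores e_i
        a_tilde = torch.sigmoid(e)                         # pre-normalization scores a_i^{hat} in (0, 1)
        a = a_tilde / a_tilde.sum(dim=1, keepdim=True)     # normalized attention weights a_i
        c = torch.bmm(a.unsqueeze(1), h).squeeze(1)        # sentence vector c = sum_i a_i * h_i
        sent_prob = self.sent_ff(c).squeeze(-1)            # sentence-level prediction
        token_labels = (a_tilde > 0.5).long()              # zero-shot token labels from a_i^{hat}
        return sent_prob, a_tilde, token_labels
```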

Constraints to learn high quality attention

  1. Only some, but not all, tokens in the sentence can have a positive label.
  2. There are positive tokens in a sentence only if the overall sentence is positive.

Objective function

L = L1 + γ(L2 + L3), where L1 is the squared-error loss from sentence classification, L2 and L3 are regularization losses designed for the above two constraints respectively (see paper for details), and γ is a hyperparameter weighting the regularizers.
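A hedged sketch of this objective, reusing the sent_prob and a_tilde returned by the module above. The regularizer forms (pushing the minimum a_i^{hat} toward 0 and the maximum a_i^{hat} toward the sentence label) follow the two constraints; the function name and the default γ value are illustrative assumptions:

```python
def zero_shot_loss(sent_prob, a_tilde, sent_label, gamma=0.01):
    """Combined objective L = L1 + gamma * (L2 + L3).

    sent_prob:  (batch,) predicted sentence-level probability
    a_tilde:    (batch, seq_len) pre-normalization token attention scores
    sent_label: (batch,) gold sentence label in {0., 1.}
    """
    l1 = ((sent_prob - sent_label) ** 2).sum()                   # squared sentence-classification loss
    l2 = (a_tilde.min(dim=1).values ** 2).sum()                  # constraint 1: some tokens must stay near 0
    l3 = ((a_tilde.max(dim=1).values - sent_label) ** 2).sum()   # constraint 2: max score should match sentence label
    return l1 + gamma * (l2 + l3)
```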

Comparison with alternative methods

Instead of using attention, three alternative methods can also infer token-level labels: