howardyclo / papernotes

My personal notes and surveys on DL, CV and NLP papers.

Zero-shot Sequence Labeling: Transferring Knowledge from Sentences to Tokens #17

Open howardyclo opened 6 years ago

howardyclo commented 6 years ago

Metadata

Authors: Marek Rei and Anders Søgaard
Organization: University of Cambridge & University of Copenhagen
Conference: NAACL 2018
Paper: https://arxiv.org/pdf/1805.02214.pdf
Code: https://github.com/marekrei/mltagger

howardyclo commented 6 years ago

Summary

This paper shows that it is possible to infer token-level labels from an attention mechanism, even though the model is trained only on sentence-level binary classification. Evaluation on uncertainty detection, grammatical error detection and sentiment classification, alongside several alternative methods, shows that their attention-based method achieves the best performance and is competitive with a fully-supervised method. The paper also provides interesting attention visualizations that help interpret the model's predictions.

Motivation

Network architecture

  1. Word- and character-level embeddings form the word representation, which is fed into a bi-LSTM; a tanh-activated layer maps the forward and backward hidden states into the same space, and they are concatenated into a single hidden state h_{i} for each word w_{i}.
  2. An attention score e_{i} is computed by a feedforward layer on h_{i}. The scores are then normalized over all words into a probability distribution, but not with a softmax: the authors avoid it because the exponential would push the attention to concentrate on a single token (the paper does not explain this well; I assume softmax encourages a sharp, peaked distribution). Instead, each score is passed through a logistic (sigmoid) activation to obtain a_{i}^{hat}, and these are then normalized over all words into the attention weights a_{i}.
  3. The sentence representation c is computed as the sum of the hidden states h_{i} weighted by a_{i} over all words.
  4. The pre-normalization attention score a_{i}^{hat} is used directly as the score for sequence labeling, with a decision boundary of 0.5.
  5. Finally, c is fed into a feedforward layer to predict a binary label for the overall sentence (see the sketch after this list).
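A minimal PyTorch sketch of steps 2–5, assuming the bi-LSTM and character-level components of step 1 already produce the hidden states h. The module and parameter names (SoftAttentionTagger, attn_dim, etc.) are illustrative assumptions, not taken from the released code:

```python
import torch
import torch.nn as nn

class SoftAttentionTagger(nn.Module):
    """Sketch of the attention-based zero-shot tagger described above."""

    def __init__(self, hidden_dim, attn_dim=100):
        super().__init__()
        # Feedforward layer producing an unnormalized attention score e_i per token.
        self.attn_ff = nn.Sequential(
            nn.Linear(2 * hidden_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        # Feedforward layer mapping the sentence vector c to a sentence-level probability.
        self.sent_ff = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, 1), nn.Sigmoid())

    def forward(self, h):
        # h: (batch, seq_len, 2 * hidden_dim) bi-LSTM hidden states h_i.
        e = self.attn_ff(h).squeeze(-1)                    # raw scores e_i
        a_tilde = torch.sigmoid(e)                         # pre-normalization scores a_i^{hat} in (0, 1)
        a = a_tilde / a_tilde.sum(dim=1, keepdim=True)     # normalized attention weights a_i
        c = torch.bmm(a.unsqueeze(1), h).squeeze(1)        # sentence vector c = sum_i a_i * h_i
        sent_prob = self.sent_ff(c).squeeze(-1)            # sentence-level prediction
        token_labels = (a_tilde > 0.5).long()              # zero-shot token labels from a_i^{hat}
        return sent_prob, a_tilde, token_labels
```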

Constraints to learn high quality attention

  1. Only some, but not all, tokens in the sentence can have a positive label.
  2. There are positive tokens in a sentence only if the overall sentence is positive.

Objective function

L = L1 + γ(L2 + L3), where L1 is the squared-error loss from sentence classification, L2 and L3 are regularization losses designed for the above two constraints respectively (see paper for details), and γ is a hyperparameter weighting the regularizers.
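A hedged sketch of this objective, reusing the sent_prob and a_tilde returned by the module above. The regularizer forms (pushing the minimum a_i^{hat} toward 0 and the maximum a_i^{hat} toward the sentence label) follow the two constraints; the function name and the default γ value are illustrative assumptions:

```python
def zero_shot_loss(sent_prob, a_tilde, sent_label, gamma=0.01):
    """Combined objective L = L1 + gamma * (L2 + L3).

    sent_prob:  (batch,) predicted sentence-level probability
    a_tilde:    (batch, seq_len) pre-normalization token attention scores
    sent_label: (batch,) gold sentence label in {0., 1.}
    """
    l1 = ((sent_prob - sent_label) ** 2).sum()                   # squared sentence-classification loss
    l2 = (a_tilde.min(dim=1).values ** 2).sum()                  # constraint 1: some tokens must stay near 0
    l3 = ((a_tilde.max(dim=1).values - sent_label) ** 2).sum()   # constraint 2: max score should match sentence label
    return l1 + gamma * (l2 + l3)
```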

Comparison with alternative methods

Instead of using attention, three alternative methods can also infer token-level labels: