Reports that character-based convolutional NMT models are brittle to noisy text input
Explores two approaches to remedy this:
structure-invariant word representation
robust training on noisy texts
Details
Introduction
humans have a surprisingly robust language processing system that easily overcomes typos, misspellings, and even the complete omission of letters when reading. The sentence below contains hardly any correctly spelled words, yet we can still understand it:
-“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.”
the mechanism behind this robust perception is still not fully understood
Google Translate decodes the above sentence into Korean as follows (retrieved on March 27, 2018):
"Cmabrigde Uinervtisy에서의 연구에 대한 연구에 따르면, 그 연구에서의 연구자는 존재하지 않았고, 그 연구자의 연구는 진부한 것이었고, 연구자는 연구실에 있었다." (roughly: "According to a study about a study at Cmabrigde Uinervtisy, the researcher in that study did not exist, that researcher's research was trite, and the researcher was in the lab.")
while typos and noise are not new to NLP, NMT systems are rarely trained to explicitly address them
Quantitatively, performance of the tested NMT systems (char2char, Nematus, and charCNN) drops substantially even with small amounts of noise
Naturally noisy parallel corpora are not generally available, so the authors harvest errors from existing corpora of edits:
French : WiCoPaCo, Wikipedia edit history corpus
German : RWSE Wiki Revision Dataset and MERLIN corpus of language learners
Czech : manually annotated essays written by non-native speakers
Natural errors (capitalization, incorrect substitution of voiced/voiceless consonants, missing palatalization, errors in numbers, inflection, colloquial forms, and others) are difficult to generate synthetically
these errors are inserted into the source side of the parallel data by uniform sampling
Synthetic noise falls into four types (a sketch of these perturbations follows the list):
Swap : swap two letters in the middle of the word, applied only to words of length >= 4 (noise -> nosie)
Middle Random : randomize the order of all letters except the first and last (noise -> nisoe)
Fully Random : randomize the order of all letters in the word (noise -> iones)
Keyboard Typo : replace one letter with a keyboard-adjacent key (noise -> noide)
noise is applied to every word of the source corpus (maybe too much?)
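Below is a minimal sketch of these four perturbations, written as my own illustration rather than the authors' code; the tiny keyboard-adjacency map is an assumption, and Swap is read here as swapping two adjacent middle letters.

```python
import random

# Assumed, tiny keyboard-adjacency map (illustrative only)
KEY_NEIGHBORS = {"s": "adwe", "d": "sfer", "e": "wrd", "i": "uok", "n": "bm", "o": "ipl"}

def swap(word: str) -> str:            # noise -> nosie
    """Swap two adjacent letters in the middle, only for words of length >= 4."""
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def middle_random(word: str) -> str:   # noise -> nisoe
    """Shuffle all letters except the first and last."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def fully_random(word: str) -> str:    # noise -> iones
    """Shuffle all letters of the word."""
    chars = list(word)
    random.shuffle(chars)
    return "".join(chars)

def keyboard_typo(word: str) -> str:   # noise -> noide
    """Replace one letter with a (made-up) keyboard-adjacent key."""
    positions = [i for i, c in enumerate(word) if c in KEY_NEIGHBORS]
    if not positions:
        return word
    i = random.choice(positions)
    return word[:i] + random.choice(KEY_NEIGHBORS[word[i]]) + word[i + 1:]
```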
performance drops significantly with both synthetic and natural noise
Google Translate after spell checking
Test texts containing natural errors are corrected with Google's spell-checker before translation
For Fr and De, the top predicted corrections recover some of the natural errors, but not all of them
For Cz, many correction candidates exist due to its rich morphology and grammar, so the spell-checker helps even less
Dealing with Noise
Structure Invariant Representation
meanChar
average the character embeddings to form a word representation and feed it to a word-level encoder similar to the charCNN model (see the sketch below)
insensitive to character order, so it is expected to help only with scrambling noise, not with Key or Nat
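A toy sketch of the meanChar word representation (the character vocabulary and embedding size are my assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char2id), 8))   # assumed character embedding table

def mean_char(word: str) -> np.ndarray:
    """Average the character embeddings; the result feeds a word-level encoder."""
    ids = [char2id[c] for c in word.lower() if c in char2id]
    return char_emb[ids].mean(axis=0)

# Averaging ignores character order, so scrambled words collapse to the same vector,
# but a keyboard typo or natural error still changes the representation.
print(np.allclose(mean_char("noise"), mean_char("nisoe")))  # True
print(np.allclose(mean_char("noise"), mean_char("noide")))  # False
```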
results of meanChar trained and tested on different noise conditions
vanilla performance drops by 5~7 BLEU
models fail to learn on some combinations of noise
meanChar fails to learn on Cz due to its complex morphology; also, e.g., eat, tea, ate, and tae are all represented as the same word, even though each may have a distinct meaning
Black-Box Adversarial Training
replace the original training set with a noisy training set
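In other words, this is plain data augmentation on the source side. A small, self-contained sketch under assumed data (here only Middle Random noise is applied, purely for illustration):

```python
import random

def middle_random(word: str) -> str:
    """Scramble the middle letters of a word, keeping the first and last in place."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def noisy_source(sentence: str) -> str:
    """Perturb every word of the source sentence, as the paper does."""
    return " ".join(middle_random(w) for w in sentence.split())

# Hypothetical parallel corpus; the clean source side is replaced by a noisy copy.
parallel = [("the model reads noisy source text", "le modèle lit un texte source bruité")]
train_set = [(noisy_source(src), tgt) for src, tgt in parallel]
```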
Results on charCNN trained and tested on different noise conditions
performance on original test set is decent
models trained on a specific kind of noise perform well on the same kind of noise at test time
a model trained on Rand also performs well on Swap and Mid at test time, but not vice versa
training on synthetic noise does not improve performance on test sets with natural noise, and vice versa
models trained on mixed noise perform worse on the vanilla test set, but better on average across all noise kinds
Analysis - how the noise is learnt in charCNN
the charCNN model trained on Rand noise performed better under this adversarial training than meanChar; the authors hypothesize that different conv filters learn to be robust to different kinds of noise, and that for Rand noise filters learn a mean- or sum-like operation by assigning equal (or nearly equal) weights
Visualize variance of conv filter weights
as expected, the model trained on Rand noise has the lowest variance, meaning its filter weights are close to uniform in order to capture a structure-invariant representation
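A rough sketch of that variance check (the filter-bank shape and the reduction axes are my assumptions; the paper's exact procedure may differ):

```python
import numpy as np

# Hypothetical charCNN filter bank: (n_filters, char_emb_dim, kernel_width).
filters = np.random.default_rng(0).normal(size=(100, 25, 5))

# A filter that approximates a mean/sum over its window has near-equal weights,
# hence low variance; lower average variance ~ more structure-invariant filters.
per_filter_var = filters.var(axis=(1, 2))
print(per_filter_var.mean())
```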
Analysis - Richness of Natural Noise
Qualitative analysis of 40 samples shows that most natural noise arises from
phonetic or phonological phenomena in the language
character omission
the rest are incorrect morphological conjugation of verbs, key swaps, character insertions, orthographic variants, etc.
In short, natural noise is not directly captured by the synthetic noise generation in this paper
Personal Thoughts
surprised to see this paper accepted in ICLR
very well experimented and well written paper
wonder why they did not use ConvS2S or the Transformer as a baseline model; there is not even a single mention of them
they call it adversarial training, but it is simply data augmentation
Link: http://lanl.arxiv.org/pdf/1711.02173.pdf | Authors: Belinkov et al. 2018