Reports that character-based convolutional NMT models are brittle to noisy text input
Explores two approaches to remedy this:
structure-invariant word representation
robust training on noisy texts
Details
Introduction
humans have a surprisingly robust language processing system that easily overcomes typos, misspellings, and even the complete omission of letters when reading. The sentence below contains hardly any correctly spelled words, yet we can still understand it:
-“Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae.”
the mechanism behind this robust perception is still not fully understood
Google Translate decodes the above sentence into Korean as follows (retrieved on March 27, 2018):
"Cmabrigde Uinervtisy에서의 연구에 대한 연구에 따르면, 그 연구에서의 연구자는 존재하지 않았고, 그 연구자의 연구는 진부한 것이었고, 연구자는 연구실에 있었다." (roughly: "According to a study about a study at Cmabrigde Uinervtisy, the researcher in that study did not exist, that researcher's research was trite, and the researcher was in the lab.")
while typos and noise are not new to NLP, NMT systems are rarely trained to explicitly address them
Quantitatively, performance of the tested NMT systems (char2char, Nematus, and charCNN) drops substantially even with small amounts of noise
Naturally noisy parallel corpora are not generally available, so the authors harvest errors from existing corpora of edits:
French : WiCoPaCo, Wikipedia edit history corpus
German : RWSE Wiki Revision Dataset and MERLIN corpus of language learners
Czech : manually annotated essays written by non-native speakers
Natural errors (capitalization, incorrect substitution of voiced/voiceless consonants, missing palatalization, errors in numbers, inflection, colloquial forms, and others) are difficult to generate synthetically
these errors are inserted into the source side of the parallel data by uniform sampling
Synthetic noise falls into four types (a sketch of these perturbations follows the list):
Swap : swap two letters in the middle of the word, applied only to words of length >= 4 (noise -> nosie)
Middle Random : randomize the order of all letters except the first and last (noise -> nisoe)
Fully Random : randomize the order of all letters in the word (noise -> iones)
Keyboard Typo : replace one letter with a keyboard-adjacent key (noise -> noide)
noise is applied to every word of the source corpus (maybe too much?)
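Below is a minimal sketch of these four perturbations, written as my own illustration rather than the authors' code; the tiny keyboard-adjacency map is an assumption, and Swap is read here as swapping two adjacent middle letters.

```python
import random

# Assumed, tiny keyboard-adjacency map (illustrative only)
KEY_NEIGHBORS = {"s": "adwe", "d": "sfer", "e": "wrd", "i": "uok", "n": "bm", "o": "ipl"}

def swap(word: str) -> str:            # noise -> nosie
    """Swap two adjacent letters in the middle, only for words of length >= 4."""
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def middle_random(word: str) -> str:   # noise -> nisoe
    """Shuffle all letters except the first and last."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def fully_random(word: str) -> str:    # noise -> iones
    """Shuffle all letters of the word."""
    chars = list(word)
    random.shuffle(chars)
    return "".join(chars)

def keyboard_typo(word: str) -> str:   # noise -> noide
    """Replace one letter with a (made-up) keyboard-adjacent key."""
    positions = [i for i, c in enumerate(word) if c in KEY_NEIGHBORS]
    if not positions:
        return word
    i = random.choice(positions)
    return word[:i] + random.choice(KEY_NEIGHBORS[word[i]]) + word[i + 1:]
```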
performance drops significantly with both synthetic and natural noise
Google Translate after spell checking
Test texts containing natural errors are corrected with Google's spell-checker before translation
For Fr and De, the top predicted corrections recover some of the natural errors, but not all of them
For Cz, many correction candidates exist due to its rich morphology and grammar, so the spell-checker helps even less
Dealing with Noise
Structure Invariant Representation
meanChar
average the character embeddings to form a word representation and feed it to a word-level encoder similar to the charCNN model (see the sketch below)
insensitive to character order, so it is expected to help only with scrambling noise, not with Key or Nat
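A toy sketch of the meanChar word representation (the character vocabulary and embedding size are my assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
char2id = {c: i for i, c in enumerate("abcdefghijklmnopqrstuvwxyz")}
char_emb = rng.normal(size=(len(char2id), 8))   # assumed character embedding table

def mean_char(word: str) -> np.ndarray:
    """Average the character embeddings; the result feeds a word-level encoder."""
    ids = [char2id[c] for c in word.lower() if c in char2id]
    return char_emb[ids].mean(axis=0)

# Averaging ignores character order, so scrambled words collapse to the same vector,
# but a keyboard typo or natural error still changes the representation.
print(np.allclose(mean_char("noise"), mean_char("nisoe")))  # True
print(np.allclose(mean_char("noise"), mean_char("noide")))  # False
```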
results of meanChar trained and tested on different noise conditions
vanilla performance drops by 5~7 BLEU
models fail to learn on some combinations of noise
meanChar fails to learn on Cz due to its complex morphology; also, e.g., eat, tea, ate, and tae are all represented as the same word, even though each may have a distinct meaning
Black-Box Adversarial Training
replace the original training set with a noisy training set
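In other words, this is plain data augmentation on the source side. A small, self-contained sketch under assumed data (here only Middle Random noise is applied, purely for illustration):

```python
import random

def middle_random(word: str) -> str:
    """Scramble the middle letters of a word, keeping the first and last in place."""
    if len(word) < 4:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def noisy_source(sentence: str) -> str:
    """Perturb every word of the source sentence, as the paper does."""
    return " ".join(middle_random(w) for w in sentence.split())

# Hypothetical parallel corpus; the clean source side is replaced by a noisy copy.
parallel = [("the model reads noisy source text", "le modèle lit un texte source bruité")]
train_set = [(noisy_source(src), tgt) for src, tgt in parallel]
```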
Results on charCNN trained and tested on different noise conditions
performance on original test set is decent
models trained on a specific kind of noise perform well on the same kind of noise at test time
a model trained on Rand also performs well on Swap and Mid at test time, but not vice versa
training on synthetic noise does not improve performance on test sets with natural noise, and vice versa
models trained on mixed noise perform worse on the vanilla test set, but better on average across all noise kinds
Analysis - how the noise is learnt in charCNN
the charCNN model trained on Rand noise performed better under this adversarial training than meanChar; the authors hypothesize that different conv filters learn to be robust to different kinds of noise, and that for Rand noise filters learn a mean- or sum-like operation by assigning equal (or nearly equal) weights
Visualize variance of conv filter weights
as expected, the model trained on Rand noise has the lowest variance, meaning its filter weights are close to uniform in order to capture a structure-invariant representation
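A rough sketch of that variance check (the filter-bank shape and the reduction axes are my assumptions; the paper's exact procedure may differ):

```python
import numpy as np

# Hypothetical charCNN filter bank: (n_filters, char_emb_dim, kernel_width).
filters = np.random.default_rng(0).normal(size=(100, 25, 5))

# A filter that approximates a mean/sum over its window has near-equal weights,
# hence low variance; lower average variance ~ more structure-invariant filters.
per_filter_var = filters.var(axis=(1, 2))
print(per_filter_var.mean())
```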
Analysis - Richness of Natural Noise
Qualitative analysis of 40 samples shows that most natural noise arises from
phonetic or phonological phenomena in the language
character omission
the rest are incorrect morphological conjugation of verbs, key swaps, character insertions, orthographic variants, etc.
In short, natural noise is not directly captured by the synthetic noise generation in this paper
Personal Thoughts
surprised to see this paper accepted in ICLR
very well experimented and well written paper
wonder why they did not use ConvS2S or the Transformer as a baseline model; there is not even a single mention of them
they call it adversarial training, but it is simply data augmentation
Link: http://lanl.arxiv.org/pdf/1711.02173.pdf | Authors: Belinkov et al. 2018