propose a training framework that builds a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way
outperforms existing supervised methods on cross-lingual tasks for some language pairs
Details
Introduction
Word Embedding
word2vec was proposed by Mikolov et al 2013a for learning distributed representations of words in an unsupervised manner
Levy & Goldberg 2014 showed that the skip-gram with negative sampling method of word2vec amounts to factorizing a word-context co-occurrence matrix whose entries are the point-wise mutual information of the respective word and context pairs
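For reference, a worked form of that result (a sketch in LaTeX; k is the number of negative samples, which produces the shift in the factorized matrix):

```latex
% SGNS implicitly factorizes a shifted PMI matrix M (Levy & Goldberg 2014):
\[
  M_{ij} = \vec{w}_i \cdot \vec{c}_j = \mathrm{PMI}(w_i, c_j) - \log k,
  \qquad
  \mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}
\]
```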
Cross-lingual Embedding w Parallel vocab
initial work on cross-lingual embeddings started with Mikolov et al 2013b, who noticed that continuous word embedding spaces exhibit similar structures across languages and proposed to learn a linear mapping from the source to the target embedding space, using a parallel vocabulary of 5k words as anchor points
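A minimal numpy sketch of this idea (not the authors' exact training setup; X and Y are assumed to hold the source and target embeddings of the 5k anchor pairs as rows):

```python
import numpy as np

def learn_linear_mapping(X, Y):
    """Least-squares linear map W such that X @ W ≈ Y.

    X : (n, d) source embeddings of the anchor (seed) pairs
    Y : (n, d) target embeddings of the same pairs
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # (d, d)

# toy usage with random data, just to show the shapes involved
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W = learn_linear_mapping(X, Y)
print(W.shape)  # (300, 300)
```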
Cross-lingual Embedding w/o Parallel vocab
Smith et al 2017 employ identical character strings to form a parallel vocab, which limits the approach to languages sharing a common alphabet
the above approaches sound appealing, but their performance is significantly below that of supervised methods
Contributions
propose learning SoTA cross-lingual embeddings without a parallel vocab, evaluated on three tasks: word translation, sentence translation retrieval, and cross-lingual word similarity
introduce a cross-domain similarity adaptation method that significantly improves the unsupervised method by mitigating the hubness problem (points tending to be nearest neighbors of many other points in high-dimensional space)
propose an unsupervised criterion that is highly correlated with the quality of the cross-lingual mapping, which can be used for early stopping and hyperparameter tuning
release high-quality dictionaries for 12 oriented language pairs and open-source the code
Method
Word Embedding
learn unsupervised word embeddings using fastText (300-dim) for the source and target languages
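A hedged sketch using the official fastText Python bindings ("corpus.txt" is an assumed path to a tokenized monolingual corpus; one such model would be trained per language):

```python
import fasttext

# skip-gram embeddings, 300 dimensions ("corpus.txt" is a placeholder path)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)
model.save_model("embeddings.bin")
print(model.get_word_vector("hello").shape)  # (300,)
```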
Adversarial Training
discriminator model: 2-hidden-layer fully connected network with hidden size 2048 and Leaky-ReLU activations
train a GAN where the discriminator tries to detect whether an embedding comes from the (mapped) source or the target space, and the mapping W (the generator) tries to fool the discriminator
W is learnt while approximately preserving orthogonality, using the update rule below after each training step (β = 0.01 in the paper):
W ← (1 + β) W - β (W W^T) W
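A minimal numpy sketch of that orthogonality-preserving step; the toy check below only shows that repeated application pulls a near-orthogonal matrix back toward the orthogonal manifold:

```python
import numpy as np

def orthogonalize(W, beta=0.01):
    """One orthogonality-preserving update: W <- (1 + beta) W - beta (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

# toy check: start from an orthogonal matrix perturbed by noise
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(300, 300)))
W = Q + 0.01 * rng.normal(size=(300, 300))
print(np.abs(W @ W.T - np.eye(300)).max())   # noticeably > 0
for _ in range(500):
    W = orthogonalize(W)
print(np.abs(W @ W.T - np.eye(300)).max())   # close to 0
```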
Refinement Procedure
adversarial training gives good performance, but not on par with supervised methods, largely because rare words hinder the overall alignment quality
to refine, build a synthetic parallel vocab on the fly using the W learned during GAN training: take the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary
apply the Procrustes solution on this generated dictionary for refinement, and iterate; the closed-form solution is W* = U V^T with U Σ V^T = SVD(Y X^T) (see the sketch below)
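A minimal numpy sketch of one refinement iteration, assuming X (source) and Y (target) are row-wise L2-normalized embedding matrices of the most frequent words; with row vectors the Procrustes solution becomes W* = U V^T where U Σ V^T = SVD(X_d^T Y_d) on the dictionary pairs:

```python
import numpy as np

def build_dictionary(X, Y, W):
    """Synthetic dictionary: keep only mutual nearest neighbors between
    the mapped source embeddings (X @ W) and the target embeddings Y."""
    sims = (X @ W) @ Y.T                       # cosine similarities (rows normalized)
    src2tgt = sims.argmax(axis=1)              # best target for each source word
    tgt2src = sims.argmax(axis=0)              # best source for each target word
    pairs = [(s, t) for s, t in enumerate(src2tgt) if tgt2src[t] == s]
    return np.array(pairs)                     # (m, 2) mutual-NN index pairs

def procrustes(X, Y, pairs):
    """Closed-form orthogonal Procrustes fit on the synthetic dictionary."""
    Xd, Yd = X[pairs[:, 0]], Y[pairs[:, 1]]
    U, _, Vt = np.linalg.svd(Xd.T @ Yd)
    return U @ Vt                              # new orthogonal mapping W

# one refinement iteration on toy data (the initial W would come from the GAN stage)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 300)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(2000, 300)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
W = procrustes(X, Y, build_dictionary(X, Y, np.eye(300)))
```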
Cross-Domain Similarity Local Scaling (CSLS)
to resolve the hubness problem, consider a bi-partite neighborhood graph and take the mean similarity of a mapped source embedding to its K nearest target neighbors:
r_T(W x_s) = (1/K) Σ_{y_t ∈ N_T(W x_s)} cos(W x_s, y_t)
the overall similarity measure (CSLS) is then measured as
CSLS(W x_s, y_t) = 2 cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)
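A compact numpy sketch of CSLS retrieval (K = 10 as in the paper), assuming Xw holds the mapped, row-normalized source embeddings and Y the row-normalized target embeddings:

```python
import numpy as np

def csls_scores(Xw, Y, k=10):
    """CSLS(Wx_s, y_t) = 2*cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t)."""
    sims = Xw @ Y.T                                      # cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # r_T: mean sim to K nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # r_S: mean sim to K nearest mapped sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# word translation retrieval: pick the target word with the highest CSLS score
rng = np.random.default_rng(0)
Xw = rng.normal(size=(100, 300)); Xw /= np.linalg.norm(Xw, axis=1, keepdims=True)
Y = rng.normal(size=(200, 300));  Y /= np.linalg.norm(Y, axis=1, keepdims=True)
translations = csls_scores(Xw, Y).argmax(axis=1)         # best target index per source word
```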
Unsupervised Criterion
consider the 10k most frequent source words, use CSLS to generate a translation for each of them, compute the average cosine similarity between these word-translation pairs, and use this average as the validation metric (sketched below)
this criterion correlates better with performance on the evaluation tasks than the Wasserstein distance does
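A short sketch of the validation criterion under the same assumptions, reusing csls_scores and the toy Xw, Y from the previous snippet (source rows are assumed to be ordered by frequency):

```python
def validation_criterion(Xw, Y, n_most_frequent=10_000, k=10):
    """Average cosine similarity between the most frequent source words
    and their CSLS-retrieved translations (used for model selection)."""
    Xs = Xw[:n_most_frequent]                      # most frequent source words
    best = csls_scores(Xs, Y, k=k).argmax(axis=1)  # CSLS translation for each of them
    return float((Xs * Y[best]).sum(axis=1).mean())

print(validation_criterion(Xw, Y))
```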
Experiments
Word Translation
applying Procrustes + CSLS in a supervised manner outperforms other supervised methods
the unsupervised method proposed in this paper outperforms the previous SoTA in P@1
when the word embeddings are trained on Wikipedia (a richer corpus), performance improves further
Sentence Retrieval
both the supervised and unsupervised methods achieve SoTA
Personal Thoughts
the engineering effort to push the performance of the unsupervised method past that of supervised methods is impressive
still, a word is a word and a sentence is a sentence; I'd like to see how this cross-lingual word embedding can be related to sentence-level context
PDF presented at the OpenNMT Workshop, Paris 2018. Link: https://arxiv.org/pdf/1710.04087.pdf Authors: Conneau et al. 2018