propose a training framework that builds a bilingual dictionary between two languages without using any parallel corpora, by aligning monolingual word embedding spaces in an unsupervised way
outperforms existing supervised methods on cross-lingual tasks for some language pairs
Details
Introduction
Word Embedding
word2vec was proposed by Mikolov et al 2013a for learning distributed representations of words in an unsupervised manner
Levy & Goldberg 2014 showed that the skip-gram with negative sampling method of word2vec amounts to factorizing a word-context co-occurrence matrix whose entries are the point-wise mutual information of the respective word and context pairs
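For reference, a worked form of that result (a sketch in LaTeX; k is the number of negative samples, which produces the shift in the factorized matrix):

```latex
% SGNS implicitly factorizes a shifted PMI matrix M (Levy & Goldberg 2014):
\[
  M_{ij} = \vec{w}_i \cdot \vec{c}_j = \mathrm{PMI}(w_i, c_j) - \log k,
  \qquad
  \mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)}
\]
```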
Cross-lingual Embedding w Parallel vocab
initial work on cross-lingual embeddings started with Mikolov et al 2013b, who noticed that continuous word embedding spaces exhibit similar structures across languages and proposed to learn a linear mapping from the source to the target embedding space, using a parallel vocabulary of 5k words as anchor points
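A minimal numpy sketch of this idea (not the authors' exact training setup; X and Y are assumed to hold the source and target embeddings of the 5k anchor pairs as rows):

```python
import numpy as np

def learn_linear_mapping(X, Y):
    """Least-squares linear map W such that X @ W ≈ Y.

    X : (n, d) source embeddings of the anchor (seed) pairs
    Y : (n, d) target embeddings of the same pairs
    """
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W  # (d, d)

# toy usage with random data, just to show the shapes involved
rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5000, 300)), rng.normal(size=(5000, 300))
W = learn_linear_mapping(X, Y)
print(W.shape)  # (300, 300)
```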
Cross-lingual Embedding w/o Parallel vocab
Smith et al 2017 employ identical character strings to form a parallel vocab, which limits the approach to languages sharing a common alphabet
the above approaches sound appealing, but their performance is significantly below that of supervised methods
Contributions
propose learning SoTA cross-lingual embeddings without a parallel vocab, evaluated on three tasks: word translation, sentence translation retrieval, and cross-lingual word similarity
introduce a cross-domain similarity adaptation method that significantly improves the unsupervised method by mitigating the hubness problem (points tending to be nearest neighbors of many other points in high-dimensional space)
propose an unsupervised criterion that is highly correlated with the quality of the cross-lingual mapping, which can be used for early stopping and hyperparameter tuning
release high-quality dictionaries for 12 oriented language pairs and open-source the code
Method
Word Embedding
learn unsupervised word embeddings using fastText (300-dim) for the source and target languages
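A hedged sketch using the official fastText Python bindings ("corpus.txt" is an assumed path to a tokenized monolingual corpus; one such model would be trained per language):

```python
import fasttext

# skip-gram embeddings, 300 dimensions ("corpus.txt" is a placeholder path)
model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)
model.save_model("embeddings.bin")
print(model.get_word_vector("hello").shape)  # (300,)
```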
Adversarial Training
discriminator model: 2-hidden-layer fully connected network with hidden size 2048 and Leaky-ReLU activations
train a GAN where the discriminator tries to detect whether an embedding comes from the (mapped) source or the target space, and the mapping W (the generator) tries to fool the discriminator
W is learnt while approximately preserving orthogonality, using the update rule below after each training step (β = 0.01 in the paper):
W ← (1 + β) W - β (W W^T) W
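A minimal numpy sketch of that orthogonality-preserving step; the toy check below only shows that repeated application pulls a near-orthogonal matrix back toward the orthogonal manifold:

```python
import numpy as np

def orthogonalize(W, beta=0.01):
    """One orthogonality-preserving update: W <- (1 + beta) W - beta (W W^T) W."""
    return (1 + beta) * W - beta * (W @ W.T) @ W

# toy check: start from an orthogonal matrix perturbed by noise
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(300, 300)))
W = Q + 0.01 * rng.normal(size=(300, 300))
print(np.abs(W @ W.T - np.eye(300)).max())   # noticeably > 0
for _ in range(500):
    W = orthogonalize(W)
print(np.abs(W @ W.T - np.eye(300)).max())   # close to 0
```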
Refinement Procedure
adversarial training gives good performance, but not on par with supervised methods, largely because rare words hinder the overall alignment quality
to refine, build a synthetic parallel vocab on the fly using the W learned during GAN training: take the most frequent words and retain only mutual nearest neighbors to ensure a high-quality dictionary
apply the Procrustes solution on this generated dictionary for refinement, and iterate; the closed-form solution is W* = U V^T with U Σ V^T = SVD(Y X^T) (see the sketch below)
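A minimal numpy sketch of one refinement iteration, assuming X (source) and Y (target) are row-wise L2-normalized embedding matrices of the most frequent words; with row vectors the Procrustes solution becomes W* = U V^T where U Σ V^T = SVD(X_d^T Y_d) on the dictionary pairs:

```python
import numpy as np

def build_dictionary(X, Y, W):
    """Synthetic dictionary: keep only mutual nearest neighbors between
    the mapped source embeddings (X @ W) and the target embeddings Y."""
    sims = (X @ W) @ Y.T                       # cosine similarities (rows normalized)
    src2tgt = sims.argmax(axis=1)              # best target for each source word
    tgt2src = sims.argmax(axis=0)              # best source for each target word
    pairs = [(s, t) for s, t in enumerate(src2tgt) if tgt2src[t] == s]
    return np.array(pairs)                     # (m, 2) mutual-NN index pairs

def procrustes(X, Y, pairs):
    """Closed-form orthogonal Procrustes fit on the synthetic dictionary."""
    Xd, Yd = X[pairs[:, 0]], Y[pairs[:, 1]]
    U, _, Vt = np.linalg.svd(Xd.T @ Yd)
    return U @ Vt                              # new orthogonal mapping W

# one refinement iteration on toy data (the initial W would come from the GAN stage)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 300)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.normal(size=(2000, 300)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
W = procrustes(X, Y, build_dictionary(X, Y, np.eye(300)))
```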
Cross-Domain Similarity Local Scaling (CSLS)
to resolve the hubness problem, consider a bi-partite neighborhood graph and take the mean similarity of a mapped source embedding to its K nearest target neighbors:
r_T(W x_s) = (1/K) Σ_{y_t ∈ N_T(W x_s)} cos(W x_s, y_t)
the overall similarity measure (CSLS) is then measured as
CSLS(W x_s, y_t) = 2 cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t)
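A compact numpy sketch of CSLS retrieval (K = 10 as in the paper), assuming Xw holds the mapped, row-normalized source embeddings and Y the row-normalized target embeddings:

```python
import numpy as np

def csls_scores(Xw, Y, k=10):
    """CSLS(Wx_s, y_t) = 2*cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t)."""
    sims = Xw @ Y.T                                      # cosine similarities
    r_src = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # r_T: mean sim to K nearest targets
    r_tgt = np.sort(sims, axis=0)[-k:, :].mean(axis=0)   # r_S: mean sim to K nearest mapped sources
    return 2 * sims - r_src[:, None] - r_tgt[None, :]

# word translation retrieval: pick the target word with the highest CSLS score
rng = np.random.default_rng(0)
Xw = rng.normal(size=(100, 300)); Xw /= np.linalg.norm(Xw, axis=1, keepdims=True)
Y = rng.normal(size=(200, 300));  Y /= np.linalg.norm(Y, axis=1, keepdims=True)
translations = csls_scores(Xw, Y).argmax(axis=1)         # best target index per source word
```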
Unsupervised Criterion
consider the 10k most frequent source words, use CSLS to generate a translation for each of them, compute the average cosine similarity between these word-translation pairs, and use this average as the validation metric (sketched below)
this criterion correlates better with performance on the evaluation tasks than the Wasserstein distance does
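A short sketch of the validation criterion under the same assumptions, reusing csls_scores and the toy Xw, Y from the previous snippet (source rows are assumed to be ordered by frequency):

```python
def validation_criterion(Xw, Y, n_most_frequent=10_000, k=10):
    """Average cosine similarity between the most frequent source words
    and their CSLS-retrieved translations (used for model selection)."""
    Xs = Xw[:n_most_frequent]                      # most frequent source words
    best = csls_scores(Xs, Y, k=k).argmax(axis=1)  # CSLS translation for each of them
    return float((Xs * Y[best]).sum(axis=1).mean())

print(validation_criterion(Xw, Y))
```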
Experiments
Word Translation
applying Procrustes + CSLS in a supervised manner outperforms other supervised methods
the unsupervised method proposed in this paper outperforms the previous SoTA in P@1
when the word embeddings are trained on Wikipedia (a richer corpus), performance improves further
Sentence Retrieval
both the supervised and unsupervised methods achieve SoTA
Personal Thoughts
the engineering effort to push the performance of the unsupervised method past that of supervised methods is impressive
still, a word is a word and a sentence is a sentence; I'd like to see how this cross-lingual word embedding can be related to sentence-level context
PDF presented at the OpenNMT Workshop, Paris 2018. Link: https://arxiv.org/pdf/1710.04087.pdf Authors: Conneau et al. 2018