impresso / federal-gazette


Next steps parallel corpus & multilingual embeddings #6

Closed aflueckiger closed 3 years ago

aflueckiger commented 5 years ago

@simon-clematide

Further steps parallel corpus

Further steps multilingual embeddings

Further steps phrase extraction


aflueckiger commented 5 years ago

The models are stored under /mnt/storage/clwork/projects/climpresso/local_storages/harlie/federal-gazette/embedding_vXXX.

The following snippet uses Python 2:

fpath = '/mnt/storage/clwork/projects/climpresso/local_storages/harlie/federal-gazette/embedding_v3/biskip.300.de-fr.bin'

from multivec import MonolingualModel, BilingualModel

# load the jointly trained bilingual skip-gram model (DE source, FR target)
model = BilingualModel(fpath)

# target-side (French) monolingual model and a single word vector
model.trg_model
model.trg_model.word_vec('france')

# compare cross-lingually: closest target words for a source word, and vice versa
model.trg_closest('bündnerfleisch')
model.src_closest('ordinateur')
aflueckiger commented 4 years ago

The embeddings of some words with old spelling seem to be very noisy. Moreover, they are oddly similar to each other without any apparent relation.

In [35]: model.trg_closest('abtheilung')
Out[35]: 
[('môme', 0.7755178213119507),
 ('clans', 0.769781231880188),
 ('do', 0.763927161693573),
 ('ot', 0.7473583817481995),
 ('section', 0.7450423836708069),
 ('celte', 0.7448492646217346),
 ('dn', 0.7437467575073242),
 ('ensorte', 0.7386529445648193),
 ('lo', 0.7359005212783813),
 ('sections', 0.7342149615287781)]

In [36]: model.trg_closest('bundesrath')
Out[36]: 
[('môme', 0.8397364616394043),
 ('celte', 0.8255298137664795),
 ('mr', 0.8033562898635864),
 ('„', 0.7990537285804749),
 ('do', 0.7959240674972534),
 ('qne', 0.7946202754974365),
 ('clans', 0.7938278913497925),
 ('sou', 0.7881290912628174),
 ('dès-lors', 0.7880147695541382),
 ('lo', 0.7871352434158325)]

In [37]: model.trg_closest('theilnahme')
Out[37]: 
[('môme', 0.7302607297897339),
 ('celte', 0.7289631366729736),
 ('clans', 0.7195239067077637),
 ('non-seulement', 0.7147542238235474),
 ('daus', 0.7087118029594421),
 ('„', 0.7065469622612),
 ('uu', 0.7032425403594971),
 ('dès-lors', 0.7022230625152588),
 ('sou', 0.6991384029388428),
 ('qne', 0.6973485350608826)]
simon-clematide commented 4 years ago

Maybe the older texts are not large enough?

Or should we try out https://github.com/facebookresearch/MUSE?

aflueckiger commented 4 years ago

@simon-clematide

They are definitely not OOV since I can look them up. All models are lowercased, by the way. I am investigating this issue. In general, the cross-lingual embeddings work nicely.

I will also have a look at MUSE, yet I am rather skeptical that it will work better. So far we leverage the parallelism of the corpus instead of simply performing a matrix transformation with Procrustes.

EDIT: maybe there is not enough material, as you mention. However, I am also skeptical about this.
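
For reference, MUSE's supervised mode essentially solves an orthogonal Procrustes problem over a seed dictionary. A minimal sketch of that transformation (assuming NumPy; the function and the seed-dictionary matrices are illustrative, not part of our pipeline):

```python
# Sketch: orthogonal Procrustes alignment as used in supervised MUSE.
# X: (n, d) source-language vectors, Y: (n, d) target-language vectors,
# rows aligned via a seed dictionary (illustrative inputs, not repo data).
import numpy as np

def procrustes(X, Y):
    """Return the orthogonal matrix W minimising ||X @ W - Y||_F."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# W = procrustes(src_seed_vecs, trg_seed_vecs)
# mapped = src_vecs @ W   # source embeddings mapped into the target space
```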

aflueckiger commented 4 years ago

Cross-Lingual Embeddings

This is an evaluation of the best cross-lingual alignments obtained with the supervised MUSE library, based on kNN with cosine similarity (the corresponding CSLS values may be slightly higher in another iteration). The evaluation is performed on the bilingual dictionary provided with MUSE (out-of-domain).

Please note that the comparison between FastText and MultiVec is currently not meaningful due to different minimum-count settings and, thus, different coverage of the validation dictionary. Moreover, the randomization of the corpus is not the same for all experiments.
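
For clarity, the NN precision@k reported below boils down to the following (a sketch with NumPy; the function and variable names are illustrative, not the actual MUSE evaluation code). CSLS additionally penalises hub words by subtracting the mean cosine similarity of each word's k nearest cross-lingual neighbours.

```python
# Sketch: precision@k over a bilingual dictionary with cosine-similarity kNN.
import numpy as np

def precision_at_k(src_vecs, trg_vecs, dico, k=5):
    """dico: list of (source_row, target_row) index pairs from the dictionary."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    trg = trg_vecs / np.linalg.norm(trg_vecs, axis=1, keepdims=True)
    hits = 0
    for s, t in dico:
        sims = trg @ src[s]              # cosine similarities to all target words
        topk = np.argsort(-sims)[:k]     # k nearest target neighbours
        hits += int(t in topk)
    return 100.0 * hits / len(dico)
```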

Other non-standard hyperparameters of the latest models:

FastText: -thread 20 -minCount 5 -ws 10 -dim 300 -minn 3 -maxn 6 -epoch 50
MultiVec: --window-size 10 --sg

Evaluation

Out-of-Domain Evaluation DE -> FR

| Model | Dim | Train Params | Eval Params | Aligning | NN Precision @ 1 | NN Precision @ 5 | CSLS Precision @ 1 | CSLS Precision @ 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FastText_v1 | 300 | no subword, iter 10, min-count 5 | max vocab 200k | MUSE | 51.97 | 67.78 | 61.97 | 74.70 |
| FastText_v2 | 300 | subword 3-6, iter 50, min-count 5 | max vocab 200k | MUSE | 54.06 | 68.65 | 61.58 | 74.38 |
| FastText_v2 | 300 | subword 3-6, iter 50, min-count 5 | max vocab 200k | RCSLS | 56.36 | 71.73 | 62.94 | 74.47 |
| multivec_v4 | 300 | iter 20, min-count 10 | max vocab 200k | parallelism + MUSE | 68.22 | 78.29 | 73.02 | 83.88 |
| multivec_v4 | 300 | iter 20, min-count 10 | max vocab 200k | parallelism | 69.30 | 78.76 | 73.49 | 84.81 |
| multivec_v5 | 300 | iter 50, min-count 10 | max vocab 200k | parallelism | 69.66 | 78.33 | 73.37 | 84.52 |
| multivec_v6 | 300 | iter 50, min-count 5 | max vocab 200k | parallelism | 65.07 | 78.21 | 71.64 | 85.08 |
| multivec_v6 | 300 | iter 50, min-count 5 | entire vocab | parallelism | 67.48 | 77.93 | 74.0 | 86.30 |
| multivec_v7 | 150 | iter 20, min-count 5 | max vocab 200k | parallelism | 61.79 | 74.63 | 65.67 | 80.23 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| multivec_new_corp_v1 | 100 | iter 20, min-count 5 | entire vocab | parallelism | 66.32 | 78.06 | 68.22 | 80.48 |
| multivec_new_corp_v1 | 150 | iter 20, min-count 5 | entire vocab | parallelism | 69.95 | 82.21 | 72.62 | 85.31 |
| multivec_new_corp_v1 | 300 | iter 20, min-count 5 | entire vocab | parallelism | 69.43 | 79.70 | 75.22 | 86.70 |

In-Domain Evaluation DE -> FR

| Model | Dim | Train Params | Eval Params | Aligning | NN Precision @ 1 | NN Precision @ 5 | CSLS Precision @ 1 | CSLS Precision @ 5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| multivec_v6 | 300 | iter 50, min-count 5 | entire vocab | parallelism | 43.02 | 59.88 | 39.53 | 56.98 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| multivec_new_corp_v1 | 100 | iter 20, min-count 5 | entire vocab | parallelism | 38.06 | 58.06 | 36.12 | 52.90 |
| multivec_new_corp_v1 | 150 | iter 20, min-count 5 | entire vocab | parallelism | 42.58 | 65.16 | 41.29 | 65.80 |
| multivec_new_corp_v1 | 300 | iter 20, min-count 5 | entire vocab | parallelism | 49.03 | **66.45** | 47.10 | 68.39 |

The bilingual in-domain dictionary was created by sampling from the Moses translation file with the following command:

more mt_moses/train_de_fr/model/lex.e2f | awk '{print $3" "$2" "$1}' | uniq -f2 | egrep "0\.0{1,2}[1-9][0-9]+ [a-z]{7,15} [a-z]{7,15}" | sort -r | more | shuf -n 10000 | awk '{print $3" "$2}' | egrep -v "-" > ../../dico-de-fr-sample.txt

To provide high-quality translations, the sampled items were manually filtered and extended with some prominent translations. The final dataset consists of 191 word pairs (DE-FR).

Provisional conclusions

aflueckiger commented 3 years ago

Old Embeddings

Evaluation of German Embeddings

We have trained various models with different hyperparameters, pre-processing, and data. Below is an evaluation of the jointly trained German embeddings. The evaluation is performed on lowercased analogies from a generic dataset (i.e., an out-of-domain evaluation).

python3 lib/eval_embeddings.py $FILE.VEC --lowercase
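
The analogy test itself boils down to the standard 3CosAdd rule; a minimal sketch (the function and the word2id/id2word structures are illustrative, not taken from lib/eval_embeddings.py):

```python
# Sketch: answering one analogy question "a : b = c : ?" with 3CosAdd.
import numpy as np

def answer_analogy(vectors, word2id, id2word, a, b, c):
    """vectors: (V, d) row-normalised embedding matrix."""
    query = vectors[word2id[b]] - vectors[word2id[a]] + vectors[word2id[c]]
    query /= np.linalg.norm(query)
    sims = vectors @ query
    for w in (a, b, c):                  # exclude the question words themselves
        sims[word2id[w]] = -np.inf
    return id2word[int(np.argmax(sims))]
```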

embedding_v1
data: only FedGaz
preprocessing: --lowercase --normalize-digits --normalize-punk --min-count 10
dimension: 100/300
window size: 5
min count: 10
iterations: 10

embedding_v2
data: FedGaz + Europarl
preprocessing: --lowercase --normalize-digits --normalize-punk --min-count 10; sed "s/[^[:alnum:] _-]/ /g" | sed "s/ +/ /g" ---> remove all non-alphanumeric characters
dimension: 100/300
window size: 10
min count: 10
iterations: 20

embedding_v3
data: FedGaz + Europarl
preprocessing: --lowercase --normalize-digits --min-count 5; sed "s/ [^[:alnum:] _-]+/ /g" | sed "s/ +/ /g" ---> remove all non-alphanumeric tokens (instead of characters)
dimension: 100/300
window size: 10
min count: 10
iterations: 20

Standard parameters for all of the models:
alpha: 0.05
iterations: 10
threads: 10
subsampling: 0.001
skip-gram: true
HS: false
negative: 5

Model Performance

Number of correctly answered analogies/questions for various models in German.

| Model | Syntactic | Opposite | Best match | Doesn't fit |
| --- | --- | --- | --- | --- |
| embedding_v1_100 | 24.7% (1797/7270) | 15.3% (46/300) | 14.3% (66/462) | 80.7% (71/88) |
| embedding_v1_300 | 29.6% (2149/7270) | 15.7% (47/300) | 24.0% (111/462) | 81.8% (72/88) |
| embedding_v2_100 | 26.1% (2058/7897) | 16.3% (49/300) | 22.9% (113/494) | 83.3% (75/90) |
| embedding_v3_100 | 22.6% (1810/8022) | 14.0% (42/300) | 17.7% (95/538) | 78.9% (71/90) |
| embedding_v3_300 | 31.4% (2518/8022) | 19.0% (57/300) | 32.5% (175/538) | 83.3% (75/90) |

Evaluation examples:
opposite: Frage Antwort stark schwach
best match: China Yuan Deutschland Euro
doesn't fit: August April September Jahr
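
The "doesn't fit" task can be scored by picking the word least similar to the mean of the group; a sketch (again illustrative, not the actual implementation in lib/eval_embeddings.py):

```python
# Sketch: odd-one-out ("doesn't fit") scoring.
import numpy as np

def doesnt_fit(vectors, word2id, words):
    """Return the word least similar to the mean vector of the group."""
    vecs = np.stack([vectors[word2id[w]] for w in words])
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    mean = vecs.mean(axis=0)
    mean /= np.linalg.norm(mean)
    sims = vecs @ mean
    return words[int(np.argmin(sims))]

# doesnt_fit(vectors, word2id, ['august', 'april', 'september', 'jahr'])  # ideally 'jahr'
```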

aflueckiger commented 3 years ago

Incorporated in README