glample / tagger

Named Entity Recognition Tool
Apache License 2.0
1.16k stars 426 forks

Script for training embeddings #77

Closed sa-j closed 6 years ago

sa-j commented 6 years ago

Hi there,

Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44; however, I want to extend your corpus with some more text. Could you upload the script with which the embeddings were produced?

Thank you very much!

@glample @pvcastro @julien-c

pvcastro commented 6 years ago

Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!

sa-j commented 6 years ago

Ok!

I'm actually looking for the original script with which the embeddings were trained on the Leipzig corpora collection and the German monolingual training data from the 2010 Machine Translation workshop (according to the paper).

glample commented 6 years ago

Hi,

We trained our embeddings using the wang2vec model, you can find it here: https://github.com/wlin12/wang2vec
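
If it helps, building it should just be a matter of cloning the repo and running make (assuming a standard gcc toolchain; if I remember correctly, the compiled binary is called word2vec):

git clone https://github.com/wlin12/wang2vec
cd wang2vec
make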

sa-j commented 6 years ago

Thank you! And do you have your preprocessing script with which you produced the texts for wang2vec? I want to exactly reproduce the GER64 embeddings (and therefore the results) for the NER tagger.

glample commented 6 years ago

Sorry, I don't remember the details of the preprocessing :/ But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
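
For the tokenization, something along these lines should be enough (I can't confirm the exact invocation we used; the -l de flag and the file names here are just an example):

perl mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < corpus.de.txt > corpus.de.tok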

sa-j commented 6 years ago

Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which might need to be different for German than for, say, English)?

./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0

Do you have them?

glample commented 6 years ago

Parameters can be the same for all languages. You should use -type 3. Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might be better.
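
So starting from your command, something like this (only -type and -size changed; the other values are just the ones from your example, I can't confirm they match exactly what we used):

./word2vec -train input_file -output embedding_file -type 3 -size 64 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0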

sa-j commented 6 years ago

Which versions of the Leipzig corpora collection have you used? Apart from "web", there are four text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30k to 1M sentences. Did you by chance use only the 1M variants from the most recent release and merge all four documents?

glample commented 6 years ago

Sorry, I don't remember. I would just use everything.
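
For merging, something like this should do, assuming the *-sentences.txt files from the Leipzig downloads are in the usual <id><tab><sentence> format (the file names are just placeholders for whichever releases you download):

cut -f2 deu_wikipedia_2010_1M-sentences.txt deu_news_2010_1M-sentences.txt deu_newscrawl_2010_1M-sentences.txt deu_mixed_2010_1M-sentences.txt > leipzig_de.txt

Then run the Moses tokenizer over the merged file before training.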