Sorry, I'm only working with the Portuguese language, so I can't help you with scripts for German!
Ok!
I'm actually looking for the original script the embeddings were trained with, on the Leipzig corpora collection and the German monolingual training data from the 2010 Machine Translation workshop (according to the paper).
Hi,
We trained our embeddings using the wang2vec model, you can find it here: https://github.com/wlin12/wang2vec
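For anyone reproducing this, a minimal sketch of fetching and building wang2vec, assuming a standard gcc/make toolchain (the repo ships a Makefile in the style of the original word2vec release):

```bash
# Clone and compile wang2vec; this should produce the word2vec
# binary used in the training command further down the thread.
git clone https://github.com/wlin12/wang2vec.git
cd wang2vec
make
```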
Thank you! And do you have the preprocessing script you used to produce the texts for wang2vec? I want to reproduce the GER64 embeddings exactly (and therefore the results) for the NER tagger.
Sorry, I don't remember the preprocessing details :/ But I think we only used the Moses tokenizer: https://github.com/moses-smt/mosesdecoder/
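In case it saves someone a lookup, a sketch of running the Moses tokenizer over raw German text; tokenizer.perl and its -l language flag are part of the mosesdecoder scripts, while the file names here are placeholders:

```bash
# Tokenize raw German text with the Moses tokenizer.
# corpus.de.raw / corpus.de.tok are hypothetical file names.
git clone https://github.com/moses-smt/mosesdecoder.git
mosesdecoder/scripts/tokenizer/tokenizer.perl -l de \
    < corpus.de.raw > corpus.de.tok
```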
Ok, thank you! And what about the parameter settings for wang2vec, including the window size (which should presumably be different for German than for, say, English)?

```
./word2vec -train input_file -output embedding_file -type 0 -size 50 -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 5 -cap 0
```

Do you have them?
Parameters can be the same for all languages. You should use -type 3. Also, -size 50 is the dimension of your embeddings, so you probably want more than that: GER64 uses 64, but higher might be better.
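Putting that together with the defaults quoted above, a sketch of a training call for reproducing GER64; -type 3 selects the structured skip-gram variant in wang2vec and -size 64 matches GER64's dimensionality, while the file names are placeholders and the remaining flags are simply the defaults from the command above, not confirmed settings:

```bash
# Structured skip-gram (-type 3), 64-dimensional to match GER64.
# All other flags copied from the default invocation; unconfirmed.
./word2vec -train corpus.de.tok -output ger64.bin -type 3 -size 64 \
    -window 5 -negative 10 -nce 0 -hs 0 -sample 1e-4 -threads 1 \
    -binary 1 -iter 5 -cap 0
```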
Which versions of the Leipzig corpora collection did you use? Excluding "web", there are 4 text sources ("wiki", "news", "newscrawl", "mixed"), each consisting of 30K to 1M sentences. Did you by chance use only the 1M variants from their most recent entry and merge all 4 documents?
Sorry I don't remember. I would just use everything.
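If it helps, a sketch of merging the Leipzig downloads into a single training file. This assumes the usual Leipzig archive layout, where each corpus contains a *-sentences.txt file with tab-separated <id><TAB><sentence> lines; the archive names below are placeholders for whichever variants you download:

```bash
# Extract each downloaded Leipzig archive (names are placeholders).
for f in deu_wiki_2010_1M.tar.gz deu_news_2010_1M.tar.gz \
         deu_newscrawl_2010_1M.tar.gz deu_mixed_2010_1M.tar.gz; do
    tar -xzf "$f"
done

# Drop the leading sentence IDs and concatenate everything into one
# raw text file, assuming the <id><TAB><sentence> format.
cat */*-sentences.txt | cut -f2- > leipzig.de.raw
```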
Hi there,
Thanks for uploading the NER Tagger! I'm trying to build on the performance of your model for German. You already provided the pre-trained embeddings in issue #44; however, I want to extend your corpus with some more text. Would it be possible for you to upload the script with which the embeddings were produced?
Thank you very much!
@glample @pvcastro @julien-c