ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0
358 stars 75 forks

Need help on training syntax #64

Closed mosynaq closed 6 years ago

mosynaq commented 6 years ago

Hi everyone. I'm trying to train a model with UDPipe, issuing the following command:

../udpipe \
--train \
'modell.udpipe' \
< 'train.conllu' \
--tokenizer 'epochs=60;early_stopping=1;allow_spaces=1' \
--tagger 'templates=tagger' \
--parser 'single_root=0;iterations=20;early_stopping=1;embedding_form_file=fa.word2vec' \
--heldout dev.conllu

Tokenization finishes without any problem, but when it comes to the tagger, it just occupies a huge amount of RAM and does nothing, no matter what I do. What is my problem? Is it the syntax of the command? Am I missing something?

Thanks!

p.s. The word2vec model was made using gensim, though I'm not sure whether that is the problem.

foxik commented 6 years ago

The syntax seems fine. For how long did the tagger do nothing? What data are you trying to train on? What is the size of the data? How many unique UPOS/XPOS/FEATS tags does it have?

mosynaq commented 6 years ago

Hey @foxik, thank you for answering! It looked idle for more than half a day! The data is about 35 MB; it is the Persian portion of HamleDT 3.

foxik commented 6 years ago

Ok -- the problem is that there are 51105 unique XPOS tags in the data, which of course makes training extremely slow. The cause is that sentence IDs are part of the XPOS tags.
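You can check this yourself before training: count the distinct values in the XPOS column (column 5 of a CoNLL-U token line). A minimal sketch, with a made-up two-sentence sample using the `|senID=...` suffix described above:

```python
from collections import Counter
import io

def count_xpos(lines):
    """Count unique XPOS values (CoNLL-U column 5) in an iterable of lines."""
    tags = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip comments and blank sentence separators
        cols = line.split("\t")
        if len(cols) != 10 or "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        tags[cols[4]] += 1
    return tags

# Hypothetical sample: each sentence's ID leaks into XPOS via |senID=...
sample = (
    "# sent_id = 1\n"
    "1\tHe\the\tPRON\tPRP|senID=1\t_\t2\tnsubj\t_\t_\n"
    "2\truns\trun\tVERB\tVBZ|senID=1\t_\t0\troot\t_\t_\n"
    "\n"
    "# sent_id = 2\n"
    "1\tGo\tgo\tVERB\tVB|senID=2\t_\t0\troot\t_\t_\n"
)
tags = count_xpos(io.StringIO(sample))
print(len(tags))  # → 3: every senID suffix creates a new "unique" tag
```

On real data you would pass `open("train.conllu", encoding="utf-8")` instead of the in-memory sample; a number in the tens of thousands is the red flag here.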

For the training to progress, you should either not train on XPOS tags at all (by passing the `use_xpostag=0` option in the `--tagger` argument), or remove the senIDs (e.g. `sed 's/|senID=[0-9]*//'`); in both cases, training progresses normally (an iteration per minute or so).
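If you would rather do the cleanup in Python than with sed, a sketch mirroring that same substitution (the `|senID=[0-9]*` pattern is taken from the sed expression above; the sample line is made up):

```python
import re

def strip_senid(line):
    """Drop the |senID=... suffix from a CoNLL-U line, mirroring
    the sed expression s/|senID=[0-9]*// from the reply above."""
    return re.sub(r"\|senID=[0-9]*", "", line)

line = "1\tHe\the\tPRON\tPRP|senID=12\t_\t2\tnsubj\t_\t_"
print(strip_senid(line))  # XPOS column becomes plain "PRP"
```

Applied line by line over the training file, this leaves only the genuine tag in the XPOS column, so the tag set shrinks to its real size.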

foxik commented 6 years ago

Closing.