ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files

Problem with spaces in the FORM column when training the tagger #21

Closed ftyers closed 7 years ago

ftyers commented 7 years ago
$ cat ~/source/apertium/languages/apertium-kaz/texts/puupankki/puupankki.kaz.conllu kk-ud-dev.conllu | ~/source/udpipe/src/udpipe --tokenizer epochs=5 --train kaz2.udpipe
Loading training data: done.
Training the UDPipe model.
Epoch 1, logprob: -6.6085e+04, training acc: 96.42%
Epoch 2, logprob: -1.1740e+04, training acc: 99.03%
Epoch 3, logprob: -6.4117e+03, training acc: 99.53%
Epoch 4, logprob: -4.9325e+03, training acc: 99.65%
Epoch 5, logprob: -3.9941e+03, training acc: 99.71%
Creating morphological dictionary for tagger model 1.
An error occurred during model training: Cannot parse replacement rule '  ған емес ' in statistical guesser file!

The offending sentence is:

# sent_id = akorda-random.tagged.txt:209:3751
# text = - Біздің елдеріміз арасында ешқашан ешқандай да қайшылықтар болған емес.
1       -       -       PUNCT   guio    _       8       punct   _       _
2       Біздің  біз     PRON    prn     Case=Gen|Number=Plur|Person=1|PronType=Prs      3       nmod:poss       _       _
3       елдеріміз       ел      NOUN    n       Case=Nom|Number=Plur|Number[psor]=Plur|Person[psor]=1   4       nmod:poss       _       _
4       арасында        ара     NOUN    n       Case=Loc|Number[psor]=Plur,Sing|Person[psor]=3  8       obl     _       _
5       ешқашан ешқашан ADV     adv     _       8       advmod  _       _
6-7     ешқандай да     _       _       _       _       _       _       _       _
6       ешқандай        ешқандай        DET     det     PronType=Neg    8       det     _       _
7       да      да      ADV     postadv _       8       advmod  _       _
8       қайшылықтар     қайшылық        NOUN    n       Case=Nom|Number=Plur    0       root    _       _
9       болған емес     бол     AUX     v       Mood=Ind|Number=Sing|Person=3|Polarity=Neg|Tense=Past|VerbForm=Fin      8       cop     _       SpaceAfter=No
10      .       .       PUNCT   sent    _       8       punct   _       _

The problem seems to be the space in the FORM column of token 9 (болған емес), but this should be valid CoNLL-U.

foxik commented 7 years ago

Spaces in FORM and LEMMA are allowed only in CoNLL-U v2; they were disallowed in CoNLL-U v1, so UDPipe v1.0 does not support them.

UDPipe v1.1 supporting CoNLL-U v2 will be released in ~3 weeks.

Leaving this open until a version of UDPipe supporting CoNLL-U v2 is released.
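
To see which tokens in a treebank are affected, a quick check along the following lines may help. This is only a sketch and not part of UDPipe: the function name find_spaced_tokens is made up for illustration, and it assumes the input uses the standard tab-separated, 10-column CoNLL-U layout.

```python
def find_spaced_tokens(lines):
    """Return (line_number, token_id, column_name, value) tuples for every
    FORM or LEMMA value that contains a space.

    Hypothetical helper for illustration; assumes tab-separated CoNLL-U.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Skip sentence separators and comment lines such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A token line has exactly 10 columns; ignore anything else.
        if len(cols) != 10:
            continue
        for name, value in (("FORM", cols[1]), ("LEMMA", cols[2])):
            if " " in value:
                hits.append((line_no, cols[0], name, value))
    return hits


if __name__ == "__main__":
    sample = [
        "# text = - Біздің елдеріміз арасында ешқашан ешқандай да қайшылықтар болған емес.",
        "8\tқайшылықтар\tқайшылық\tNOUN\tn\tCase=Nom|Number=Plur\t0\troot\t_\t_",
        "9\tболған емес\tбол\tAUX\tv\tPolarity=Neg\t8\tcop\t_\tSpaceAfter=No",
    ]
    for hit in find_spaced_tokens(sample):
        print(hit)
```

Running the same function over an open file object (for example, the training file passed to udpipe) would list every line that a CoNLL-U v1 reader would be expected to reject for this reason.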

foxik commented 7 years ago

UDPipe 1.1 has been released, allowing spaces in FORMs and LEMMAs.