ufal / udpipe

UDPipe: Trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files
Mozilla Public License 2.0

Strange annotation view #58

Closed kirdin closed 6 years ago

kirdin commented 6 years ago

Hi everyone,

I'm sorry if this problem has already been discussed or if it is just the result of a mistake of mine. However, I don't know how to fix it, and the answer may be useful to somebody else.

I tried to train my own model on the SynTagRus corpus (a full pipeline from tokenization to parsing) using a morphological dictionary. The dictionary follows the proposed format; see an excerpt:

[Screenshot: dictionary excerpt]

The model trained successfully and gives satisfactory results. However, the output annotation sometimes looks strange; see:

[Screenshot: UDPipe output annotation]

So for some words there are only five columns instead of ten, the columns are out of order, and some letters have disappeared.

As far as I can tell, this happens with words that come from the dictionary supplied when training the model. Has anyone dealt with this?

Thanks in advance!

kirdin commented 6 years ago

It now seems that this was my mistake: I omitted the POS tag for some words (the field was left empty, without even an underscore), but I still need to check this hypothesis.

foxik commented 6 years ago

Empty columns in the dictionary do not matter: I replace "_" entries with empty values anyway, and I check that there are always 5 columns in the dictionary file.

I believe you have CRLF newlines in the dictionary file, but you run UDPipe on Linux/Mac. The CR is therefore kept as part of the FEATS column, and when the line is printed, the CR moves the cursor back to the beginning of the line, so the remaining columns (HEAD, DEPREL, DEPS, MISC) overwrite the beginning of the line. However, because of the TABs, some parts of the original line remain visible. That is why the FORM starts with 0, then continues with the other letters of the form, and then "root" appears at the position where the TAB following the 0 ends, etc. Just remove the '\r' characters from your dictionary and you will be fine :-)
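For anyone hitting the same problem, here is a minimal Python sketch of the clean-up. It is not part of UDPipe; the function names and the sample entry are made up for illustration. It assumes the dictionary is a tab-separated file with 5 columns per line, as described above, and it works on raw bytes so that the file's encoding does not matter.

```python
from pathlib import Path


def strip_carriage_returns(data: bytes) -> bytes:
    """Remove every CR byte, turning CRLF line endings into plain LF."""
    return data.replace(b"\r", b"")


def bad_column_lines(data: bytes, expected: int = 5) -> list[int]:
    """Return 1-based numbers of non-empty lines that do not have
    exactly `expected` tab-separated columns."""
    return [
        number
        for number, line in enumerate(data.split(b"\n"), start=1)
        if line and len(line.split(b"\t")) != expected
    ]


def clean_dictionary(path: str) -> list[int]:
    """Rewrite the file at `path` without CR bytes (hypothetical helper).

    Returns the line numbers whose column count looks wrong, so they
    can be inspected by hand before retraining.
    """
    file = Path(path)
    cleaned = strip_carriage_returns(file.read_bytes())
    file.write_bytes(cleaned)
    return bad_column_lines(cleaned)


# Made-up sample: one well-formed entry and one short line, both with CRLF.
sample = b"form\tlemma\tNOUN\t_\tCase=Nom\r\nbad\tline\r\n"
cleaned_sample = strip_carriage_returns(sample)
```

Calling `clean_dictionary` on the dictionary file rewrites it in place, so keep a backup first. Standard tools such as `dos2unix` or `tr -d '\r'` do the same newline conversion.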

Closing the issue; if I am mistaken, feel free to reopen.

kirdin commented 6 years ago

It worked for me perfectly, thanks a lot!